Before diving into causal estimation, a critical but often overlooked step is data auditing: systematically checking which variables in your dataset might introduce bias, confounding, or estimation problems.
The audit_data() function in causaldef automates this process by testing each variable against the treatment and outcome to classify its causal role.
Traditional exploratory data analysis (EDA) tools check for:

- Missing values
- Distributional skew
- Outliers
But they miss causal validity issues:

- Which variables are confounders that MUST be adjusted for?
- Which variables are potential instruments?
- Which variables might serve as negative controls?
- Which variables are "leaky" and could bias your analysis?
Based on the manuscript’s negative control certification/bounding logic (thm:nc_bound), audit_data() systematically evaluates each variable.
We’ll demonstrate data auditing using the classic RHC dataset from Connors et al. (1996). This dataset contains 5,735 critically ill patients from 5 medical centers, with the treatment being Right Heart Catheterization (RHC) and the outcome being 30-day mortality.
This is an ideal case study because:

- Medium size: n = 5,735 patients
- Many covariates: p = 63 variables
- Real confounding concerns: RHC is not randomly assigned
- Clinical importance: used extensively in the causal inference literature
```r
library(causaldef)

# Load the RHC dataset
data(rhc)
cat("Dataset dimensions:", nrow(rhc), "patients,", ncol(rhc), "variables\n")
#> Dataset dimensions: 5735 patients, 63 variables
```

- Treatment: swang1 - whether the patient received Right Heart Catheterization (1 = Yes, 0 = No)
- Outcome: death - 30-day mortality (Yes/No)
- Covariates: demographics, disease category, vital signs, lab values, comorbidities, etc.
Let’s audit all available variables to understand which ones are most relevant for causal analysis.
```r
# Prepare data - convert death to numeric for auditing
rhc_clean <- rhc
rhc_clean$death_num <- as.numeric(rhc_clean$death == "Yes")

# Select relevant numeric and factor columns for audit
# (excluding IDs, dates, and the outcome/treatment themselves)
exclude_cols <- c("X", "ptid", "sadmdte", "dschdte", "dthdte", "lstctdte",
                  "swang1", "death", "death_num")
audit_cols <- setdiff(names(rhc_clean), exclude_cols)

# Run the audit
report <- audit_data(
  data = rhc_clean,
  treatment = "swang1",
  outcome = "death_num",
  covariates = audit_cols[1:25],  # first 25 covariates for demonstration
  alpha = 0.01,                   # stricter significance level
  verbose = FALSE
)
print(report)
#>
#> ==============================================================================
#> Data Integrity Report
#> ==============================================================================
#>
#> Treatment: swang1 | Outcome: death_num
#> Variables audited: 25
#> Issues found: 22
#>
#> -- Issues Detected --------------------------------------------------
#>
#> Variable   Type                  p-value    Recommendation
#> cat1       Confounder            8.91e-14   Include in adjustment set
#> cat2       Confounder            0.000777   Include in adjustment set
#> cardiohx   Confounder            0.001571   Include in adjustment set
#> chfhx      Outcome Predictor     2.23e-06   May improve precision if included
#> dementhx   Confounder            5.48e-09   Include in adjustment set
#> psychhx    Confounder            0.004839   Include in adjustment set
#> chrpulhx   Potential Instrument  4.32e-12   Consider using as instrumental variable
#> liverhx    Outcome Predictor     0.007240   May improve precision if included
#> malighx    Confounder            0.000217   Include in adjustment set
#> immunhx    Confounder            0.002996   Include in adjustment set
#> transhx    Confounder            9.21e-07   Include in adjustment set
#> amihx      Potential Instrument  0.005233   Consider using as instrumental variable
#> age        Outcome Predictor     < 2e-16    May improve precision if included
#> sex        Potential Instrument  0.000631   Consider using as instrumental variable
#> edu        Potential Instrument  0.000776   Consider using as instrumental variable
#> surv2md1   Confounder            2.67e-13   Include in adjustment set
#> das2d3pc   Outcome Predictor     < 2e-16    May improve precision if included
#> t3d30      Confounder            2.69e-08   Include in adjustment set
#> dth30      Confounder            9.08e-09   Include in adjustment set
#> aps1       Confounder            < 2e-16    Include in adjustment set
#> scoma1     Confounder            6.69e-05   Include in adjustment set
#> meanbp1    Confounder            3.23e-15   Include in adjustment set
#>
#> -- Recommendations --------------------------------------------------
#>
#> * CONFOUNDERS: Variables [cat1, cat2, cardiohx, dementhx, psychhx, malighx, immunhx, transhx, surv2md1, t3d30, dth30, aps1, scoma1, meanbp1] correlate with both treatment and outcome - must adjust for these
#> * INSTRUMENTS: Variables [chrpulhx, amihx, sex, edu] correlate with treatment but not outcome - consider as IVs
```

The audit classifies each variable into one of these categories:
| Classification | Meaning | Action |
|---|---|---|
| Confounder | Correlates with BOTH treatment and outcome | MUST adjust for this |
| Potential Instrument | Correlates with treatment but NOT outcome | Consider for IV analysis |
| Outcome Predictor | Correlates with outcome but NOT treatment | Include for precision |
| Safe | No significant correlations | Can include or exclude |
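This decision rule can be sketched in a few lines of base R. The `classify_var` function, its thresholds, and the deterministic toy vectors below are illustrative assumptions, not causaldef's actual implementation: each covariate is tested for marginal association with the treatment and the outcome, and classified by which tests clear the `alpha` threshold.

```r
# Hypothetical sketch of the classification rule (not causaldef's code)
classify_var <- function(x, treatment, outcome, alpha = 0.01) {
  p_trt <- cor.test(x, treatment)$p.value  # association with treatment
  p_out <- cor.test(x, outcome)$p.value    # association with outcome
  if (p_trt < alpha && p_out < alpha) "Confounder"
  else if (p_trt < alpha)             "Potential Instrument"
  else if (p_out < alpha)             "Outcome Predictor"
  else                                "Safe"
}

# Deterministic toy vectors for illustration
trt <- rep(0:1, each = 50)                  # binary treatment
u   <- rep(seq(-1, 1, length.out = 50), 2)  # pattern orthogonal to trt
y   <- u                                    # outcome driven by u, not trt

classify_var(trt + 0.5 * u, trt, y)      # tracks both -> "Confounder"
classify_var(trt, trt, y)                # treatment only -> "Potential Instrument"
classify_var(u, trt, y)                  # outcome only -> "Outcome Predictor"
classify_var(rep(c(1, -1), 50), trt, y)  # neither -> "Safe"
```

Real data is of course noisier than these toy vectors, which is why the audit pairs each classification with a p-value rather than a hard verdict.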
Let’s look more closely at the flagged confounders:
```r
# Filter to see only confounders
confounders <- report$issues[report$issues$issue_type == "Confounder", ]
if (nrow(confounders) > 0) {
  cat("Detected Confounders (must adjust for these):\n\n")
  print(confounders[, c("variable", "r_treatment", "r_outcome", "p_value")])
}
#> Detected Confounders (must adjust for these):
#>
#>    variable r_treatment   r_outcome      p_value
#> 1      cat1  0.10083106  0.09823951 8.913722e-14
#> 2      cat2  0.18615863 -0.09688936 7.772900e-04
#> 4  cardiohx  0.05671208  0.04173494 1.570847e-03
#> 6  dementhx -0.07691385  0.08093069 5.475391e-09
#> 7   psychhx -0.06735438 -0.03720101 4.838619e-03
#> 12  malighx -0.04881156  0.18325255 2.174017e-04
#> 13  immunhx  0.03918708  0.05980653 2.996256e-03
#> 14  transhx  0.08416616 -0.06475250 9.211552e-07
#> 19 surv2md1 -0.09632573 -0.34610143 2.668645e-13
#> 21    t3d30 -0.07334547 -0.46736750 2.686694e-08
#> 22    dth30  0.07579735  0.52131104 9.076865e-09
#> 23     aps1  0.23861164  0.19205247 8.950783e-49
#> 24   scoma1 -0.05262318  0.12494518 6.690392e-05
#> 25  meanbp1 -0.21278822 -0.10381577 3.233814e-15
```

These variables show significant correlation with both the treatment decision (whether a patient receives RHC) and the outcome (mortality). Failing to adjust for them would introduce confounding bias. Note, however, that t3d30 and dth30 are follow-up variables derived from the outcome itself; flagging them as "confounders" is exactly the kind of leakage that a purely statistical screen cannot catch, so review each flagged variable with domain knowledge before adding it to the adjustment set.
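To see concretely why the "must adjust" recommendation matters, here is a small self-contained simulation (illustrative only; the variable names and effect sizes are invented): a confounder `u` drives both treatment uptake and the outcome, and omitting it inflates the estimated treatment effect.

```r
# Simulated confounding: the true treatment effect is 1
set.seed(42)
n   <- 5000
u   <- rnorm(n)                        # confounder
trt <- rbinom(n, 1, plogis(1.5 * u))   # treatment uptake depends on u
y   <- trt + 2 * u + rnorm(n)          # outcome depends on trt and u

unadjusted <- coef(lm(y ~ trt))[["trt"]]      # biased upward via the u pathway
adjusted   <- coef(lm(y ~ trt + u))[["trt"]]  # recovers roughly 1 after adjustment
c(unadjusted = unadjusted, adjusted = adjusted)
```

The unadjusted coefficient is far above the true effect of 1 because treated patients have systematically higher `u`; conditioning on `u` removes that pathway, which is precisely what the adjustment-set recommendation accomplishes.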
The audit results make clinical sense:

- Severity measures (aps1, vital signs) should correlate with both: sicker patients are more likely to receive RHC and more likely to die
- Demographics (age, comorbidities) follow similar patterns
- Some variables correlate only with the outcome (predictors of mortality but not of treatment selection)
Let’s see how the audit differs across patient subgroups:
```r
# Audit cardiac patients only
cardiac_patients <- rhc_clean[rhc_clean$card == 1, ]
if (nrow(cardiac_patients) > 50) {
  report_cardiac <- audit_data(
    data = cardiac_patients,
    treatment = "swang1",
    outcome = "death_num",
    covariates = audit_cols[1:15],
    alpha = 0.01,
    verbose = FALSE
  )
  cat("=== Cardiac Patients Subgroup ===\n")
  cat("Sample size:", nrow(cardiac_patients), "\n")
  cat("Issues found:", report_cardiac$summary_stats$n_issues, "\n")
  cat("Confounders:", report_cardiac$summary_stats$n_confounders, "\n")
}
```

Once you've identified confounders through the audit, use them in your causal specification:
```r
# Get the list of detected confounders
confounder_vars <- report$issues$variable[report$issues$issue_type == "Confounder"]

# If we have confounders, build a proper causal specification
if (length(confounder_vars) > 0) {
  # Use detected confounders in the causal spec
  spec <- causal_spec(
    data = rhc_clean,
    treatment = "swang1",
    outcome = "death_num",
    covariates = confounder_vars
  )
  print(spec)
}
#> Warning: 4535 observations dropped due to missing values
#> Warning: 8 observations have extreme propensity scores
#> ✔ Created causal specification: n=1200, 14 covariate(s)
#>
#> -- Causal Specification --------------------------------------------------
#>
#> * Treatment: swang1 ( binary )
#> * Outcome: death_num ( continuous )
#> * Covariates: cat1, cat2, cardiohx, dementhx, psychhx, malighx, immunhx, transhx, surv2md1, t3d30, dth30, aps1, scoma1, meanbp1
#> * Sample size: 1200
#> * Estimand: ATE
```

```r
# Summary statistics from the audit
cat("\n=== Audit Summary ===\n")
#>
#> === Audit Summary ===
cat("Variables audited:", report$summary_stats$n_vars_audited, "\n")
#> Variables audited: 25
cat("Total issues:", report$summary_stats$n_issues, "\n")
#> Total issues: 22
cat(" - Confounders:", report$summary_stats$n_confounders, "\n")
#>  - Confounders: 14
cat(" - Potential instruments:", report$summary_stats$n_instruments, "\n")
#>  - Potential instruments: 4
```

- Run audit early: before any causal analysis, audit your data to understand variable roles
- Use domain knowledge: the audit identifies statistical associations; combine them with clinical or domain expertise
- Tune alpha: use a stricter alpha (e.g., 0.01) for larger datasets to reduce false positives, and a looser alpha (e.g., 0.10) for smaller samples to catch potential confounders
- Audit subgroups: confounding patterns may differ across patient populations
- Document decisions: record which variables you adjust for and why
- Iterate: after the initial analysis, re-audit to check whether additional variables should be included
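The alpha trade-off can be seen directly: the same p-values cross different thresholds. The values below echo the audit output above, except `hema1`, whose p-value is hypothetical and added only to create a borderline case.

```r
# p-values for a few audited variables (hema1 is hypothetical)
p_values <- c(cat1 = 8.9e-14, cardiohx = 1.6e-03, psychhx = 4.8e-03,
              edu = 7.8e-04, hema1 = 3.2e-02)

flagged_strict <- names(p_values)[p_values < 0.01]  # larger datasets
flagged_loose  <- names(p_values)[p_values < 0.10]  # smaller samples

setdiff(flagged_loose, flagged_strict)  # only flagged at the looser threshold
```

A variable like this borderline case is exactly where re-auditing at a second alpha, and then applying domain judgment, pays off.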
The audit_data() function provides an automated first pass at identifying causal structure in your dataset. It answers key questions: which variables must be adjusted for, which might serve as instruments or negative controls, and which may be leaky and should be excluded.
This systematic approach helps ensure your causal analysis is built on a solid foundation, reducing the risk of confounding bias and improving the reliability of your conclusions.
Connors, A.F. et al. (1996). The Effectiveness of Right Heart Catheterization in the Initial Care of Critically Ill Patients. JAMA, 276(11), 889-897.
Akdemir, D. (2026). Constraints on Causal Inference as Experiment Comparison. Negative control certification/bounding (thm:nc_bound).