This vignette demonstrates advanced usage of the causaldef package, focusing on diagnostics for unobserved confounding and on sensitivity analysis. We use data sets shipped with the package to showcase these features.
library(causaldef)
library(ggplot2)
# DAG Helper
plot_dag <- function(coords, edges, title = NULL) {
  # Attach node coordinates to the edge list: first the start point ("from"),
  # then the end point ("to") of each arrow
  edges_df <- merge(edges, coords, by.x = "from", by.y = "name")
  colnames(edges_df)[c(3, 4)] <- c("x_start", "y_start")
  edges_df <- merge(edges_df, coords, by.x = "to", by.y = "name")
  colnames(edges_df)[c(5, 6)] <- c("x_end", "y_end")

  ggplot2::ggplot(coords, ggplot2::aes(x = x, y = y)) +
    ggplot2::geom_segment(
      data = edges_df,
      ggplot2::aes(x = x_start, y = y_start, xend = x_end, yend = y_end),
      arrow = ggplot2::arrow(length = ggplot2::unit(0.3, "cm"), type = "closed"),
      color = "gray40", linewidth = 1, alpha = 0.8
    ) +
    ggplot2::geom_point(size = 14, color = "white", fill = "#E7B800",
                        shape = 21, stroke = 1.5) +
    ggplot2::geom_text(ggplot2::aes(label = name), fontface = "bold",
                       size = 3, color = "black") +
    ggplot2::ggtitle(title) +
    ggplot2::theme_void(base_size = 14) +
    ggplot2::theme(plot.title = ggplot2::element_text(
      hjust = 0.5, face = "bold", margin = ggplot2::margin(b = 10)
    )) +
    ggplot2::coord_fixed()
}

Unobserved confounding is a major threat to causal inference. The causaldef package implements “Negative Control Outcomes” to detect residual confounding. A negative control outcome is a variable that is affected by the same unobserved confounders as the treatment and outcome, but is known not to be causally affected by the treatment.
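The logic is easy to verify in a toy simulation (base R only; `u`, `treat`, and `nc` are hypothetical variables, not package data). Because the negative control inherits the confounder but receives no treatment effect, any treatment association it shows is pure bias:

```r
# Toy illustration: a negative control shares the unobserved confounder U
# with the treatment but receives no causal effect from it.
set.seed(42)
n <- 2000
u     <- rnorm(n)                        # unobserved confounder
treat <- rbinom(n, 1, plogis(1.5 * u))   # U -> Treatment
nc    <- 2 * u + rnorm(n)                # U -> negative control (no treat -> nc arrow)

# Naive regression: the negative control appears to "respond" to treatment
naive    <- unname(coef(lm(nc ~ treat))["treat"])
# Adjusting for U (possible only in a simulation) removes the spurious effect
adjusted <- unname(coef(lm(nc ~ treat + u))["treat"])

round(c(naive = naive, adjusted = adjusted), 2)
```

In real data \(U\) is unobserved, which is exactly why a diagnostic must instead work with the covariate-adjusted (weighted) association.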
We will use the gene_perturbation dataset provided in the package. This dataset comes from a CRISPR knockout experiment.
- knockout_status: Whether a gene was knocked out (Treatment) or not (Control).
- target_expression: Expression level of the target gene.
- Technical covariates: library_size and batch.
- housekeeping_gene: A housekeeping gene that should not be affected by the specific knockout but shares technical confounders.

data("gene_perturbation")
head(gene_perturbation)
#> sample_id batch library_size knockout_status target_expression
#> 1 S1 3 948877 Knockout 8.643
#> 2 S2 4 984900 Control 11.545
#> 3 S3 4 868332 Control 12.325
#> 4 S4 2 1052105 Control 12.941
#> 5 S5 2 880271 Control 8.289
#> 6 S6 2 1001246 Knockout 9.400
#> housekeeping_gene
#> 1 6.594
#> 2 9.424
#> 3 8.568
#> 4 9.224
#> 5 6.380
#> 6 6.706

We first specify the causal model, explicitly identifying the negative control outcome.
spec_nc <- causal_spec(
  data = gene_perturbation,
  treatment = "knockout_status",
  outcome = "target_expression",
  covariates = c("library_size", "batch"),
  negative_control = "housekeeping_gene"
)
#> ✔ Created causal specification: n=500, 2 covariate(s)
# Visualize the Structural Assumption
coords <- data.frame(
  name = c("Unobserved", "Treatment", "Target", "Housekp"),
  x = c(0, -1.5, 1.5, 0),
  y = c(1, 0, 0, -1)
)
edges <- data.frame(
  from = c("Unobserved", "Unobserved", "Unobserved", "Treatment"),
  to = c("Treatment", "Target", "Housekp", "Target")
)
plot_dag(coords, edges, title = "Why Negative Controls Work:\nShared Unobserved Confounding")

Now we run the diagnostic using nc_diagnostic(). We test whether the Inverse Probability of Treatment Weighting (IPTW) method successfully removes confounding bias.
set.seed(123)
# Check if IPTW approach effectively removes confounding using the negative control
nc_res <- nc_diagnostic(spec_nc, method = "iptw", n_boot = 100)
#> ℹ Using kappa = 1 (conservative). Consider domain-specific estimation or sensitivity analysis via kappa_range.
#> ✔ No evidence against causal assumptions (p = 0.14851 )
print(nc_res)
#>
#> -- Negative Control Diagnostic ----------------------------------------
#>
#> * screening statistic (weighted corr): 0.0821
#> * delta_NC (association proxy): 0.0821
#> * delta bound (under kappa alignment): 0.0821 (kappa = 1 )
#> * screening p-value: 0.14851
#> * screening method: weighted_permutation_correlation
#>
#> RESULT: NOT REJECTED. This is a screening result, not proof that confounding is absent.
#> NOTE: Your effect estimate must exceed the Noise Floor (delta_bound) to be meaningful.

Interpreting Negative Control Diagnostics:
This diagnostic tests a crucial assumption: “Did my adjustment remove ALL confounding?”
Key Results to Check:
- p > 0.05: [PASS] – No evidence that treatment affects the negative control after adjustment.
- p < 0.05: [FAIL] – A residual association between treatment and the negative control remains, indicating incomplete adjustment.
Scientific Implications:
The negative control is powerful because it shares the unobserved confounders with the outcome while, by design, receiving no causal effect from the treatment; any residual treatment association must therefore reflect bias rather than a real effect.
If falsified:
If the diagnostic fails (as it might with unobserved batch effects not fully captured), it indicates that your current adjustment set is insufficient to block all back-door paths to the negative control (and likely to the outcome as well).
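That failure mode can be made concrete with a minimal base-R sketch of the screening technique: a weighted correlation between treatment and the negative control under IPTW, with a permutation p-value. The toy data and variable names are hypothetical, and this illustrates the general idea rather than causaldef's internal implementation; here the propensity model deliberately misses a confounder:

```r
set.seed(123)
n <- 500
x     <- rnorm(n)                                  # observed covariate
u     <- rnorm(n)                                  # unobserved confounder (missed below)
treat <- rbinom(n, 1, plogis(0.8 * x + 0.8 * u))
nc    <- u + rnorm(n)                              # negative control: driven by U only

# Propensity model uses only the observed covariate, so U stays unadjusted
ps <- fitted(glm(treat ~ x, family = binomial))
w  <- ifelse(treat == 1, 1 / ps, 1 / (1 - ps))     # standard IPTW weights

# Weighted correlation via stats::cov.wt (weights are normalized internally)
wcor <- function(a, b, w) cov.wt(cbind(a, b), wt = w, cor = TRUE)$cor[1, 2]

obs <- wcor(treat, nc, w)
# Permutation null: shuffle the negative control, keeping treatment and weights fixed
perm  <- replicate(999, wcor(treat, sample(nc), w))
p_val <- (1 + sum(abs(perm) >= abs(obs))) / (length(perm) + 1)

c(statistic = round(obs, 3), p_value = round(p_val, 3))
# With U unadjusted, the screening should reject: a nonzero statistic and a small p
```

The permutation scheme here is a simple sketch; fancier schemes would re-permute treatment and refit the propensity model each time.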
Recommended Actions:
Use confounding_frontier() to determine how strong this confounding would need to be to invalidate your results.

When we suspect unobserved confounding, we can quantify how sensitive our causal conclusions are to potential unmeasured confounders \(U\). The confounding_frontier() function maps the “deficiency” (information loss) as a function of the strength of the confounding paths \(U \to A\) (\(\alpha\)) and \(U \to Y\) (\(\gamma\)).
We will demonstrate this using the hct_outcomes dataset (Hematopoietic Cell Transplantation).
data("hct_outcomes")
# Convert treatment to numeric for the Gaussian model approximation
hct_outcomes$conditioning_numeric <- as.numeric(hct_outcomes$conditioning_intensity) - 1
# Simplified specification focusing on conditioning intensity
spec_sens <- causal_spec(
  data = hct_outcomes,
  treatment = "conditioning_numeric",
  outcome = "time_to_event",  # simplified for this example
  covariates = c("age", "disease_status")
)
#> ✔ Created causal specification: n=800, 2 covariate(s)

We compute the confounding frontier. This does not require observing \(U\); rather, it explores the space of hypothetical confounding strengths.
frontier <- confounding_frontier(
  spec_sens,
  alpha_range = c(-1, 1),
  gamma_range = c(-1, 1),
  grid_size = 40
)
#> ℹ Computing benchmarks for observed covariates...
#> ✔ Computed confounding frontier: 40x40 grid
# The result contains a grid we can visualize
head(frontier$grid)
#> alpha gamma delta
#> 1 -1.0000000 -1 0.01464513
#> 2 -0.9487179 -1 0.01529749
#> 3 -0.8974359 -1 0.01600208
#> 4 -0.8461538 -1 0.01676353
#> 5 -0.7948718 -1 0.01758630
#> 6 -0.7435897 -1 0.01847429

We can visualize the regions where deficiency is high (problematic) versus low.
ggplot(frontier$grid, aes(x = alpha, y = gamma, fill = delta)) +
  geom_tile() +
  scale_fill_viridis_c(option = "magma", direction = -1) +
  labs(
    title = "Confounding Frontier",
    x = "Confounding Strength U -> A (alpha)",
    y = "Confounding Strength U -> Y (gamma)",
    fill = "Deficiency\n(Delta)"
  ) +
  theme_minimal() +
  geom_contour(aes(z = delta), color = "white", breaks = c(0.01, 0.05, 0.1))

Interpreting the Confounding Frontier:
This visualization maps the “danger zones” of unobserved confounding:
The Axes:
The x-axis is \(\alpha\), the strength of the confounding path \(U \to A\); the y-axis is \(\gamma\), the strength of \(U \to Y\).
Reading the Plot:
- Dark regions (low \(\delta\) \(\approx\) 0): hypothetical confounding of this strength would cost little identifying information; conclusions are robust here.
- Bright regions (high \(\delta\) > 0.15): confounding of this strength would substantially degrade identification.
- Contour lines (white lines at \(\delta\) = 0.01, 0.05, 0.1): boundaries separating progressively more problematic regions; the innermost contour delimits the “safe” zone.
Scientific Decision-Making:
Use domain knowledge to assess how plausible each \((\alpha, \gamma)\) combination is for your system, and check whether the plausible region stays inside the low-deficiency zone.
Example Interpretation:
“Our sensitivity analysis shows that if unobserved confounding U -> A and U -> Y are both weak (|\(\alpha\)|, |\(\gamma\)| < 0.2), the deficiency remains below 0.05, suggesting robust identification. However, if either confounding strength exceeds 0.5, identification breaks down (\(\delta\) > 0.1). Given the biological plausibility of weak confounding in this system, we conclude our estimates are reasonably robust.”
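The shape of such a frontier is easy to mimic in base R. The sketch below uses a scaled \(|\alpha \cdot \gamma|\) as a stand-in deficiency surface (the classic omitted-variable-bias scaling for standardized variables); this is a hypothetical surface for illustration, not causaldef's model-based \(\delta\):

```r
# Toy frontier grid over hypothetical confounding strengths
toy <- expand.grid(alpha = seq(-1, 1, length.out = 40),
                   gamma = seq(-1, 1, length.out = 40))
# Stand-in deficiency: bias scales with the product of the two path strengths
toy$delta <- 0.25 * abs(toy$alpha * toy$gamma)

# Fraction of the explored (alpha, gamma) space inside the "safe" zone delta < 0.05
safe_frac <- mean(toy$delta < 0.05)
round(safe_frac, 2)
```

On this toy surface the safe zone hugs the axes: identification survives as long as at least one of the two confounding arms stays weak, which mirrors how the heatmap above is read.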
The confounding frontier shows the boundary: if we are outside the “safe” zone defined by the frontier, we cannot claim causal identification without further assumptions (see Akdemir 2026, DOI: 10.5281/zenodo.18367347 for theoretical details).
The causaldef package provides tools not just for estimation, but also for falsification and sensitivity analysis.