Advanced Causal Analysis

Introduction

This vignette demonstrates advanced usage of the causaldef package, focusing on diagnostics for unobserved confounding and sensitivity analysis. We use real datasets included with the package to showcase these features.

library(causaldef)
library(ggplot2)

# DAG Helper
plot_dag <- function(coords, edges, title = NULL) {
  edges_df <- merge(edges, coords, by.x = "from", by.y = "name")
  colnames(edges_df)[c(3,4)] <- c("x_start", "y_start")
  edges_df <- merge(edges_df, coords, by.x = "to", by.y = "name")
  colnames(edges_df)[c(5,6)] <- c("x_end", "y_end")
  
  ggplot2::ggplot(coords, ggplot2::aes(x = x, y = y)) +
    ggplot2::geom_segment(data = edges_df, ggplot2::aes(x = x_start, y = y_start, xend = x_end, yend = y_end),
                 arrow = ggplot2::arrow(length = ggplot2::unit(0.3, "cm"), type = "closed"), 
                 color = "gray40", linewidth = 1, alpha = 0.8) + # linewidth replaces size in ggplot2 >= 3.4
    ggplot2::geom_point(size = 14, color = "white", fill = "#E7B800", shape = 21, stroke = 1.5) +
    ggplot2::geom_text(ggplot2::aes(label = name), fontface = "bold", size = 3, color = "black") +
    ggplot2::ggtitle(title) + 
    ggplot2::theme_void(base_size = 14) +
    ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, face = "bold", margin = ggplot2::margin(b = 10))) +
    ggplot2::coord_fixed()
}
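
For example, the helper can draw a simple confounded treatment-outcome DAG. The node names and coordinates below are illustrative, not from any package dataset:

```r
# Illustrative DAG: confounder C affects both treatment A and outcome Y
coords <- data.frame(
  name = c("A", "Y", "C"),
  x    = c(0, 2, 1),
  y    = c(0, 0, 1)
)
edges <- data.frame(
  from = c("A", "C", "C"),
  to   = c("Y", "A", "Y")
)
plot_dag(coords, edges, title = "Confounded treatment-outcome relationship")
```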

Unobserved Confounding and Negative Controls

Unobserved confounding is a major threat to causal inference. The causaldef package implements “Negative Control Outcomes” to detect residual confounding. A negative control outcome is a variable that is affected by unobserved confounders but is known not to be causally affected by the treatment.

We will use the gene_perturbation dataset provided in the package. This dataset comes from a CRISPR knockout experiment.

Loading the Data
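
A minimal sketch of loading and inspecting the dataset; we only load and summarize here, since the exact column schema is documented in `?gene_perturbation`:

```r
# Load the packaged CRISPR knockout dataset
data("gene_perturbation", package = "causaldef")

# Inspect the available variables before specifying the causal model
str(gene_perturbation)
summary(gene_perturbation)
```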

Running the Negative Control Diagnostic

We first specify the causal model, explicitly identifying the negative control outcome.
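
A sketch of the specification, following the `causal_spec()` pattern used later in this vignette. The column names and the `negative_control` argument are assumptions for illustration; adjust them to the actual `gene_perturbation` schema:

```r
# Hypothetical column names (assumed, not the dataset's actual schema)
spec_nc <- causal_spec(
  data = gene_perturbation,
  treatment = "knockout",                       # CRISPR guide assignment (assumed)
  outcome = "target_expression",                # expression of the targeted gene (assumed)
  covariates = c("batch", "cell_cycle"),        # measured technical confounders (assumed)
  negative_control = "housekeeping_expression"  # assumed argument naming the NC outcome
)
```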

Now we run the diagnostic using nc_diagnostic(). We test if the Inverse Probability of Treatment Weighting (IPTW) method successfully removes confounding bias.
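
A sketch of the call; the `spec_nc` object and the `method` argument value are assumptions based on the surrounding text:

```r
# Test whether IPTW adjustment removes confounding bias
nc_result <- nc_diagnostic(spec_nc, method = "iptw")

# Key fields to inspect (interpreted below)
nc_result$p_value      # > 0.05 suggests no detectable residual confounding
nc_result$falsified    # TRUE if the no-unobserved-confounding assumption fails
nc_result$delta_nc     # residual association with the negative control
nc_result$delta_bound  # worst-case bound on the true deficiency
```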

Interpreting Negative Control Diagnostics:

This diagnostic tests a crucial assumption: “Did my adjustment remove ALL confounding?”

Key Results to Check:

  1. p_value:
    • \(>\) 0.05: [PASS] – No evidence that treatment affects the negative control after adjustment

    • \(\leq\) 0.05: [FAIL] – Treatment still associated with the negative control -> residual confounding detected
  2. falsified:
    • FALSE: Assumption not falsified -> Adjustment appears adequate
    • TRUE: Assumption falsified -> Likely unobserved confounding or model misspecification
  3. delta_nc:
    • Magnitude of residual association with negative control
    • Smaller values indicate better performance
    • Should be close to zero if adjustment is successful
  4. delta_bound:
    • Upper bound on the true deficiency: \(\delta \leq \kappa \times \delta_{NC}\)
    • Provides a worst-case estimate even when you can’t directly measure \(\delta\)

Scientific Implications:

The negative control is powerful because, by construction, the treatment has no causal effect on it: any residual association that survives adjustment can only reflect unobserved confounding or model misspecification. This turns an otherwise untestable assumption into an empirically checkable one.

If the assumption is falsified, the next section describes how to respond.

Addressing the Falsification

If the diagnostic fails (as it might with unobserved batch effects not fully captured), it indicates that your current adjustment set is insufficient to block all back-door paths to the negative control (and likely the outcome).

Recommended Actions:

  1. Enrich the Confounder Set: If possible, measure and include additional potential confounders (e.g., more granular technical variables).
  2. Sensitivity Analysis: Since you have evidence of unobserved confounding, use confounding_frontier() to determine how strong this confounding would need to be to invalidate your results.
  3. Alternative Designs: Consider methods that rely on different assumptions, such as Instrumental Variables (if available) or Difference-in-Differences.
  4. Transparent Reporting: If you cannot fix the issue, you must report the failure. “We detected residual confounding (\(\delta_{NC} = ...\)) using negative controls. Consequently, the interpretability of our causal estimates is limited.”

Sensitivity Analysis: Confounding Frontiers

When we suspect unobserved confounding, we can quantify how sensitive our causal conclusions are to potential unmeasured confounders \(U\). The confounding_frontier() function maps the “deficiency” (information loss) as a function of the strength of confounding paths \(U \to A\) (\(\alpha\)) and \(U \to Y\) (\(\gamma\)).

We will demonstrate this using the hct_outcomes dataset (Hematopoietic Cell Transplantation).

data("hct_outcomes")

# Convert treatment to numeric for the Gaussian model approximation
hct_outcomes$conditioning_numeric <- as.numeric(hct_outcomes$conditioning_intensity) - 1

# Simplified specification focusing on conditioning intensity
spec_sens <- causal_spec(
  data = hct_outcomes,
  treatment = "conditioning_numeric",
  outcome = "time_to_event", # Simplified for this example
  covariates = c("age", "disease_status")
)
#> ✔ Created causal specification: n=800, 2 covariate(s)

We compute the confounding frontier. This does not require observing \(U\), but rather exploring the space of hypothetical confounding strengths.

frontier <- confounding_frontier(
  spec_sens,
  alpha_range = c(-1, 1),
  gamma_range = c(-1, 1),
  grid_size = 40
)
#> ℹ Computing benchmarks for observed covariates...
#> ✔ Computed confounding frontier: 40x40 grid

# The result contains a grid we can visualize
head(frontier$grid)
#>        alpha gamma      delta
#> 1 -1.0000000    -1 0.01464513
#> 2 -0.9487179    -1 0.01529749
#> 3 -0.8974359    -1 0.01600208
#> 4 -0.8461538    -1 0.01676353
#> 5 -0.7948718    -1 0.01758630
#> 6 -0.7435897    -1 0.01847429

Visualizing the Frontier

We can visualize the regions where deficiency is high (problematic) vs low.
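
One way to draw this from `frontier$grid`, whose columns (`alpha`, `gamma`, `delta`) were shown above. The plotting choices (viridis fill, contour breaks at 0.01, 0.05, 0.1) are illustrative:

```r
ggplot2::ggplot(frontier$grid, ggplot2::aes(x = alpha, y = gamma)) +
  ggplot2::geom_raster(ggplot2::aes(fill = delta)) +
  # White contour lines mark the risk-level boundaries discussed below
  ggplot2::geom_contour(ggplot2::aes(z = delta),
                        breaks = c(0.01, 0.05, 0.1), color = "white") +
  ggplot2::scale_fill_viridis_c(name = expression(delta)) +
  ggplot2::labs(
    x = expression(alpha ~ "(strength of" ~ U %->% A * ")"),
    y = expression(gamma ~ "(strength of" ~ U %->% Y * ")"),
    title = "Confounding frontier"
  ) +
  ggplot2::theme_minimal()
```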

Interpreting the Confounding Frontier:

This visualization maps the “danger zones” of unobserved confounding:

The Axes: the horizontal axis is \(\alpha\), the strength of the confounding path \(U \to A\); the vertical axis is \(\gamma\), the strength of \(U \to Y\). Color encodes the resulting deficiency \(\delta\).

Reading the Plot:

  1. Dark regions (\(\delta \approx 0\)):

    • “Safe zones” where causal identification is robust
    • Small amounts of unobserved confounding don’t destroy identification
    • You can proceed with confidence in these scenarios
  2. Bright regions (\(\delta > 0.15\)):

    • “Danger zones” where unobserved confounding severely compromises identification
    • Large distributional distance from the ideal experiment (TV distance > 0.15)
    • Causal conclusions are unreliable without additional assumptions
  3. Contour lines (white lines at \(\delta = 0.01, 0.05, 0.1\)):

    • Boundaries between different risk levels
    • The “frontier” where identification breaks down

Scientific Decision-Making:

Use domain knowledge to assess the plausibility of different confounding strengths (\(\alpha\), \(\gamma\)) in your scientific context.

Example Interpretation:
“Our sensitivity analysis shows that if the unobserved confounding paths \(U \to A\) and \(U \to Y\) are both weak (\(|\alpha|, |\gamma| < 0.2\)), the deficiency remains below 0.05, suggesting robust identification. However, if either confounding strength exceeds 0.5, identification breaks down (\(\delta > 0.1\)). Given the biological plausibility of weak confounding in this system, we conclude our estimates are reasonably robust.”

The confounding frontier shows the boundary: if we are outside the “safe” zone defined by the frontier, we cannot claim causal identification without further assumptions (see Akdemir 2026, DOI: 10.5281/zenodo.18367347 for theoretical details).

Conclusion

The causaldef package provides tools not just for estimation, but for falsification and sensitivity analysis.

  1. Negative Controls allow us to empirically test our “no unobserved confounding” assumption using auxiliary data.
  2. Confounding Frontiers allow us to theoretically explore how robust our study is to potential unobserved confounders.