Policy Learning with Decision-Theoretic Bounds

Introduction

This vignette demonstrates how to use causaldef for safe policy learning — making treatment decisions with quantified guarantees even when unobserved confounding exists.

The key insight is the policy regret transfer bound:

\[\text{Regret}_{do}(\pi) \leq \text{Regret}_{obs}(\pi) + M \cdot \delta\]

where: - $\text{Regret}_{do}(\pi)$ = regret under the true interventional distribution - $\text{Regret}_{obs}(\pi)$ = regret observed in data - $M$ = utility range (max - min possible outcomes) - $\delta$ = Le Cam deficiency (quantifies confounding)

The Safety Floor Concept

policy_regret_bound() reports two complementary quantities:

Transfer penalty $M\cdot\delta$: additive worst-case regret inflation term, and
Minimax safety floor $(M/2)\cdot\delta$: irreducible worst-case regret when $\delta>0$.

If $\delta>0$, no algorithm can guarantee zero worst-case regret without stronger assumptions or randomized data.

Implications for AI/ML Safety

No algorithm can beat the safety floor: Even infinite data doesn’t help if confounding exists
Deficiency is the price of observational learning: To eliminate the safety floor, you need randomized experiments
Confidence intervals aren’t enough: Standard ML uncertainty quantification doesn’t capture confounding bias

Practical Workflow

Step 1: Define the Causal Problem

library(causaldef)
set.seed(123)

# Simulate a treatment decision problem
n <- 1000

# Covariates
age <- runif(n, 30, 70)
severity <- rbeta(n, 2, 5) * 10

# Confounded treatment assignment (sicker patients get treatment)
U <- rnorm(n)  # Unmeasured health status
ps_true <- plogis(-1 + 0.02 * age + 0.1 * severity + 0.5 * U)
A <- rbinom(n, 1, ps_true)

# Outcome: recovery score (0-100)
# True effect is heterogeneous
tau_true <- 10 + 0.2 * (age - 50)  # Older patients benefit more
Y <- 50 + tau_true * A - 0.3 * severity + 5 * U + rnorm(n, sd = 5)

# Clip to valid range
Y <- pmin(100, pmax(0, Y))

df <- data.frame(
  age = age,
  severity = severity,
  A = A,
  Y = Y
)

Step 2: Estimate Deficiency

spec <- causal_spec(
  data = df,
  treatment = "A",
  outcome = "Y",
  covariates = c("age", "severity")
)
#> ✔ Created causal specification: n=1000, 2 covariate(s)

# Estimate deficiency with multiple methods
def_results <- estimate_deficiency(
  spec,
  methods = c("unadjusted", "iptw", "aipw"),
  n_boot = 100
)
#> ℹ Estimating deficiency: unadjusted
#> ℹ Estimating deficiency: iptw
#> ℹ Estimating deficiency: aipw

print(def_results)
#> 
#> -- Deficiency Proxy Estimates (PS-TV) ------
#> 
#>      Method Delta     SE               CI           Quality
#>  unadjusted 0.054 0.0139 [0.0454, 0.0965]  Caution (Yellow)
#>        iptw 0.011 0.0042  [0.0075, 0.023] Excellent (Green)
#>        aipw 0.011 0.0042 [0.0073, 0.0223] Excellent (Green)
#> Note: delta is a propensity-score TV proxy (overlap/balance diagnostic).
#> 
#> Best method: iptw (delta = 0.011 )

Step 3: Visualize Deficiency

plot(def_results, type = "bar")

Step 4: Compute Policy Regret Bounds

# Define utility range (outcome is 0-100)
utility_range <- c(0, 100)

# Suppose our policy achieves 5% observed regret
obs_regret <- 5

# Compute bound
bounds <- policy_regret_bound(
  deficiency = def_results,
  utility_range = utility_range,
  obs_regret = obs_regret
)
#> Warning: Multiple fitted methods are available but `method` was not specified.
#> ℹ Using the smallest available delta across methods is optimistic after model
#>   selection.
#> ℹ For a pre-specified decision bound, call `policy_regret_bound()` with `method
#>   = '<chosen method>'`.
#> ℹ Transfer penalty: 1.0973 (delta = 0.011)

print(bounds)
#> 
#> -- Policy Regret Bounds -------------------------------------------------
#> 
#> * Deficiency delta: 0.011 
#> * Delta mode: point 
#> * Delta method: iptw 
#> * Delta selection: minimum across fitted methods 
#> * Utility range: [0, 100]
#> * Transfer penalty: 1.0973 (additive regret upper bound)
#> * Minimax floor: 0.5486 (worst-case lower bound)
#> 
#> * Observed regret: 5 
#> * Interventional bound: 6.0973 
#> 
#> Note: this is a plug-in bound using a deficiency proxy rather than an identified exact deficiency.
#> Note: minimum-across-methods selection is optimistic after model selection.
#> 
#> Interpretation: Transfer penalty is 1.1 % of utility range given delta

Step 5: Visualize the Safety Floor

# Show how safety floor varies with deficiency
plot(bounds, type = "safety_curve")

Interpreting the Results

The Safety Floor Report

cat("=== Policy Deployment Decision ===\n\n")
#> === Policy Deployment Decision ===

delta_best <- min(def_results$estimates)
M <- diff(utility_range)
transfer_penalty <- M * delta_best
minimax_floor <- 0.5 * M * delta_best

cat(sprintf("Best achievable deficiency: %.3f\n", delta_best))
#> Best achievable deficiency: 0.011
cat(sprintf("Transfer penalty (M*delta): %.1f points\n", transfer_penalty))
#> Transfer penalty (M*delta): 1.1 points
cat(sprintf("Minimax safety floor (M/2*delta): %.1f points\n", minimax_floor))
#> Minimax safety floor (M/2*delta): 0.5 points
cat(sprintf("Observed regret: %.1f points\n", obs_regret))
#> Observed regret: 5.0 points

if (!is.null(bounds$regret_bound)) {
  cat(sprintf("Worst-case regret: %.1f points\n", bounds$regret_bound))
}
#> Worst-case regret: 6.1 points

cat("\n")

# Decision thresholds
if (delta_best < 0.05) {
  cat("✓ EXCELLENT: Deficiency < 5%. High confidence in policy.\n")
} else if (delta_best < 0.10) {
  cat("⚠ MODERATE: Deficiency 5-10%. Proceed with monitoring.\n")
} else {
  cat("✗ CAUTION: Deficiency > 10%. Consider RCT before deployment.\n")
}
#> ✓ EXCELLENT: Deficiency < 5%. High confidence in policy.

Sensitivity Analysis with Confounding Frontiers

What if there’s additional unmeasured confounding?

# Map the confounding frontier
frontier <- confounding_frontier(
  spec,
  alpha_range = c(-2, 2),
  gamma_range = c(-2, 2),
  grid_size = 30
)
#> ℹ Computing benchmarks for observed covariates...
#> ✔ Computed confounding frontier: 30x30 grid

# Find the safe region
safe_region <- subset(frontier$grid, delta < 0.1)
cat(sprintf(
  "Safe operating region covers %.1f%% of confounding space\n",
  100 * nrow(safe_region) / nrow(frontier$grid)
))
#> Safe operating region covers 100.0% of confounding space

Visualize the Frontier

plot(frontier, type = "heatmap", threshold = c(0.05, 0.1, 0.2))

Policy Learning with grf (Optional)

If you have the grf package installed, you can use causal forests for heterogeneous treatment effect estimation with deficiency bounds:

# Estimate deficiency using causal forests
if (requireNamespace("grf", quietly = TRUE)) {
  def_grf <- estimate_deficiency(
    spec,
    methods = c("aipw", "grf"),
    n_boot = 50
  )
  
  print(def_grf)
  
  # Get individual treatment effect predictions
  kernel_grf <- def_grf$kernel$grf
  if (!is.null(kernel_grf$tau_hat)) {
    cat("\nHeterogeneous Effects Detected:\n")
    cat(sprintf("ATE from forest: %.2f\n", kernel_grf$ate))
    cat(sprintf("CATE range: [%.2f, %.2f]\n", 
                min(kernel_grf$tau_hat), 
                max(kernel_grf$tau_hat)))
  }
}

Best Practices for Safe Deployment

Pre-Deployment Checklist

Check	Threshold	Action if Failed
$\delta < 0.05$	Excellent	Deploy with confidence
$\delta \in [0.05, 0.10]$	Moderate	Deploy with active monitoring
$\delta > 0.10$	Concerning	Consider pilot RCT
NC diagnostic falsified	Any	Do not deploy without more data

Monitoring in Production

# Example: Re-estimate deficiency on new data
new_data <- ...  # Your production data

new_spec <- causal_spec(
  new_data,
  treatment = "A",
  outcome = "Y",
  covariates = c("age", "severity")
)

# Quick check
def_monitor <- estimate_deficiency(
  new_spec,
  methods = "iptw",
  n_boot = 50
)

# Alert if deficiency increased
if (def_monitor$estimates["iptw"] > 1.5 * delta_best) {
  warning("Distribution shift detected! Deficiency increased.")
}

Mathematical Details

Policy Regret Transfer (Manuscript)

For any policy $\pi$ and bounded utility function $u \in [0, M]$:

\[\mathbb{E}_{P^{do}}\left[\max_a u(a, X) - u(\pi(X), X)\right] \leq \mathbb{E}_{P^{obs}}\left[\max_a u(a, X) - u(\pi(X), X)\right] + M\delta\]

Proof sketch: The deficiency $\delta$ bounds the total variation distance between the (simulated) observational and target interventional laws. Since utility is bounded by $M$, the maximum discrepancy in expected utility is at most $M$ times the total variation gap.

Why This Matters

Traditional ML focuses on: - Prediction error: How well does my model predict $Y$? - Generalization: Does performance hold on new data?

But for causal policy learning, we need: - Interventional validity: Does my policy work when deployed? - Confounding robustness: How much could unmeasured bias hurt me?

The safety floor answers these questions with formal guarantees.

Summary

Concept	Definition	Function
Transfer penalty	$M\delta$ — additive regret inflation term	`$transfer_penalty`
Minimax safety floor	$(M/2)\delta$ — irreducible worst-case regret	`$minimax_floor`
Regret bound	observed regret + transfer penalty	`$regret_bound`
Deficiency	Information gap between obs and do	`estimate_deficiency()`
Confounding Frontier	Deficiency as function of $(\alpha, \gamma)$	`confounding_frontier()`

Use these tools to make safe, accountable decisions from observational data.

References

Akdemir, D. (2026). Constraints on Causal Inference as Experiment Comparison. DOI: 10.5281/zenodo.18367347. See thm:policy_regret (Policy Regret Transfer) and thm:safety_floor (Minimax Safety Floor).
Athey, S., & Wager, S. (2021). Policy learning with observational data. Econometrica, 89(1), 133-161.
Kallus, N. (2020). Confounding-robust policy evaluation in infinite-horizon reinforcement learning. NeurIPS.