Policy Learning with Decision-Theoretic Bounds

Deniz Akdemir

2026-03-26

Introduction

This vignette demonstrates how to use causaldef for safe policy learning — making treatment decisions with quantified guarantees even when unobserved confounding exists.

The key insight is the policy regret transfer bound:

\[\text{Regret}_{do}(\pi) \leq \text{Regret}_{obs}(\pi) + M \cdot \delta\]

where:

  - \(\text{Regret}_{do}(\pi)\) = regret under the true interventional distribution
  - \(\text{Regret}_{obs}(\pi)\) = regret observed in the data
  - \(M\) = utility range (difference between the maximum and minimum possible outcomes)
  - \(\delta\) = Le Cam deficiency (quantifies confounding)

The Safety Floor Concept

policy_regret_bound() reports two complementary quantities:

  - Transfer penalty \(M\delta\): the additive amount by which observed regret can understate interventional regret.
  - Minimax safety floor \((M/2)\delta\): the irreducible worst-case regret that no policy learned from observational data alone can guarantee to beat.

If \(\delta > 0\), no algorithm can guarantee zero worst-case regret without stronger assumptions or randomized data.
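To make the quantities concrete, here is the arithmetic in base R for a hypothetical deployment with utility range \(M = 10\) and estimated deficiency \(\delta = 0.08\) (illustrative numbers, not package output):

```r
# Illustrative numbers: utility range and estimated Le Cam deficiency
M     <- 10      # outcomes span a range of width 10
delta <- 0.08    # deficiency estimated from data

# Transfer penalty: additive inflation of observed regret
transfer_penalty <- M * delta        # 0.8

# Minimax safety floor: worst-case regret no policy can beat
minimax_floor <- (M / 2) * delta     # 0.4

# An observed regret of 0.2 therefore only certifies
# interventional regret up to 0.2 + 0.8 = 1.0
observed_regret <- 0.2
regret_bound <- observed_regret + transfer_penalty
regret_bound
#> [1] 1
```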

Implications for AI/ML Safety

  1. No algorithm can beat the safety floor: Even infinite data doesn’t help if confounding exists
  2. Deficiency is the price of observational learning: To eliminate the safety floor, you need randomized experiments
  3. Confidence intervals aren’t enough: Standard ML uncertainty quantification doesn’t capture confounding bias

Practical Workflow

Step 1: Define the Causal Problem
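A sketch of what this step might look like. The constructor name causal_spec() and its arguments are assumptions for illustration; only the resulting spec object is used by the functions shown later in this vignette:

```r
library(causaldef)

# Hypothetical setup: a data frame with treatment A, outcome Y,
# and observed covariates X1, X2. Constructor name and argument
# names are assumptions; consult the package docs for the actual
# interface.
spec <- causal_spec(
  data       = my_data,
  treatment  = "A",
  outcome    = "Y",
  covariates = c("X1", "X2")
)
```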

Step 2: Estimate Deficiency
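Using estimate_deficiency() with the same argument names that appear in the grf example later in this vignette:

```r
# AIPW-based deficiency estimate with bootstrap uncertainty
def <- estimate_deficiency(
  spec,
  methods = c("aipw"),
  n_boot  = 50
)

print(def)
```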

Step 3: Visualize Deficiency
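Assuming the estimate carries a plot method (a plausible but unconfirmed interface):

```r
# Assumed generic; the package may expose a dedicated
# plotting helper for deficiency objects instead
plot(def)
```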

Step 4: Compute Policy Regret Bounds
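The accessor names below ($transfer_penalty, $minimax_floor, $regret_bound) match the summary table at the end of this vignette; the argument names passed to policy_regret_bound() are assumptions:

```r
# Argument names are illustrative guesses; returned fields
# are as listed in the summary table
bound <- policy_regret_bound(
  def,
  observed_regret = 0.2,   # in-sample regret of the candidate policy
  utility_range   = 10     # M: width of the outcome range
)

bound$transfer_penalty   # M * delta
bound$minimax_floor      # (M / 2) * delta
bound$regret_bound       # observed regret + transfer penalty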

Step 5: Visualize the Safety Floor
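A minimal base-R rendering, assuming a bound object produced in Step 4 with the accessor names from the summary table:

```r
# Compare observed regret, the transfer bound, and the
# irreducible safety floor on one scale
vals <- c(
  observed = bound$regret_bound - bound$transfer_penalty,
  bound    = bound$regret_bound,
  floor    = bound$minimax_floor
)
barplot(vals, ylab = "Regret",
        main = "Observed regret, transfer bound, and safety floor")
abline(h = bound$minimax_floor, lty = 2)  # dashed line at the floor
```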

Interpreting the Results

The Safety Floor Report
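The delta thresholds from the pre-deployment checklist below can be turned into a small decision helper (the helper itself is illustrative, not part of causaldef):

```r
# Decision helper using the delta thresholds from the
# pre-deployment checklist later in this vignette
interpret_floor <- function(delta) {
  if (delta < 0.05) {
    "Excellent: deploy with confidence"
  } else if (delta <= 0.10) {
    "Moderate: deploy with active monitoring"
  } else {
    "Concerning: consider a pilot RCT"
  }
}

interpret_floor(0.08)
#> [1] "Moderate: deploy with active monitoring"
```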

Sensitivity Analysis with Confounding Frontiers

What if there’s additional unmeasured confounding?

# Map the confounding frontier
frontier <- confounding_frontier(
  spec,
  alpha_range = c(-2, 2),
  gamma_range = c(-2, 2),
  grid_size = 30
)
#> ℹ Computing benchmarks for observed covariates...
#> ✔ Computed confounding frontier: 30x30 grid

# Find the safe region
safe_region <- subset(frontier$grid, delta < 0.1)
cat(sprintf(
  "Safe operating region covers %.1f%% of confounding space\n",
  100 * nrow(safe_region) / nrow(frontier$grid)
))
#> Safe operating region covers 100.0% of confounding space

Visualize the Frontier
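A heat map of the grid makes the safe region visible. The sketch below assumes frontier$grid has columns alpha and gamma alongside delta (only delta is confirmed by the subset() call above):

```r
library(ggplot2)

# Heat map of deficiency over sensitivity parameters, with a
# white contour at the delta = 0.1 safety threshold
ggplot(frontier$grid, aes(x = alpha, y = gamma, fill = delta)) +
  geom_tile() +
  geom_contour(aes(z = delta), breaks = 0.1, colour = "white") +
  labs(
    x    = expression(alpha ~ "(confounder-treatment strength)"),
    y    = expression(gamma ~ "(confounder-outcome strength)"),
    fill = expression(delta)
  )
```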

Policy Learning with grf (Optional)

If you have the grf package installed, you can use causal forests for heterogeneous treatment effect estimation with deficiency bounds:

# Estimate deficiency using causal forests
if (requireNamespace("grf", quietly = TRUE)) {
  def_grf <- estimate_deficiency(
    spec,
    methods = c("aipw", "grf"),
    n_boot = 50
  )
  
  print(def_grf)
  
  # Get individual treatment effect predictions
  kernel_grf <- def_grf$kernel$grf
  if (!is.null(kernel_grf$tau_hat)) {
    cat("\nHeterogeneous Effects Detected:\n")
    cat(sprintf("ATE from forest: %.2f\n", kernel_grf$ate))
    cat(sprintf("CATE range: [%.2f, %.2f]\n", 
                min(kernel_grf$tau_hat), 
                max(kernel_grf$tau_hat)))
  }
}

Best Practices for Safe Deployment

Pre-Deployment Checklist

| Check | Threshold | Action if failed |
|---|---|---|
| \(\delta < 0.05\) | Excellent | Deploy with confidence |
| \(\delta \in [0.05, 0.10]\) | Moderate | Deploy with active monitoring |
| \(\delta > 0.10\) | Concerning | Consider a pilot RCT |
| NC diagnostic falsified | Any | Do not deploy without more data |

Monitoring in Production
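One way to operationalize monitoring is to re-estimate the deficiency on fresh production data at a fixed cadence. The sketch below assumes an update() method for spec objects and a delta field on the estimate; both names are guesses, not the package's confirmed API:

```r
# Sketch: re-estimate deficiency on a window of production data
# and alarm when the safety floor drifts past a tolerance
monitor_deficiency <- function(new_data, spec, M, tol = 0.05) {
  spec_new <- update(spec, data = new_data)  # assumed update method
  def_new  <- estimate_deficiency(spec_new, methods = "aipw", n_boot = 50)
  floor    <- (M / 2) * def_new$delta        # field name assumed
  if (floor > tol) {
    warning("Safety floor exceeds tolerance; review policy before continuing")
  }
  invisible(floor)
}
```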

Mathematical Details

Policy Regret Transfer (Manuscript)

For any policy \(\pi\) and bounded utility function \(u \in [0, M]\):

\[\mathbb{E}_{P^{do}}\left[\max_a u(a, X) - u(\pi(X), X)\right] \leq \mathbb{E}_{P^{obs}}\left[\max_a u(a, X) - u(\pi(X), X)\right] + M\delta\]

Proof sketch: The deficiency \(\delta\) bounds the total variation distance between the (simulated) observational and target interventional laws. Since utility is bounded by \(M\), the maximum discrepancy in expected utility is at most \(M\) times the total variation gap.
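Spelling out the bounding step: for any function \(f\) with values in \([0, M]\) and any two distributions \(P, Q\),

\[\left|\mathbb{E}_{P}[f] - \mathbb{E}_{Q}[f]\right| \leq M \cdot \mathrm{TV}(P, Q) \leq M\delta,\]

applied with \(f(x) = \max_a u(a, x) - u(\pi(x), x)\), which lies in \([0, M]\) because \(u \in [0, M]\).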

Why This Matters

Traditional ML focuses on:

  - Prediction error: How well does my model predict \(Y\)?
  - Generalization: Does performance hold on new data?

But for causal policy learning, we need:

  - Interventional validity: Does my policy work when deployed?
  - Confounding robustness: How much could unmeasured bias hurt me?

The safety floor answers these questions with formal guarantees.

Summary

| Concept | Definition | Function |
|---|---|---|
| Transfer penalty | \(M\delta\) — additive regret inflation term | $transfer_penalty |
| Minimax safety floor | \((M/2)\delta\) — irreducible worst-case regret | $minimax_floor |
| Regret bound | observed regret + transfer penalty | $regret_bound |
| Deficiency | information gap between obs and do | estimate_deficiency() |
| Confounding frontier | deficiency as a function of \((\alpha, \gamma)\) | confounding_frontier() |

Use these tools to make safe, accountable decisions from observational data.

References

  1. Akdemir, D. (2026). Constraints on Causal Inference as Experiment Comparison. DOI: 10.5281/zenodo.18367347. See thm:policy_regret (Policy Regret Transfer) and thm:safety_floor (Minimax Safety Floor).

  2. Athey, S., & Wager, S. (2021). Policy learning with observational data. Econometrica, 89(1), 133-161.

  3. Kallus, N. (2020). Confounding-robust policy evaluation in infinite-horizon reinforcement learning. NeurIPS.