Generalizability Path Example: Characterizing Underrepresented Populations

Overview

A common challenge in translating evidence from randomized controlled trials (RCTs) to real-world practice is that trial participants may not reflect the broader target population. In the framing of Parikh et al. (2025), “underrepresented” (or “insufficiently represented”) subgroups occupy regions of the covariate space with heterogeneous treatment effects and insufficient representation in the trial data. When such subgroups exist, estimates of the Target Average Treatment Effect (TATE) can be imprecise or misleading once transported to the target population. The Sample Average Treatment Effect (SATE) is the finite-sample counterpart of the TATE.

ROOT addresses this by identifying the sufficiently represented subpopulation. Its estimand is the Weighted Target Average Treatment Effect (WTATE): the average treatment effect restricted to that subpopulation, which can be estimated with lower variance than the unweighted TATE.
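One way to write these estimands is sketched below. The notation is assumed here for illustration and is not taken from the package documentation: \(Y(1)\) and \(Y(0)\) are potential outcomes, \(S = 0\) marks the target population, and \(w(X) \in \{0, 1\}\) indicates membership in the sufficiently represented subpopulation.

```latex
\[
\text{TATE} = \mathbb{E}\left[\, Y(1) - Y(0) \mid S = 0 \,\right],
\qquad
\text{WTATE}(w) =
  \frac{\mathbb{E}\left[\, w(X)\,\{Y(1) - Y(0)\} \mid S = 0 \,\right]}
       {\mathbb{E}\left[\, w(X) \mid S = 0 \,\right]}.
\]
```

Under this notation, setting \(w \equiv 1\) recovers the TATE, so the WTATE differs from it only through the units that receive \(w = 0\).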

This vignette walks through a complete generalizability analysis using the built-in diabetes_data dataset.


The diabetes_data Dataset

diabetes_data is a simulated dataset that mimics a diabetes intervention study. It contains 2,000 individuals in a randomized controlled trial (RCT) sample (S = 1) and 8,000 individuals from the target population (S = 0) to which we want to generalize. In the str() output below, treatment (Tr) and outcome (Y) are NA for target-population rows.

library(ROOT)

data(diabetes_data, package = "ROOT")
str(diabetes_data)
#> 'data.frame':    10000 obs. of  7 variables:
#>  $ Race_Black: int  0 1 1 0 0 0 0 0 0 0 ...
#>  $ Sex_Male  : int  1 0 1 1 1 1 1 0 1 1 ...
#>  $ DietYes   : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ Age45     : int  0 1 0 0 1 0 0 0 1 0 ...
#>  $ S         : int  1 0 0 0 0 0 0 0 0 0 ...
#>  $ Tr        : int  0 NA NA NA NA NA NA NA NA NA ...
#>  $ Y         : num  0.818 NA NA NA NA ...

The key columns are:

| Column | Description |
|---|---|
| Y | Observed outcome (numeric) |
| Tr | Treatment assignment (0 = control, 1 = treated) |
| S | Sample indicator (1 = RCT, 0 = target population) |
| Age45 | Age ≥ 45 (binary indicator) |
| DietYes | Currently on a diet programme (binary indicator) |
| Race_Black | Race: Black (binary indicator) |
| Sex_Male | Sex: Male (binary indicator) |

# How many trial vs target population units?
table(S = diabetes_data$S)
#> S
#>    0    1 
#> 8000 2000

# Treatment breakdown within the trial
table(Tr = diabetes_data$Tr[diabetes_data$S == 1])
#> Tr
#>    0    1 
#>  977 1023
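
Since Tr and Y appear as NA for target-population rows in the str() output above, a quick structural check before modelling can catch coding mistakes. The sketch below runs on a small hypothetical data frame (toy, not diabetes_data) with the same S / Tr / Y layout. The helper check_layout is an illustrative name and is not part of ROOT.

```r
# Hypothetical toy data with the same S / Tr / Y layout as diabetes_data.
toy <- data.frame(
  S  = c(1, 1, 1, 0, 0, 0),
  Tr = c(0, 1, 1, NA, NA, NA),
  Y  = c(0.8, 1.9, 2.4, NA, NA, NA)
)

# Illustrative helper (not a ROOT function): checks that treatment and outcome
# are fully observed in the trial and missing in the target sample.
check_layout <- function(d) {
  trial <- d$S == 1
  c(
    trial_complete = !anyNA(d$Tr[trial]) && !anyNA(d$Y[trial]),
    target_missing = all(is.na(d$Tr[!trial])) && all(is.na(d$Y[!trial])),
    binary_codes   = all(d$S %in% c(0, 1)) && all(d$Tr[trial] %in% c(0, 1))
  )
}

layout_ok <- check_layout(toy)
print(layout_ok)
```

The same function could be applied to any data frame that follows this layout; whether missing Tr and Y in the target rows is strictly required by the package is not stated here.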

Checking Covariate Overlap

Before running ROOT, it is good practice to check whether trial participants differ from the target population on key covariates. Systematic differences signal which subgroups may be underrepresented.

# Mean of each covariate by S
covariate_cols <- c("Age45", "DietYes", "Race_Black", "Sex_Male")

overlap <- sapply(covariate_cols, function(v) {
  tapply(diabetes_data[[v]], diabetes_data$S, mean, na.rm = TRUE)
})

knitr::kable(
  t(overlap),
  digits  = 3,
  caption = "Covariate means by sample membership (S = 1: trial, S = 0: target)"
)

Covariate means by sample membership (S = 1: trial, S = 0: target)

| Covariate | S = 0 (target) | S = 1 (trial) |
|---|---|---|
| Age45 | 0.153 | 0.154 |
| DietYes | 0.099 | 0.096 |
| Race_Black | 0.315 | 0.172 |
| Sex_Male | 0.460 | 0.557 |

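Rows in which the two columns differ noticeably flag potential sources of underrepresentation that ROOT will attempt to characterize. Here, Race_Black (0.315 in the target versus 0.172 in the trial) and Sex_Male (0.460 versus 0.557) differ most, while Age45 and DietYes have nearly identical marginal means.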
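
Raw differences in means can be hard to compare across covariates, so a standardized mean difference (SMD) is a common companion summary. The sketch below computes it from the means in the table above, using the fact that a 0/1 indicator with mean p has variance p(1 - p). The pooled-SD form of the SMD is a choice made for this illustration and is not something ROOT computes.

```r
# Covariate means copied from the table above (all are binary indicators).
means_target <- c(Age45 = 0.153, DietYes = 0.099, Race_Black = 0.315, Sex_Male = 0.460)
means_trial  <- c(Age45 = 0.154, DietYes = 0.096, Race_Black = 0.172, Sex_Male = 0.557)

# For a 0/1 variable with mean p, the variance is p * (1 - p).
pooled_sd <- sqrt((means_trial * (1 - means_trial) +
                   means_target * (1 - means_target)) / 2)

# Standardized mean difference: trial minus target, in pooled-SD units.
smd <- (means_trial - means_target) / pooled_sd
print(round(smd, 2))
```

With these inputs, Race_Black comes out near -0.34 and Sex_Male near 0.19, while Age45 and DietYes are close to zero. A frequently used rule of thumb treats absolute values above about 0.1 as worth a closer look, although marginal balance alone does not rule out underrepresentation in combinations of covariates.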


Fitting ROOT in Generalizability Mode

We use characterizing_underrep(), which is the high-level wrapper around ROOT() for generalizability/transportability analyses. It expects data to contain Y, Tr, and S, and internally:

  1. Estimates transportability scores using logistic regression models (default) for \(P(S = 1 \mid X)\) and \(P(\text{Tr} = 1 \mid X, S = 1)\).
  2. Constructs Horvitz–Thompson-style influence scores \(v_i\).
  3. Grows a forest of weighted trees that minimize the variance of the weighted estimator \(\widehat{\text{WTATE}}\).
  4. Selects a Rashomon set of the top-\(k\) trees and aggregates their weight assignments by majority vote (default).
  5. Fits a single summary tree characterizing the final \(w_{\text{opt}}\) assignments.

gen_fit <- characterizing_underrep(
  data                  = diabetes_data,
  generalizability_path = TRUE,
  num_trees             = 20,
  top_k_trees           = TRUE,
  k                     = 10,
  seed                  = 123
)
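
To make steps 1 and 2 more concrete, here is a minimal sketch on simulated data. The exact form of \(v_i\) used inside ROOT is not shown in this vignette, so the score below (an inverse-odds-of-selection weight multiplied by an inverse-probability-of-treatment contrast) is an assumed, standard construction and not necessarily what the package computes. All variable names and the data-generating model are hypothetical.

```r
set.seed(42)

# Hypothetical simulated data; none of this comes from diabetes_data.
n  <- 4000
X1 <- rbinom(n, 1, 0.3)
X2 <- rbinom(n, 1, 0.5)
S  <- rbinom(n, 1, plogis(-1 - 0.8 * X1 + 0.4 * X2))  # trial membership
Tr <- ifelse(S == 1, rbinom(n, 1, 0.5), NA)           # randomized within trial
Y  <- ifelse(S == 1, 1 + 2 * Tr + 3 * Tr * X1 + rnorm(n), NA)
d  <- data.frame(X1, X2, S, Tr, Y)

# Step 1: logistic models for P(S = 1 | X) and P(Tr = 1 | X, S = 1).
fit_s  <- glm(S ~ X1 + X2, family = binomial, data = d)
trial  <- d[d$S == 1, ]
fit_tr <- glm(Tr ~ X1 + X2, family = binomial, data = trial)
pi_hat <- predict(fit_s, newdata = trial, type = "response")
e_hat  <- predict(fit_tr, newdata = trial, type = "response")

# Step 2 (assumed form): inverse-odds weight times an IPW treatment contrast.
odds_w <- (1 - pi_hat) / pi_hat
v <- odds_w * (trial$Tr * trial$Y / e_hat -
               (1 - trial$Tr) * trial$Y / (1 - e_hat))

# Normalized (Hajek-style) transported estimate built from the trial-unit scores.
tate_hat <- sum(v) / sum(odds_w)
print(tate_hat)
```

In this simulation the treatment effect is larger when X1 = 1 and such units are less likely to enter the trial, which is the kind of setting where individual scores for scarce subgroups carry large weights and inflate the variance of the transported estimate.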

Inspecting the Results

Detailed summary

summary() reports the summary tree, the unweighted and weighted estimands with their standard errors, the Rashomon set size, the percentage of observations with \(w_{\text{opt}} = 1\), and a leaf-by-leaf summary.

summary(gen_fit)
#> characterizing_underrep object
#>   --- ROOT summary ---
#> ROOT object
#>   Generalizability mode: TRUE 
#> 
#> Summary classifier (f):
#> n= 2000 
#> 
#> node), split, n, loss, yval, (yprob)
#>       * denotes terminal node
#> 
#>  1) root 2000 735 1 (0.36750000 0.63250000)  
#>    2) Race_Black>=0.5 345   0 0 (1.00000000 0.00000000) *
#>    3) Race_Black< 0.5 1655 390 1 (0.23564955 0.76435045)  
#>      6) Age45>=0.5 260   0 0 (1.00000000 0.00000000) *
#>      7) Age45< 0.5 1395 130 1 (0.09318996 0.90681004)  
#>       14) DietYes>=0.5 130   0 0 (1.00000000 0.00000000) *
#>       15) DietYes< 0.5 1265   0 1 (0.00000000 1.00000000) *
#> 
#> Global objective function:
#>   User-supplied: No (default objective used)
#> 
#> Estimand summary (generalization mode):
#>   Unweighted SATE = 5.48424, SE = 0.4290411
#>   Weighted   WTATE = 3.55263, SE = 0.3609503
#> 
#> Diagnostics:
#>   Number of trees grown: 20
#>   Rashomon set size: 10
#>   % observations with w_opt == 1: 63.2%
#> 
#>   Leaf summary:    4 terminal nodes
#>  leaf_id                                               rule predicted_w    n
#>        2                             root & Race_Black>=0.5           0  345
#>        6                root & Race_Black< 0.5 & Age45>=0.5           0  260
#>       14 root & Race_Black< 0.5 & Age45< 0.5 & DietYes>=0.5           0  130
#>       15 root & Race_Black< 0.5 & Age45< 0.5 & DietYes< 0.5           1 1265
#>   pct                           label
#>  17.2 Under-represented (drop, w = 0)
#>  13.0 Under-represented (drop, w = 0)
#>   6.5 Under-represented (drop, w = 0)
#>  63.2       Represented (keep, w = 1)

The unweighted SATE is the trial average treatment effect transported to the full target population. The WTATE restricts this estimate to the sufficiently represented subpopulation, where the trial provides more reliable evidence. Here the WTATE has a smaller standard error than the SATE (0.361 versus 0.429), which reflects the variance reduction achieved by this restriction. The two point estimates (3.55 versus 5.48) refer to different populations, so the WTATE should be reported together with the subgroup rule that defines it.
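
The size of the variance reduction can be read off the reported standard errors with a little arithmetic. The sketch below copies the values from the summary() output above. The 95% intervals use a normal approximation, which is an assumption of this illustration and is not something the output above reports.

```r
# Estimates and standard errors copied from the summary() output above.
sate  <- 5.48424; se_sate  <- 0.4290411
wtate <- 3.55263; se_wtate <- 0.3609503

# Relative precision: ratio of standard errors and of variances.
se_ratio  <- se_wtate / se_sate
var_ratio <- se_ratio^2

# Normal-approximation 95% intervals (an assumption of this sketch).
z        <- qnorm(0.975)
ci_sate  <- sate  + c(-1, 1) * z * se_sate
ci_wtate <- wtate + c(-1, 1) * z * se_wtate

print(round(c(se_ratio = se_ratio, var_ratio = var_ratio), 3))
print(round(rbind(SATE = ci_sate, WTATE = ci_wtate), 2))
```

The standard error ratio is about 0.84, so the WTATE variance is roughly 71% of the SATE variance, a reduction of about 29%.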

Terminal node rules

The leaf_summary component of the returned object gives an explicit human-readable rule for each terminal node of the summary tree, along with the number and percentage of observations in each leaf and whether they are classified as represented (\(w = 1\)) or underrepresented (\(w = 0\)).

gen_fit$leaf_summary
#>   leaf_id                                               rule predicted_w    n
#> 1       2                             root & Race_Black>=0.5           0  345
#> 2       6                root & Race_Black< 0.5 & Age45>=0.5           0  260
#> 3      14 root & Race_Black< 0.5 & Age45< 0.5 & DietYes>=0.5           0  130
#> 4      15 root & Race_Black< 0.5 & Age45< 0.5 & DietYes< 0.5           1 1265
#>    pct                           label
#> 1 17.2 Under-represented (drop, w = 0)
#> 2 13.0 Under-represented (drop, w = 0)
#> 3  6.5 Under-represented (drop, w = 0)
#> 4 63.2       Represented (keep, w = 1)
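
The leaf counts can be cross-checked against the rest of the output with a few lines of base R. The sketch below re-enters the counts printed above by hand, so it runs without the fitted object; with gen_fit available, the same operations could presumably be applied to gen_fit$leaf_summary directly.

```r
# Leaf counts copied from the gen_fit$leaf_summary output above.
leaves <- data.frame(
  leaf_id     = c(2, 6, 14, 15),
  predicted_w = c(0, 0, 0, 1),
  n           = c(345, 260, 130, 1265)
)

# Total trial units covered by the leaves, and units per weight class.
n_total <- sum(leaves$n)
n_by_w  <- tapply(leaves$n, leaves$predicted_w, sum)

# Share of trial units kept (w = 1), in percent.
pct_kept <- 100 * n_by_w[["1"]] / n_total
print(n_by_w)
print(pct_kept)
```

The leaves account for all 2,000 trial units. The 735 units with \(w = 0\) match the loss shown at the root node of the printed tree, and the kept share agrees with the 63.2% reported in the diagnostics.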

Visualizing the Characterization Tree

plot() renders the final characterized tree from the Rashomon set. Blue leaves (\(w = 1\)) denote well-represented subgroups; orange leaves (\(w = 0\)) denote underrepresented subgroups. The percentage shown in each leaf is the share of trial units falling into that node.

plot(gen_fit)

[Figure: Characterized tree for diabetes generalizability analysis]

The tree reads top-down as a decision rule: starting from the root (all trial units), the first split separates subgroups that are wholly underrepresented from those that may be included. Follow the branches down to each leaf to read the complete inclusion/exclusion rule for that subgroup.
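
For the tree printed by summary() earlier, the complete rule can be written as a short function. This is a hand transcription of the printed splits for illustration. The name rule_w is hypothetical and is not part of ROOT.

```r
# Inclusion rule read off the printed summary tree: w = 1 only when all three
# indicators are 0; any indicator equal to 1 sends the unit to a w = 0 leaf.
rule_w <- function(d) {
  as.integer(d$Race_Black < 0.5 & d$Age45 < 0.5 & d$DietYes < 0.5)
}

# Hypothetical profiles, one per terminal node of the tree (leaves 2, 6, 14, 15).
profiles <- data.frame(
  Race_Black = c(1, 0, 0, 0),
  Age45      = c(0, 1, 0, 0),
  DietYes    = c(0, 0, 1, 0)
)
profiles$w <- rule_w(profiles)
print(profiles)
```

Sex_Male does not appear in the rule because the summary tree never splits on it.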


Interpreting the Output

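From the characterized tree and leaf summary, we can describe the underrepresented subgroups in plain language:

  1. Black participants (17.2% of trial units).
  2. Non-Black participants aged 45 or older (13.0%).
  3. Non-Black participants under 45 who are currently on a diet programme (6.5%).

The remaining 63.2% of trial units (non-Black, under 45, and not on a diet programme) form the sufficiently represented subpopulation (\(w = 1\)).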

The Rashomon set provides multiple near-optimal characterizations of these subgroups. The final summary tree aggregates across all trees in the set, giving a single interpretable rule.


Key Parameters

| Parameter | Role | Default |
|---|---|---|
| num_trees | Number of trees to grow in the forest | 10 |
| top_k_trees | If TRUE, select the top k trees by objective value | FALSE |
| k | Rashomon set size when top_k_trees = TRUE | 10 |
| cutoff | Rashomon threshold when top_k_trees = FALSE; "baseline" uses the objective at \(w \equiv 1\) | "baseline" |
| vote_threshold | Fraction of Rashomon-set trees that must vote \(w = 1\) for a unit to be included | 2/3 |
| seed | Random seed for reproducibility | NULL |
| feature_est | Feature importance method used to bias split selection ("Ridge", "GBM", or a custom function) | "Ridge" |
| leaf_proba | Controls tree depth by increasing the probability of stopping at a leaf | 0.25 |
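
The interplay of top_k_trees, k, and vote_threshold can be illustrated with a toy example. The sketch below is a simplified stand-in for the selection and voting logic, not the package's implementation. The vote matrix and objective values are simulated, lower objective values are treated as better because the trees minimize variance, and the strict inequality at the threshold is an assumption.

```r
set.seed(7)

# Hypothetical inputs: 0/1 weight assignments from 20 trees for 8 units,
# plus one objective value per tree (lower is treated as better here).
n_units <- 8
n_trees <- 20
votes     <- matrix(rbinom(n_units * n_trees, 1, 0.7), nrow = n_units)
objective <- runif(n_trees)

# top_k_trees = TRUE analogue: keep the k trees with the smallest objective.
k        <- 10
rashomon <- order(objective)[seq_len(k)]

# vote_threshold analogue: fraction of retained trees voting w = 1 per unit.
# Whether the comparison is strict (>) or not (>=) is assumed, not documented here.
vote_threshold <- 2 / 3
vote_share     <- rowMeans(votes[, rashomon, drop = FALSE])
w_opt          <- as.integer(vote_share > vote_threshold)

print(data.frame(vote_share, w_opt))
```

Raising vote_threshold in this sketch makes inclusion stricter, so fewer units end up with \(w = 1\). Increasing k brings more trees into the vote.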

Reference

Parikh, H., Ross, R. K., Stuart, E., & Rudolph, K. E. (2025). Who Are We Missing?: A Principled Approach to Characterizing the Underrepresented Population. Journal of the American Statistical Association, 120(551), 1414–1423. https://doi.org/10.1080/01621459.2025.2495319