In order to use `cram_bandit()`, users must supply a matrix of action selection probabilities \(\pi_t(X_j, A_j)\) for each combination of policy update \(t\) and context \(j\) in the historical dataset.
While some environments log these probabilities directly, many contextual bandit libraries (such as `contextual`) only store policy parameters (e.g., regression coefficients) without explicit probability tracking.
This article explains how Cram Bandit Helpers reconstruct \(\pi_t(X_j, A_j)\) from these parameters for common policies:
| Policy Type | Class Name |
|---|---|
| Epsilon-Greedy | `BatchContextualEpsilonGreedyPolicy` |
| LinUCB Disjoint with \(\varepsilon\)-greedy exploration | `BatchLinUCBDisjointPolicyEpsilon` |
| Thompson Sampling | `BatchContextualLinTSPolicy` |
Both theoretical formulas and practical code snippets are provided.
When using linear bandit algorithms like Epsilon-Greedy, LinUCB, or Thompson Sampling, each arm \(k\) maintains summary statistics (parameters) to estimate the expected reward:
\(A_k\) is the Gram matrix:
\[
A_k = X_k^T X_k
\]
where \(X_k\) is the matrix of feature vectors (contexts) for all rounds where arm \(k\) was selected.
Interpretation: \(A_k\) captures the amount of information (and correlation structure) about the features for arm \(k\). It plays the role of a "feature covariance matrix."
\(b_k\) is the response vector:
\[
b_k = X_k^T y_k
\]
where \(y_k\) are the observed rewards for arm \(k\).
Interpretation: \(b_k\) captures the relationship between the observed rewards and the contexts for arm \(k\).
These sufficient statistics allow the policy to compute the Least Squares estimate for the reward model:
\[ \theta_k = A_k^{-1} b_k \]
where \(\theta_k\) is the estimated coefficient vector of the reward model for arm \(k\).
Thus, the policy selects an action based on the \(\theta_k\) of each arm \(k\), then observes the reward associated with this choice, which is used to update the parameters \(A_k\) and \(b_k\) of the policy.
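For illustration, the following R snippet (a minimal sketch with simulated data, not code from `cramR` or `contextual`) builds \(A_k\), \(b_k\), and \(\theta_k\) for a single arm:

```r
# Minimal sketch: per-arm sufficient statistics and least squares estimate,
# computed from simulated contexts and rewards for one arm k.
set.seed(42)

d   <- 3                                # number of context features
X_k <- matrix(rnorm(10 * d), ncol = d)  # contexts of rounds where arm k was played
y_k <- rnorm(10)                        # rewards observed for arm k

A_k <- t(X_k) %*% X_k                   # Gram matrix
b_k <- t(X_k) %*% y_k                   # response vector
theta_k <- solve(A_k, b_k)              # least squares estimate theta_k = A_k^{-1} b_k
theta_k

# In practice, A_k is often initialized to the identity matrix (a ridge term)
# so that it remains invertible before many observations have accumulated.
```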
In Epsilon-Greedy, with exploration rate \(\varepsilon\), the probability of selecting one of the best arms is:
\[ P(A_t | X_t) = (1 - \varepsilon) \times \frac{1}{\# \text{best arms}} + \varepsilon \times \frac{1}{K} \]
while the probability of selecting an arm that is not among the best arms is:
\[ P(A_t | X_t) = \varepsilon \times \frac{1}{K} \]
where \(K\) is the total number of arms and \(\# \text{best arms}\) is the number of arms tied for the highest estimated expected reward.
We define the least squares estimate as:
\[ \theta_k = A_k^{-1} b_k \quad \text{(Least Squares estimate)} \]
where \(A_k\) and \(b_k\) are the per-arm sufficient statistics defined above.
Best arms are identified via the estimated expected reward:
\[ \text{Expected reward} = X_t^T \theta_k \]
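A minimal sketch of this computation in R (the helper name is hypothetical, not the `cramR` implementation):

```r
# Epsilon-greedy selection probabilities, given per-arm estimates theta (d x K),
# a context x_t, and exploration rate epsilon.
epsilon_greedy_probs <- function(theta, x_t, epsilon) {
  K <- ncol(theta)
  expected <- as.vector(t(theta) %*% x_t)   # X_t^T theta_k for each arm
  best <- which(expected == max(expected))  # arms tied for the best estimate
  probs <- rep(epsilon / K, K)              # uniform exploration mass
  probs[best] <- probs[best] + (1 - epsilon) / length(best)  # exploitation mass
  probs
}

# Example with 3 arms and 2 features
theta <- matrix(c(0.5, 1.0, 0.2, 0.1, 0.9, 0.4), nrow = 2)
epsilon_greedy_probs(theta, x_t = c(1, 0.5), epsilon = 0.1)
```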
LinUCB selects arms based on Upper Confidence Bounds (UCBs):
\[ \text{UCB}_k(X_t) = \mu_k(X_t) + \alpha \sigma_k(X_t) \]
where \(\mu_k(X_t) = X_t^T \theta_k\) is the estimated mean reward for arm \(k\), \(\sigma_k(X_t) = \sqrt{X_t^T A_k^{-1} X_t}\) is the uncertainty of that estimate, and \(\alpha \geq 0\) controls the amount of exploration.
The action probabilities follow the same structure as Epsilon-Greedy, but with UCB scores in place of plain expected rewards. That is, the probability of selecting one of the best arms is:
\[ P(A_t | X_t) = (1 - \varepsilon) \times \frac{1}{\# \text{best arms}} + \varepsilon \times \frac{1}{K} \]
while the probability of selecting an arm that is not among the best arms is:
\[ P(A_t | X_t) = \varepsilon \times \frac{1}{K} \]
where "best arms" are those with the highest UCB scores.
In `cramR`, this is implemented by a helper function that reconstructs these probabilities for each context and policy update from the stored \(A_k\) and \(b_k\) statistics.
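Below is a minimal sketch of the computation described above, assuming per-arm lists `A` and `b` holding the statistics defined earlier; the function is illustrative and not the `cramR` helper itself.

```r
# Epsilon-greedy probabilities computed over LinUCB scores.
linucb_epsilon_probs <- function(A, b, x_t, alpha, epsilon) {
  K <- length(A)
  ucb <- sapply(seq_len(K), function(k) {
    A_inv   <- solve(A[[k]])
    theta_k <- A_inv %*% b[[k]]
    mu      <- as.numeric(t(x_t) %*% theta_k)               # estimated mean reward
    sigma   <- sqrt(as.numeric(t(x_t) %*% A_inv %*% x_t))   # uncertainty term
    mu + alpha * sigma                                      # upper confidence bound
  })
  best  <- which(ucb == max(ucb))                           # arms tied for best UCB
  probs <- rep(epsilon / K, K)                              # uniform exploration mass
  probs[best] <- probs[best] + (1 - epsilon) / length(best) # exploitation mass
  probs
}

# Example with 3 arms and 2 features
set.seed(1)
A <- replicate(3, diag(2) + crossprod(matrix(rnorm(10), ncol = 2)), simplify = FALSE)
b <- replicate(3, rnorm(2), simplify = FALSE)
linucb_epsilon_probs(A, b, x_t = c(1, 0.5), alpha = 1, epsilon = 0.1)
```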
In Thompson Sampling, a coefficient vector is drawn from each arm's posterior and the arm with the maximum sampled value is chosen. The probability of selecting arm \(A_t\) is therefore the probability that its draw dominates all others:
\[ P(A_t | X_t) = \mathbb{P}\left( \theta_{A_t}^T X_t > \theta_{k}^T X_t \quad \forall k \neq A_t \right) \]
where \(\theta_k \sim \mathcal{N}(A_k^{-1} b_k, \sigma^2 A_k^{-1})\).
This requires computing a multivariate probability, which we approximate via adaptive numerical integration.
In `cramR`, this is implemented by a helper function that approximates this probability for each context and policy update.
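As an illustration, the sketch below approximates this probability by Monte Carlo sampling rather than the adaptive numerical integration used in `cramR`: for a disjoint linear model the projections \(X_t^T \theta_k\) are independent univariate normals, so the probability that a given arm attains the maximum can be estimated by sampling.

```r
# Monte Carlo approximation of P(arm k has the largest posterior draw of X_t^T theta_k).
thompson_probs <- function(A, b, x_t, sigma = 1, n_draws = 10000) {
  K <- length(A)
  draws <- sapply(seq_len(K), function(k) {
    A_inv <- solve(A[[k]])
    mu_k  <- as.numeric(t(x_t) %*% A_inv %*% b[[k]])            # posterior mean of X_t^T theta_k
    sd_k  <- sigma * sqrt(as.numeric(t(x_t) %*% A_inv %*% x_t)) # posterior sd of X_t^T theta_k
    rnorm(n_draws, mean = mu_k, sd = sd_k)
  })
  # frequency with which each arm attains the maximum across the K columns
  tabulate(max.col(draws), nbins = K) / n_draws
}

# Example with 3 arms and 2 features
set.seed(2)
A <- replicate(3, diag(2) + crossprod(matrix(rnorm(10), ncol = 2)), simplify = FALSE)
b <- replicate(3, rnorm(2), simplify = FALSE)
thompson_probs(A, b, x_t = c(1, 0.5))
```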
When using your bandit policy in practice, pass `pi`, `arm`, and `reward` into `cram_bandit()` to evaluate your policy; a hedged usage sketch is shown below.
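The argument names `pi`, `arm`, and `reward` follow the description above, but the exact shape expected for `pi` (and any additional arguments) should be checked against the `cram_bandit()` documentation; the data here are placeholders only.

```r
library(cramR)

# Assumed layout, per the description above: pi[t, j] = pi_t(X_j, A_j),
# i.e. the probability that policy update t assigns to the arm actually
# chosen in context j. Placeholder data only.
T_steps <- 100
pi      <- matrix(runif(T_steps * T_steps, 0.05, 0.95), nrow = T_steps)
arm     <- sample(1:3, T_steps, replace = TRUE)  # arm actually pulled at each step
reward  <- rnorm(T_steps)                        # observed rewards

# result <- cram_bandit(pi = pi, arm = arm, reward = reward)
# summary(result)
```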
The following only concerns `cram_bandit_sim()`, the simulation framework we implemented for benchmarking purposes.
Once the policies are reconstructed, we compute their true expected value, referred to as the estimand, by applying the learned policy to independent contexts and evaluating it against the known reward function used in the simulation; a sketch of this computation is given below.
Accurately computing the estimand is critical for properly assessing the bias and confidence interval coverage of the Cram estimate in our simulations.
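The sketch below illustrates the idea for a greedy learned policy under a linear simulation model; `true_theta`, `true_reward()`, and `policy_action()` are hypothetical stand-ins, not `cramR` internals. A stochastic policy (e.g., epsilon-greedy) would instead average over its action probabilities.

```r
# Monte Carlo estimate of the estimand: the true expected reward of the
# learned policy, averaged over independent contexts.
set.seed(1)

d <- 2
K <- 3
theta_hat  <- matrix(rnorm(d * K), nrow = d)                    # learned coefficients
true_theta <- matrix(c(0.2, 0.8, 0.5, 0.1, 0.9, 0.3), nrow = d) # simulator's truth

policy_action <- function(x) which.max(as.vector(t(theta_hat) %*% x))  # greedy choice
true_reward   <- function(x, k) sum(true_theta[, k] * x)               # known mean reward

contexts <- matrix(rnorm(10000 * d), ncol = d)   # independent evaluation contexts
estimand <- mean(apply(contexts, 1, function(x) true_reward(x, policy_action(x))))
estimand
```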
- `contextual` package: original framework
- `cram_bandit()`: Cram evaluation for contextual bandits
- `cram_bandit_sim()`: full simulation engine with automatic `pi` estimation

These helper functions were designed to faithfully reconstruct action probabilities for the policies implemented in `contextual`, while enabling reproducible Cram-based evaluation.