In the article “Introduction & Cram Policy Part 1”, we
introduced the Cram method, which enables simultaneous learning and
evaluation of a binary policy. In this section, we extend the framework
to machine learning tasks through the cram_ml()
function.
Cram ML outputs the Expected Loss Estimate, which
refers to the following statistical quantity:
\[
R(\hat{\pi}) = \mathbb{E}_{\tilde{X} \sim D} \left[ L(\tilde{X}, \hat{\pi}) \right].
\]
The Expected Loss Estimate represents the average loss that
would be incurred if a model, trained on a given data sample, were
deployed across the entire population. In the Cram framework, this
corresponds to estimating how the learned model generalizes to unseen
data—i.e., how it performs on new observations \(\tilde{X}\) drawn from the true
data-generating distribution \(D\),
independently of the training data.
This expected loss serves as the population-level performance metric (analogous to a policy value in policy learning), and Cram provides a consistent, low-bias estimate of this quantity by combining models trained on sequential batches and evaluating them on held-out observations.
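As a rough sketch of this idea (an illustration only, using toy data and a simple lm model of our own; cram_ml() performs this bookkeeping internally):
# Simplified sketch of the Cram idea: train on a growing prefix of batches
# and score each fitted model on the batches it has not yet seen.
set.seed(1)
toy <- data.frame(x = rnorm(40), y = rnorm(40))
batch_id <- rep(1:4, each = 10)  # 4 sequential batches of 10 observations
for (k in 1:3) {
  fit_k  <- lm(y ~ x, data = toy[batch_id <= k, ])        # train on batches 1..k
  held   <- toy[batch_id > k, ]                           # held-out batches
  loss_k <- (predict(fit_k, newdata = held) - held$y)^2   # individual losses
  # cram_ml() combines these per-model losses into the Expected Loss Estimate
}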
To illustrate the use of cram_ml(), we begin by
generating a synthetic dataset for a regression task. The data consists
of three independent covariates and a continuous outcome.
set.seed(42)
# Generate a synthetic regression dataset: three covariates, continuous outcome
X_data <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
Y_data <- rnorm(100)
data_df <- data.frame(X_data, Y = Y_data)
This section illustrates how to use cram_ml() with built-in modeling options available through the cramR package. The function integrates with the caret framework, allowing users to specify a learning algorithm, a loss function, and a batching strategy to evaluate model performance.
Beyond caret, cram_ml() also supports fully custom model training, prediction, and loss functions, making it suitable for virtually any machine learning task, including regression and classification.
The cram_ml() function offers extensive flexibility through its loss_name and caret_params arguments.
The loss_name argument specifies the performance metric used to evaluate the model at each batch. Note that Cram needs to compute individual losses (i.e., map a data point and a prediction to a loss value), which are then internally averaged across batches and observations to form the Expected Loss Estimate.
Depending on the task, losses are interpreted as follows. We denote by \(x_i\) a data point and by \(\hat{\pi}_k\) a model trained on the first \(k\) batches of data, and show how individual losses are computed for each of the package's built-in loss names.
Squared Error ("se"
):
\[
L(x_i, \hat{\pi}_k) = (\hat{y}_i - y_i)^2
\] Measures the squared difference between predicted and actual
outcomes.
Absolute Error ("ae"
):
\[
L(x_i, \hat{\pi}_k) = |\hat{y}_i - y_i|
\] Captures the magnitude of prediction error, regardless of
direction.
Accuracy ("accuracy"
):
\[
L(x_i, \hat{\pi}_k) = 1\{\hat{y}_i = y_i\}
\] It is more a performance metric than a loss here - Cram allows
you to define any performance metric that you want to estimate and
accuracy is a built-in example. The metric is 1 for correct predictions,
0 for incorrect ones.
Logarithmic Loss ("logloss"
):
The "logloss"
loss function measures how well predicted
class probabilities align with the true class labels. It applies to
both binary and multiclass classification tasks.
For a given observation \(i\), let:
\(y_i \in \{c_1, c_2, \dots, c_K\}\) be the true class label,
\(\hat{p}_k(i, c)\) be the predicted probability assigned to class \(c\) by the model.
The individual log loss is computed as: \[ L(x_i, \hat{\pi}_k) = -\log\left( \hat{p}_k(i, y_i) \right) \] That is, we take the negative log of the probability assigned to the true class.
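As a concrete illustration (toy numbers of our own, not package internals), the individual log loss simply picks out the predicted probability of the true class:
# Toy illustration of individual log-loss values
probs <- matrix(c(0.9, 0.1,
                  0.3, 0.7),
                nrow = 2, byrow = TRUE,
                dimnames = list(NULL, c("class0", "class1")))
y_true <- c("class0", "class0")
# Probability assigned to the true class for each observation
p_true <- probs[cbind(seq_along(y_true), match(y_true, colnames(probs)))]
-log(p_true)  # 0.105 (confident and correct) vs. 1.204 (leaned the wrong way)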
Users can also define their own custom loss function by providing a custom_loss(predictions, data) function that returns a vector of individual losses. This allows evaluation of complex models with domain-specific metrics (more details in the Custom Model part below).
The caret_params list defines how the model should be trained using the caret package. It can include any argument supported by caret::train(), allowing full control over model specification and tuning. Common components include:
- method: the machine learning algorithm (e.g., "lm" for linear regression, "rf" for random forest, "xgbTree" for XGBoost, "svmLinear" for support vector machines)
- trControl: the resampling strategy (e.g., trainControl(method = "cv", number = 5) for 5-fold cross-validation, or "none" for training without resampling)
- tuneGrid: a grid of hyperparameters for tuning (e.g., expand.grid(mtry = c(2, 3, 4)))
- metric: the model selection metric used during tuning (e.g., "RMSE" or "Accuracy")
- preProcess: optional preprocessing steps (e.g., centering, scaling)
- importance: logical flag to compute variable importance (useful for tree-based models)
Refer to the full documentation at caret model training and tuning for the complete list of supported arguments and options.
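For instance, a more fully specified caret_params for a random forest might combine several of these components (a sketch; the particular values are illustrative):
# Illustrative caret_params combining tuning, preprocessing, and importance
caret_params_rf_tuned <- list(
  method     = "rf",
  trControl  = trainControl(method = "cv", number = 5),  # 5-fold CV
  tuneGrid   = expand.grid(mtry = c(2, 3, 4)),           # hyperparameter grid
  metric     = "RMSE",                                   # selection metric
  preProcess = c("center", "scale"),                     # preprocessing steps
  importance = TRUE                                      # variable importance
)
For the regression example in this section, a minimal specification suffices: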
# Define caret parameters: linear regression without resampling
caret_params_lm <- list(
  method = "lm",
  trControl = trainControl(method = "none")
)
# Run Cram ML with 5 batches and squared-error loss
result <- cram_ml(
  data = data_df,
  formula = Y ~ .,
  batch = 5,
  loss_name = "se",
  caret_params = caret_params_lm
)
print(result)
#> $raw_results
#> Metric Value
#> 1 Expected Loss Estimate 0.86429
#> 2 Expected Loss Standard Error 0.73665
#> 3 Expected Loss CI Lower -0.57952
#> 4 Expected Loss CI Upper 2.30809
#>
#> $interactive_table
#>
#> $final_ml_model
#> Linear Regression
#>
#> 100 samples
#> 3 predictor
#>
#> No pre-processing
#> Resampling: None
The cram_ml() function can also be used for classification tasks, whether predicting hard labels or class probabilities. This is controlled via the classify argument and loss_name. Below, we demonstrate two typical use cases.
Also note that all data inputs need to be numeric; hence, when Y is categorical, it should contain numeric values representing the class of each observation. There is no need to use the factor type with cram_ml().
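For example, labels stored as a factor or character vector can be converted to numeric class codes first (a minimal sketch):
# Encode categorical labels as numeric class codes
labels <- factor(c("spam", "ham", "spam"))
Y_numeric <- as.integer(labels) - 1L  # "ham" -> 0, "spam" -> 1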
In this case, the model outputs hard predictions (labels, e.g., 0, 1, 2), and the metric used is classification accuracy, i.e., the proportion of correctly predicted labels. This setup uses:
- loss_name = "accuracy"
- classProbs = FALSE in trainControl
- classify = TRUE in cram_ml()
set.seed(42)
# Generate binary classification dataset
X_data <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
Y_data <- rbinom(nrow(X_data), 1, 0.5)
data_df <- data.frame(X_data, Y = Y_data)
# Define caret parameters: predict labels (default behavior)
caret_params_rf <- list(
method = "rf",
trControl = trainControl(method = "none")
)
# Run CRAM ML with accuracy as loss
result <- cram_ml(
data = data_df,
formula = Y ~ .,
batch = 5,
loss_name = "accuracy",
caret_params = caret_params_rf,
classify = TRUE
)
print(result)
#> $raw_results
#> Metric Value
#> 1 Expected Loss Estimate 0.48750
#> 2 Expected Loss Standard Error 0.43071
#> 3 Expected Loss CI Lower -0.35668
#> 4 Expected Loss CI Upper 1.33168
#>
#> $interactive_table
#>
#> $final_ml_model
#> Random Forest
#>
#> 100 samples
#> 3 predictor
#> 2 classes: 'class0', 'class1'
#>
#> No pre-processing
#> Resampling: None
In this setup, the model outputs class probabilities, and the loss is evaluated using logarithmic loss (logloss), a standard metric for probabilistic classification. This setup uses:
- loss_name = "logloss"
- classProbs = TRUE in trainControl
- classify = TRUE in cram_ml()
set.seed(42)
# Generate binary classification dataset
X_data <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
Y_data <- rbinom(nrow(X_data), 1, 0.5)
data_df <- data.frame(X_data, Y = Y_data)
# Define caret parameters for probability output
caret_params_rf_probs <- list(
method = "rf",
trControl = trainControl(method = "none", classProbs = TRUE)
)
# Run CRAM ML with logloss as the evaluation loss
result <- cram_ml(
data = data_df,
formula = Y ~ .,
batch = 5,
loss_name = "logloss",
caret_params = caret_params_rf_probs,
classify = TRUE
)
print(result)
#> $raw_results
#> Metric Value
#> 1 Expected Loss Estimate 0.93225
#> 2 Expected Loss Standard Error 0.48118
#> 3 Expected Loss CI Lower -0.01085
#> 4 Expected Loss CI Upper 1.87534
#>
#> $interactive_table
#>
#> $final_ml_model
#> Random Forest
#>
#> 100 samples
#> 3 predictor
#> 2 classes: 'class0', 'class1'
#>
#> No pre-processing
#> Resampling: None
Together, these arguments allow users to apply cram_ml() to a wide variety of built-in machine learning models and losses. If you need to go beyond these built-in choices, the next section presents a friendly workflow for specifying custom models and losses with cram_ml().
In addition to using built-in learners via caret, cram_ml() also supports fully custom model workflows. You can specify your own:
- training function (custom_fit)
- prediction function (custom_predict)
- loss function (custom_loss)
This offers maximum flexibility, allowing Cram to evaluate any learning model with any performance criterion, including regression, classification, or even unsupervised losses such as clustering distance.
custom_fit(data, ...)
This function takes a data frame and returns a fitted model. You may define additional arguments such as hyperparameters or training settings.
- data: A data frame that includes both predictors and the outcome variable Y.
Example: A basic linear model fit on three predictors (a minimal sketch, consistent with the final model printed at the end of this section):
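custom_fit <- function(data) {
  # Fit a linear model on the three predictors
  lm(Y ~ x1 + x2 + x3, data = data)
}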
custom_predict(model, data)
This function generates predictions from the fitted model on new data. It returns a numeric vector of predicted outcomes.
- model: The fitted model returned by custom_fit()
- data: A data frame of new observations (typically including all original predictors)
Example: Extract predictors and apply a standard predict() call (a minimal sketch):
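custom_predict <- function(model, data) {
  # Keep only the predictor columns before calling predict()
  predictors <- data[, setdiff(names(data), "Y"), drop = FALSE]
  predict(model, newdata = predictors)
}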
custom_loss(predictions, data)
This function defines the loss metric used to evaluate model predictions. It should return a numeric vector of individual losses, one per observation. These are internally aggregated by cram_ml() to compute the overall performance.
- predictions: A numeric vector of predicted values from the model
- data: The data frame containing the true outcome values (Y)
Example: Define a custom loss function using squared error (a minimal sketch):
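custom_loss <- function(predictions, data) {
  # One squared-error loss per observation
  (predictions - data$Y)^2
}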
cram_ml() with Custom Functions
Once you have defined your custom training, prediction, and loss functions, you can pass them directly to cram_ml() as shown below. Note that caret_params and loss_name, which were used for the built-in functionality, are now NULL (their default) and are simply omitted:
set.seed(42)
# Regenerate the regression dataset from the first example
X_data <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
Y_data <- rnorm(100)
data_df <- data.frame(X_data, Y = Y_data)
result <- cram_ml(
data = data_df,
formula = Y ~ .,
batch = 5,
custom_fit = custom_fit,
custom_predict = custom_predict,
custom_loss = custom_loss
)
print(result)
#> $raw_results
#> Metric Value
#> 1 Expected Loss Estimate 0.86429
#> 2 Expected Loss Standard Error 0.73665
#> 3 Expected Loss CI Lower -0.57952
#> 4 Expected Loss CI Upper 2.30809
#>
#> $interactive_table
#>
#> $final_ml_model
#>
#> Call:
#> lm(formula = Y ~ x1 + x2 + x3, data = data)
#>
#> Coefficients:
#> (Intercept) x1 x2 x3
#> 0.031503 0.057754 0.008829 -0.031611