Safe formula-based wrapper around glmnet:
bring glm()’s strict modeling-workflow semantics — formula
+ data in, predict-time design matrix reconstructed from a frozen recipe
out — to L1 / L2 / elastic-net regularized GLMs.
glmnet is the de facto standard for regularized GLMs in
R, but its matrix-shaped API hands the full model recipe back to the
caller: factor encoding, contrasts, complete-case filtering, rank
deficiency, and predict-time design-matrix reconstruction. The recurring
failure modes are:
glmnet::predict() errors on a column-width
mismatch;options("contrasts") setting differs
between fit and predict, so the design matrix silently changes
meaning;0 — visually identical to a coefficient the L1 penalty
shrunk to zero;fbrglm brings stats::glm()’s strict
conventions to the glmnet engine: frozen terms
/ xlevels / contrasts on the fit, QR-pivot
rank-deficient column drop with NA in coef() /
summary(), glm-style error on novel test factor levels
(with an opt-in on_new_levels = "na" for production
scoring), explicit nobs_info complete-case counts, and an
explicit refusal to print classical SE / z / p / CI under
infer = "none". The underlying glmnet /
cv.glmnet calls are unchanged — predictions are
bit-identical to a hand-built raw-glmnet call across every
glmnet-supported family — and reachable through as_glmnet()
/ as_cv_glmnet() for downstream tooling.
infer = "none" only. Family coverage is:
Core supported (parity-checked against raw
glmnet): gaussian, binomial,
poisson, Gamma (via
stats::Gamma(link = "log")), negative binomial
(via MASS::negative.binomial(theta = ...) — fixed θ
only).
Experimental (basic fit / predict paths work,
breadth of usage not exhaustively tested): native Cox
(family = "cox" with
Surv(time, status) ~ ...), multinomial,
mgaussian.
Out of scope for the MVP: - Joint θ estimation in
the style of MASS::glm.nb(). - Cox-specific extras such as
strata, ties handling, and time-varying covariates have not been
validated. - infer = "split" and
infer = "selective" are planned but not implemented.
See TODO.md for the full backlog.
Recommended:
pak::pkg_install("dsc-chiba-u/fbrglm")Alternative:
remotes::install_github("dsc-chiba-u/fbrglm")
# or
devtools::install_github("dsc-chiba-u/fbrglm")library(fbrglm)
set.seed(1)
n <- 100
df <- data.frame(
y = rnorm(n),
x1 = rnorm(n),
x2 = rnorm(n)
)
fit <- fbrglm(y ~ x1 + x2, data = df,
family = "gaussian",
lambda = "cv_min")
coef(fit)
head(predict(fit, type = "response"))
nobs(fit)library(fbrglm)
set.seed(2)
n <- 200
df <- data.frame(
y = rbinom(n, 1, 0.5),
x1 = rnorm(n),
x2 = rnorm(n)
)
fit <- fbrglm(y ~ x1 + x2, data = df,
family = "binomial",
lambda = "cv_1se")
head(predict(fit, type = "response")) # probabilities in [0, 1]
head(predict(fit, type = "class")) # 0/1Factors are auto-dummied by model.matrix(). The training
factor levels are stored on the fit so
predict(newdata = ...) still works when some levels are
missing from the test data.
library(fbrglm)
set.seed(3)
n <- 200
train <- data.frame(
y = rnorm(n),
x1 = rnorm(n),
g = factor(sample(c("A", "B", "C"), n, replace = TRUE),
levels = c("A", "B", "C"))
)
fit <- fbrglm(y ~ x1 + g, data = train,
family = "gaussian",
lambda = "fix", lambda_value = 0.05)
# newdata missing level "C" — still works
test <- data.frame(
x1 = rnorm(10),
g = factor(rep(c("A", "B"), 5), levels = c("A", "B", "C"))
)
predict(fit, newdata = test, type = "response")By default, predict() errors on a factor value the model
has not seen — same as stats::predict.glm().
Production-style batch scoring can opt into a softer mode:
# `test_unseen$g` contains a new level "D" not in the training data;
# rows with that value get NA, the rest score normally, and a warning
# names how many rows were dropped.
predict(fit, newdata = test_unseen, type = "response",
on_new_levels = "na")lambda selectionfbrglm() exposes three rules through one argument:
lambda |
meaning | backend |
|---|---|---|
"cv_min" |
cv.glmnet()$lambda.min (default) |
glmnet::cv.glmnet() |
"cv_1se" |
cv.glmnet()$lambda.1se |
glmnet::cv.glmnet() |
"fix" |
uses lambda_value directly |
glmnet::glmnet(lambda = ...) |
lambda = "fix" requires lambda_value
(numeric). The numeric λ that the fit actually used is always available
as fit$lambda_value.
predict(type = ...)For the single-response GLM families (gaussian,
binomial, poisson, Gamma,
negative.binomial, cox):
type |
gaussian / Gamma / NB | binomial | poisson | cox |
|---|---|---|---|---|
"link" |
η | η (logit) | η (log) | η (linear predictor) |
"response" |
mean (link⁻¹η) | probability ∈ [0, 1] | rate, exp(η) |
relative risk, exp(η) |
"class" |
error | 0 / 1 (threshold 0.5) | error | error |
Cox predictions are on the linear-predictor / relative-risk scale,
not absolute hazards. fbrglm does not estimate the baseline hazard;
absolute survival curves require a separate baseline-hazard step (e.g.
via survival::basehaz() on a coxph fit, or an
equivalent Breslow estimate computed against the coxnet
linear predictors).
For the multi-response families:
multinomial: "link" /
"response" return an (n × k) matrix (one
column per class); "class" returns the argmax class
label.mgaussian: "link" /
"response" return an (n × q) matrix (one
column per response); "class" errors.See vignette("fbrglm-families") for worked examples.
nobs_infoComplete-case filtering happens automatically. The numbers are
exposed at fit$nobs_info:
fit$nobs_info$n_total # rows in the input data
fit$nobs_info$n_dropped_missing # rows with any NA in the model.frame
fit$nobs_info$n_used # rows actually fit
nobs(fit) # same as $n_usedIf any rows are dropped, fbrglm() prints a one-line
message.
infer = "none" for nowOnly infer = "none" is currently enabled, and
summary() deliberately does not report classical
SE, z, p-values, or confidence intervals: shrinkage bias,
data-driven λ selection, and active-set conditioning all break the
textbook interpretation of those quantities. The summary()
output instead carries a permanent footer naming the three failure modes
and the planned remediation paths (infer = "split" for
sample-split refits, infer = "selective" for selective
inference at the chosen λ). coef() returns the regularized
point estimates with NA for any column dropped by the
QR-pivot rank check; summary() adds the glm-style
“(N not defined because of singularities: ...)” header when
that happens, the complete-case nobs_info triple, and the
inference policy footer.
infer = "split" — data splitting with
selection_frac for honest post-selection SEs / p-values /
CIs via a base-R glm() refit.infer = "selective" — selective inference at the chosen
λ.MASS::glm.nb()-style).Full list and rationale: TODO.md.
vignette("fbrglm")) — formula
/ data, λ selection, predict(type = ...),
nobs_info, factor narrowing, offsets.vignette("fbrglm-families")) — worked examples for linear,
logistic, Poisson (with offset), Gamma, negative binomial (fixed θ),
native Cox (Surv() LHS), plus the experimental multinomial
and mgaussian paths. Two survival routes are shown explicitly: the
piecewise exponential Poisson model on long-format data
and the native Cox path via
family = "cox"; these are different models and the vignette
does not conflate them.Smoke tests and benchmarks live in a separate repository: https://github.com/dsc-chiba-u/fbrglm-experiments.
Currently it contains:
glmnet parity at fixed λ, cv.glmnet parity at
cv_min / cv_1se, factor newdata, complete-case
bookkeeping)Comparison methods covered in the small benchmarks:
glmnetglmnetUtilsparsnip / workflows with the
glmnet engineHeadline observations from the prediction-failure benchmark, in two factor-level scenarios:
fbrglm, the glmnet_raw_safe (manual re-level)
path, glmnetUtils with use.model.frame = TRUE,
and the parsnip / workflows pipeline all
succeed. Default glmnet_raw_naive and default
glmnetUtils fail with a column-width error.fbrglm’s default reproduces
stats::predict.glm()’s “factor has new levels” error
verbatim, as does glmnetUtils with
use.model.frame = TRUE and stats::glm()
itself. The default fast paths get the column-width error instead.
parsnip warns and silently coerces novel cells to the
reference level; fbrglm’s opt-in
on_new_levels = "na" is the only path that returns
NA at novel-level rows and finite predictions
elsewhere.MIT — see LICENSE.