fbrglm

Safe formula-based wrapper around glmnet: bring glm()’s strict modeling-workflow semantics — formula + data in, predict-time design matrix reconstructed from a frozen recipe out — to L1 / L2 / elastic-net regularized GLMs.

Why

glmnet is the de facto standard for regularized GLMs in R, but its matrix-shaped API hands the full model recipe back to the caller: factor encoding, contrasts, complete-case filtering, rank deficiency, and predict-time design-matrix reconstruction. The recurring failure modes are:

fbrglm brings stats::glm()’s strict conventions to the glmnet engine: frozen terms / xlevels / contrasts on the fit, QR-pivot rank-deficient column drop with NA in coef() / summary(), glm-style error on novel test factor levels (with an opt-in on_new_levels = "na" for production scoring), explicit nobs_info complete-case counts, and an explicit refusal to print classical SE / z / p / CI under infer = "none". The underlying glmnet / cv.glmnet calls are unchanged — predictions are bit-identical to a hand-built raw-glmnet call across every glmnet-supported family — and reachable through as_glmnet() / as_cv_glmnet() for downstream tooling.

Status

infer = "none" only. Family coverage is:

Core supported (parity-checked against raw glmnet): gaussian, binomial, poisson, Gamma (via stats::Gamma(link = "log")), negative binomial (via MASS::negative.binomial(theta = ...)fixed θ only).

Experimental (basic fit / predict paths work, breadth of usage not exhaustively tested): native Cox (family = "cox" with Surv(time, status) ~ ...), multinomial, mgaussian.

Out of scope for the MVP: - Joint θ estimation in the style of MASS::glm.nb(). - Cox-specific extras such as strata, ties handling, and time-varying covariates have not been validated. - infer = "split" and infer = "selective" are planned but not implemented.

See TODO.md for the full backlog.

Installation

Recommended:

pak::pkg_install("dsc-chiba-u/fbrglm")

Alternative:

remotes::install_github("dsc-chiba-u/fbrglm")
# or
devtools::install_github("dsc-chiba-u/fbrglm")

Quick start

Gaussian

library(fbrglm)

set.seed(1)
n <- 100
df <- data.frame(
    y  = rnorm(n),
    x1 = rnorm(n),
    x2 = rnorm(n)
)

fit <- fbrglm(y ~ x1 + x2, data = df,
              family = "gaussian",
              lambda = "cv_min")

coef(fit)
head(predict(fit, type = "response"))
nobs(fit)

Binomial

library(fbrglm)

set.seed(2)
n <- 200
df <- data.frame(
    y  = rbinom(n, 1, 0.5),
    x1 = rnorm(n),
    x2 = rnorm(n)
)

fit <- fbrglm(y ~ x1 + x2, data = df,
              family = "binomial",
              lambda = "cv_1se")

head(predict(fit, type = "response"))   # probabilities in [0, 1]
head(predict(fit, type = "class"))      # 0/1

Factor predictors

Factors are auto-dummied by model.matrix(). The training factor levels are stored on the fit so predict(newdata = ...) still works when some levels are missing from the test data.

library(fbrglm)

set.seed(3)
n <- 200
train <- data.frame(
    y  = rnorm(n),
    x1 = rnorm(n),
    g  = factor(sample(c("A", "B", "C"), n, replace = TRUE),
                levels = c("A", "B", "C"))
)
fit <- fbrglm(y ~ x1 + g, data = train,
              family = "gaussian",
              lambda = "fix", lambda_value = 0.05)

# newdata missing level "C" — still works
test <- data.frame(
    x1 = rnorm(10),
    g  = factor(rep(c("A", "B"), 5), levels = c("A", "B", "C"))
)
predict(fit, newdata = test, type = "response")

By default, predict() errors on a factor value the model has not seen — same as stats::predict.glm(). Production-style batch scoring can opt into a softer mode:

# `test_unseen$g` contains a new level "D" not in the training data;
# rows with that value get NA, the rest score normally, and a warning
# names how many rows were dropped.
predict(fit, newdata = test_unseen, type = "response",
        on_new_levels = "na")

API

lambda selection

fbrglm() exposes three rules through one argument:

lambda meaning backend
"cv_min" cv.glmnet()$lambda.min (default) glmnet::cv.glmnet()
"cv_1se" cv.glmnet()$lambda.1se glmnet::cv.glmnet()
"fix" uses lambda_value directly glmnet::glmnet(lambda = ...)

lambda = "fix" requires lambda_value (numeric). The numeric λ that the fit actually used is always available as fit$lambda_value.

predict(type = ...)

For the single-response GLM families (gaussian, binomial, poisson, Gamma, negative.binomial, cox):

type gaussian / Gamma / NB binomial poisson cox
"link" η η (logit) η (log) η (linear predictor)
"response" mean (link⁻¹η) probability ∈ [0, 1] rate, exp(η) relative risk, exp(η)
"class" error 0 / 1 (threshold 0.5) error error

Cox predictions are on the linear-predictor / relative-risk scale, not absolute hazards. fbrglm does not estimate the baseline hazard; absolute survival curves require a separate baseline-hazard step (e.g. via survival::basehaz() on a coxph fit, or an equivalent Breslow estimate computed against the coxnet linear predictors).

For the multi-response families:

See vignette("fbrglm-families") for worked examples.

Complete-case bookkeeping: nobs_info

Complete-case filtering happens automatically. The numbers are exposed at fit$nobs_info:

fit$nobs_info$n_total            # rows in the input data
fit$nobs_info$n_dropped_missing  # rows with any NA in the model.frame
fit$nobs_info$n_used             # rows actually fit
nobs(fit)                        # same as $n_used

If any rows are dropped, fbrglm() prints a one-line message.

Inference: only infer = "none" for now

Only infer = "none" is currently enabled, and summary() deliberately does not report classical SE, z, p-values, or confidence intervals: shrinkage bias, data-driven λ selection, and active-set conditioning all break the textbook interpretation of those quantities. The summary() output instead carries a permanent footer naming the three failure modes and the planned remediation paths (infer = "split" for sample-split refits, infer = "selective" for selective inference at the chosen λ). coef() returns the regularized point estimates with NA for any column dropped by the QR-pivot rank check; summary() adds the glm-style “(N not defined because of singularities: ...)” header when that happens, the complete-case nobs_info triple, and the inference policy footer.

Planned (not yet implemented)

Full list and rationale: TODO.md.

Vignettes

Reproducible experiments

Smoke tests and benchmarks live in a separate repository: https://github.com/dsc-chiba-u/fbrglm-experiments.

Currently it contains:

Comparison methods covered in the small benchmarks:

Headline observations from the prediction-failure benchmark, in two factor-level scenarios:

License

MIT — see LICENSE.