fbrglm is forfbrglm is a formula-based front-end for regularized
generalized linear models. Internally it delegates the fit to glmnet;
the wrapper’s job is to make the user-facing experience look like base
R’s glm() — a formula +
data.frame, automatic factor handling, complete-case
filtering, and the familiar S3 methods (print,
summary, coef, predict,
nobs, plot).
The MVP described here is infer = "none": it returns
regularized point estimates and does not report
classical standard errors, z values, p values, or confidence intervals.
Honest post-selection inference (via data splitting or selective
inference) is on the roadmap; see the package TODO.md.
n <- 150
dat <- data.frame(
y = rbinom(n, 1, 0.5),
x1 = rnorm(n),
x2 = rnorm(n),
x3 = rnorm(n)
)
fit <- fbrglm(y ~ x1 + x2 + x3, data = dat,
family = "binomial",
lambda = "cv_min")print() shows the call and the basics of the fit:
print(fit)
#> Formula-based regularized GLM (fbrglm)
#>
#> Call:
#> fbrglm(formula = y ~ x1 + x2 + x3, data = dat, family = "binomial",
#> lambda = "cv_min")
#>
#> Family: binomial
#> Alpha: 1
#> Lambda rule: cv_min
#> Lambda value: 0.05370831
#> Inference: none
#> Observations: total = 150, dropped = 0, used = 150
#> Non-zero terms: 0 / 3summary() returns a structured object that includes the
call, family, chosen λ, complete-case bookkeeping, and the (regularized)
coefficient vector with zeros included:
summary(fit)
#> fbrglm summary
#> ==============
#>
#> Call:
#> fbrglm(formula = y ~ x1 + x2 + x3, data = dat, family = "binomial",
#> lambda = "cv_min")
#>
#> Family: binomial
#> Lambda method: cv_min
#> Lambda value: 0.05370831
#> Inference: none
#>
#> Observations:
#> total = 150, dropped (missing) = 0, used = 150
#>
#> Coefficients:
#> Estimate
#> (Intercept) -0.08
#> x1 0.00
#> x2 0.00
#> x3 0.00
#>
#> Non-zero predictors (0): (none)
#>
#> Note: no standard errors, z-values, or p-values are reported under
#> infer = "none". Classical inference does not account for the
#> shrinkage bias from L1/L2 penalisation or for the data-driven
#> selection of lambda. Use infer = "split" / "selective" (planned)
#> for valid post-selection inference; see vignette("fbrglm").Coefficients and predictions follow the same shapes you’d expect from
glm():
coef(fit)
#> (Intercept) x1 x2 x3
#> -0.08004271 0.00000000 0.00000000 0.00000000
head(predict(fit, newdata = dat[1:5, ], type = "response"))
#> [1] 0.48 0.48 0.48 0.48 0.48A plot() method is registered; it delegates to
plot.cv.glmnet() when λ was chosen by cross-validation, and
to plot.glmnet() otherwise.
lambdaThere are three rules, exposed through a single argument:
fit_min <- fbrglm(y ~ x1 + x2 + x3, data = dat,
family = "binomial", lambda = "cv_min")
fit_1se <- fbrglm(y ~ x1 + x2 + x3, data = dat,
family = "binomial", lambda = "cv_1se")
fit_fix <- fbrglm(y ~ x1 + x2 + x3, data = dat,
family = "binomial",
lambda = "fix", lambda_value = 0.05)
c(cv_min = fit_min$lambda_value,
cv_1se = fit_1se$lambda_value,
fix = fit_fix$lambda_value)
#> cv_min cv_1se fix
#> 0.05370831 0.05370831 0.05000000"cv_min" and "cv_1se" go through
cv.glmnet(); "fix" skips CV and goes straight
to glmnet() at the supplied lambda_value. The
numeric λ actually used is always available at
fit$lambda_value.
Factor columns are auto-dummied via model.matrix(), and
the training factor levels are stored on the fit object so
predict(newdata = ...) can rebuild a design matrix that
matches the training column structure — even when some training levels
are missing from newdata.
n_train <- 200
train <- data.frame(
y = rbinom(n_train, 1, 0.5),
x1 = rnorm(n_train),
g = factor(sample(c("A", "B", "C"), n_train, replace = TRUE),
levels = c("A", "B", "C"))
)
fit_f <- fbrglm(y ~ x1 + g, data = train,
family = "binomial",
lambda = "fix", lambda_value = 0.05)
## newdata is missing level "C"
test <- data.frame(
x1 = rnorm(10),
g = factor(rep(c("A", "B"), 5), levels = c("A", "B", "C"))
)
head(predict(fit_f, newdata = test, type = "response"))
#> [1] 0.56 0.56 0.56 0.56 0.56 0.56fbrglm also tolerates the narrower case where
newdata’s factor has its levels narrowed
(not just its values): missing one-hot columns are padded with zeros
before being handed to glmnet.
fbrglm() drops rows with any NA from the
modelling frame, prints a one-line note, and records the counts on the
fit object under fit$nobs_info.
dat_na <- dat
dat_na$y[1:5] <- NA
fit_na <- fbrglm(y ~ x1 + x2 + x3, data = dat_na,
family = "binomial",
lambda = "fix", lambda_value = 0.05)
#> fbrglm: dropped 5 row(s) with missing values (145 / 150 kept).
fit_na$nobs_info
#> $n_total
#> [1] 150
#>
#> $n_dropped_missing
#> [1] 5
#>
#> $n_used
#> [1] 145
nobs(fit_na)
#> [1] 145offset at fit time goes through to
glmnet(); at predict time, pass newoffset of
matching length. With newdata = NULL the stored training
offset is reused; with newdata supplied, an explicit
newoffset is required.
n_off <- 80
dat_off <- data.frame(
y = rbinom(n_off, 1, 0.5),
x1 = rnorm(n_off),
x2 = rnorm(n_off)
)
fit_off <- fbrglm(y ~ x1 + x2, data = dat_off, family = "binomial",
offset = rep(0.2, n_off),
lambda = "fix", lambda_value = 0.05)
head(predict(fit_off, type = "response")) # reuses training offset
#> [1] 0.5625 0.5625 0.5625 0.5625 0.5625 0.5625
head(predict(fit_off, newdata = dat_off[1:5, ],
newoffset = rep(0.2, 5), type = "response"))
#> [1] 0.5625 0.5625 0.5625 0.5625 0.5625glmnet objectsIf you need to use a glmnet-specific tool, two accessors
get you out of the wrapper:
class(as_glmnet(fit_min))
#> [1] "lognet" "glmnet"
class(as_cv_glmnet(fit_min))
#> [1] "cv.glmnet"
class(as_glmnet(fit_fix))
#> [1] "lognet" "glmnet"
as_cv_glmnet(fit_fix) # NULL — no CV was run
#> NULLas_glmnet() returns the underlying glmnet
object (the $glmnet.fit slot when the wrapper used CV).
as_cv_glmnet() returns the cv.glmnet object,
or NULL for the "fix" λ path.
The MVP is deliberately narrow:
infer = "none" is implemented;
"split" and "selective" are planned but not in
this release.gaussian, binomial,
poisson only. multinomial and cox
will land later.x / y direct-matrix entry point is
reserved but not yet supported — supply formula +
data instead.glm()-style standard errors, z / p values,
and confidence intervals are intentionally not shown
for infer = "none". Doing so naively for regularized
estimators would be misleading; honest inference is the next
milestone.Reproducible benchmarks against raw glmnet,
glmnetUtils, and a parsnip /
workflows pipeline with the glmnet engine live
in a separate repository: https://github.com/dsc-chiba-u/fbrglm-experiments. In
the prediction-failure case (narrowed test factor levels), raw
glmnet built naively can fail; parsnip /
workflows succeeds but with higher runtime overhead than
fbrglm in the tested small-data setting. See the experiments repo for
the CSVs and figures.