| Type: | Package |
| Title: | Linear Regression with Missing Data |
| Version: | 0.0.1 |
| Description: | Provides methods for linear regression in the presence of missing data, including missingness in covariates and responses. The package implements two estimators: oss_estimator(), a low-dimensional semi-supervised method, and dantzig_missing(), a high-dimensional approach. The tuning parameter can be selected automatically via cv_dantzig_missing(). See Risebrow and Berrett (2026) <doi:10.48550/arXiv.2602.13729>. Optional support for the 'gurobi' optimizer via the 'gurobi' R package (available from Gurobi, see https://docs.gurobi.com/projects/optimizer/en/current/reference/r.html). |
| Imports: | MASS, stats, Rglpk, fastDummies, Rdpack |
| Suggests: | gurobi |
| RdMacros: | Rdpack |
| License: | MIT + file LICENSE |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.3.2.9000 |
| URL: | https://github.com/benrisebrow/LRMiss |
| BugReports: | https://github.com/benrisebrow/LRMiss/issues |
| NeedsCompilation: | no |
| Packaged: | 2026-02-17 10:26:59 UTC; u5646697 |
| Author: | Benedict Risebrow [aut, cre], Thomas Berrett [aut] |
| Maintainer: | Benedict Risebrow <Benedict.risebrow@warwick.ac.uk> |
| Repository: | CRAN |
| Date/Publication: | 2026-02-20 08:10:10 UTC |
Cross-validated Dantzig estimator with missing covariates
Description
Performs K-fold cross-validation for the Dantzig selector in linear regression models with missing covariates. The method optionally incorporates unlabelled covariate data to improve estimation of second-moment matrices. This function is based on Section 3 of Risebrow and Berrett (2026).
Usage
cv_dantzig_missing(
X, y, X_unlabeled = NULL,
lambdas = NULL, nlambda = 30, lambda_min_ratio = 1e-3,
K = 5, standardise = TRUE, gurobi = FALSE,
seed = 123, fold_ids = NULL, verbose = TRUE,
plot_path = TRUE
)
Arguments
X |
Labelled covariates. |
y |
Response variables for the labelled data. |
X_unlabeled |
Optional unlabelled covariates. |
lambdas |
Optional sequence of regularisation parameters. |
nlambda |
Number of lambdas to generate if lambdas is NULL. |
lambda_min_ratio |
Smallest lambda as a fraction of the largest. |
K |
Number of cross-validation folds. |
standardise |
Logical; if TRUE covariates are standardised. |
gurobi |
Logical; if TRUE uses Gurobi to solve the linear programs. |
seed |
Random seed for fold assignment. |
fold_ids |
Optional fold assignments for labelled or combined data. |
verbose |
Logical; print progress messages. |
plot_path |
Logical; if TRUE computes and plots the solution path. |
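As a rough illustration of how nlambda, lambda_min_ratio, K and seed could interact, the sketch below builds a log-spaced grid and a balanced random fold assignment in base R. The helper names, the value of lambda_max and the log spacing are assumptions made for illustration; the package may construct its grid and folds differently.

```r
# Hypothetical helpers for illustration; not part of LRMiss.
make_lambda_grid <- function(lambda_max, nlambda = 30, lambda_min_ratio = 1e-3) {
  # Log-spaced grid from lambda_max down to lambda_max * lambda_min_ratio
  exp(seq(log(lambda_max), log(lambda_max * lambda_min_ratio),
          length.out = nlambda))
}

make_fold_ids <- function(n, K = 5, seed = 123) {
  set.seed(seed)
  # Balanced assignment: fold sizes differ by at most one
  sample(rep_len(seq_len(K), n))
}

lambdas  <- make_lambda_grid(lambda_max = 1, nlambda = 20)
fold_ids <- make_fold_ids(n = 50, K = 5)
table(fold_ids)
```

Fixing the seed makes the fold split reproducible. A vector like fold_ids is the kind of object the fold_ids argument appears to expect, although its required format is not stated here.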
Details
For each candidate value of the regularisation parameter, the Dantzig selector is fitted using moment estimates computed from the training folds. Prediction performance is assessed on held-out folds via the maximum absolute moment mismatch. The tuning parameter is selected using both the minimum mean cross-validation score and the one-standard-error (1-SE) rule.
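The two selection rules can be sketched in base R from a matrix of fold-wise scores. The scores below are made up, and the 1-SE convention used here (largest lambda whose mean score lies within one standard error of the minimum) is the common one and is an assumption; the package may apply a different tie-breaking rule.

```r
# Made-up CV scores: rows are folds, columns are lambdas (largest lambda first)
lambdas <- c(1, 0.5, 0.25, 0.125)
cv_scores_matrix <- rbind(
  c(0.90, 0.45, 0.30, 0.60),
  c(1.00, 0.50, 0.50, 0.50),
  c(1.10, 0.40, 0.40, 0.70)
)

mean_scores <- colMeans(cv_scores_matrix)
se_scores   <- apply(cv_scores_matrix, 2, sd) / sqrt(nrow(cv_scores_matrix))

# Rule 1: lambda minimising the mean CV score
i_min <- which.min(mean_scores)
lambda_min_mean <- lambdas[i_min]

# Rule 2 (assumed convention): largest lambda whose mean score is within
# one standard error of the minimum mean score
threshold  <- mean_scores[i_min] + se_scores[i_min]
lambda_1se <- max(lambdas[mean_scores <= threshold])

c(lambda_min_mean = lambda_min_mean, lambda_1se = lambda_1se)
```

Under this convention the 1-SE choice is never smaller than the minimising lambda, so it favours the more heavily regularised fit.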
Value
A named list with the following components:
- lambdas
Numeric vector of tuning parameters used.
- cv_scores_matrix
Numeric matrix of cross-validation scores (folds × lambdas).
- mean_scores
Mean CV score for each lambda.
- se_scores
Standard error of CV scores for each lambda.
- lambda_min_mean
Lambda minimising mean CV score.
- lambda_1se
Lambda chosen by the 1-SE rule.
- beta_path
Optional coefficient path matrix (present if plot_path=TRUE).
- design_colnames
Optional design column names (matching beta_path rows).
- beta_est
Optional saved coefficient vector from full-data path.
- intercept_est
Optional saved intercept corresponding to beta_est.
References
Risebrow BM, Berrett TB (2026). “Semi-supervised linear regression with missing covariates.” arXiv:2602.13729.
Examples
set.seed(1)
n <- 50; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] + 0.5 * X[, 2] + rnorm(n)
X_unlabeled <- matrix(rnorm(100 * p), 100, p)
cv_fit <- cv_dantzig_missing(
X = X,
y = y,
X_unlabeled = X_unlabeled,
K = 5,
nlambda = 20
)
cv_fit$lambda_1se
Dantzig estimator with missing covariates
Description
High-dimensional linear regression estimator based on the Dantzig selector
that accommodates missing covariates and optionally leverages unlabelled
covariate data. This function is a user-facing wrapper that dispatches to
either a standardised or unstandardised implementation depending on the
value of standardise. This function is based on
Section 3 of Risebrow and Berrett (2026).
Usage
dantzig_missing(
X_labeled, y, X_unlabeled = NULL, lambda,
gurobi = FALSE, standardise = TRUE
)
Arguments
X_labeled |
Numeric matrix or data.frame of labelled covariates, with rows corresponding to observations and columns to covariates. Missing values are allowed. |
y |
Numeric response vector with one entry per row of X_labeled. |
X_unlabeled |
Optional numeric matrix or data.frame of unlabelled covariates. If supplied, these observations are used only for estimating second moments of the covariates and do not contribute to the response. |
lambda |
Positive numeric scalar giving the Dantzig regularisation parameter. |
gurobi |
Logical; if TRUE, the linear programs are solved using the gurobi optimizer (a valid Gurobi installation and license are required). If FALSE, the open-source solver from Rglpk is used instead. |
standardise |
Logical; if TRUE, covariates are standardised prior to estimation and the resulting coefficients are mapped back to the original scale with an intercept term returned. |
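For orientation, the generic Dantzig selector written in terms of moment estimates is shown below. Here \hat{\Sigma} stands for an estimate of the covariate second-moment matrix and \hat{\gamma} for an estimate of the covariate-response cross moment. These symbols, and the exact form of the constraint, are assumptions based on the standard Dantzig selector rather than taken from the package source.

```latex
\hat{\beta} \in \arg\min_{\beta \in \mathbb{R}^p} \|\beta\|_1
\quad \text{subject to} \quad
\|\hat{\gamma} - \hat{\Sigma}\beta\|_\infty \le \lambda .

% Writing \beta = u - v with u, v \ge 0 gives an equivalent linear program:
\min_{u, v \ge 0} \; \mathbf{1}^\top (u + v)
\quad \text{subject to} \quad
-\lambda \mathbf{1} \le \hat{\gamma} - \hat{\Sigma}(u - v) \le \lambda \mathbf{1} .
```

The second form is a linear program in (u, v), which is consistent with the gurobi argument selecting between two LP solvers.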
Details
Categorical covariates are internally dummy-encoded, with missing values
preserved. When standardise = TRUE, covariates are centred and scaled
using empirical means and standard deviations computed from the combined
labelled and unlabelled samples.
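The standardisation step described above can be sketched in base R as follows. The back-transformation of the coefficients assumes a fit of the form y ≈ y_mean + Z beta_std on the standardised scale; beta_std and y_mean are made-up placeholders, and none of this is taken from the package source.

```r
set.seed(1)
X_labeled   <- matrix(rnorm(20 * 3), 20, 3)
X_unlabeled <- matrix(rnorm(30 * 3), 30, 3)
X_labeled[sample(length(X_labeled), 6)] <- NA  # missing entries stay NA

# Centre and scale using moments from the combined sample, ignoring NAs
X_all   <- rbind(X_labeled, X_unlabeled)
centres <- colMeans(X_all, na.rm = TRUE)
scales  <- apply(X_all, 2, sd, na.rm = TRUE)
Z_labeled <- sweep(sweep(X_labeled, 2, centres, "-"), 2, scales, "/")

# Map coefficients from the standardised scale back to the original scale
# (beta_std and y_mean are placeholders, not real estimates)
beta_std  <- c(1.0, 0.5, 0.0)
y_mean    <- 2.0
beta_hat  <- beta_std / scales
intercept <- y_mean - sum(beta_hat * centres)
```

Using the combined sample for the centres and scales lets the unlabelled rows sharpen the moment estimates even though they carry no response.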
Value
A list with at least the following component:
- beta_hat
Numeric vector of estimated regression coefficients, with names corresponding to the encoded design matrix columns.
If standardise = TRUE, the list also contains:
- intercept
Numeric scalar giving the estimated intercept term.
References
Risebrow BM, Berrett TB (2026). “Semi-supervised linear regression with missing covariates.” arXiv:2602.13729.
Examples
set.seed(1)
n <- 50; p <- 5
X_full <- matrix(rnorm(n * p), n, p)
beta_true <- c(1, 0.5, rep(0, p - 2))
y <- X_full[, 1] * beta_true[1] + X_full[, 2] * beta_true[2] + rnorm(n)
# introduce missingness into covariates
X_miss <- X_full
X_miss[sample(length(X_miss), size = 0.1 * length(X_miss))] <- NA
# fit Dantzig estimator (example lambda; tune in practice)
fit <- dantzig_missing(
X_labeled = X_miss,
y = y,
lambda = 0.1,
standardise = TRUE
)
fit$beta_hat
Covariance estimator with missing data
Description
Estimates the covariance matrix of a design matrix in the presence of missing values. Each covariance entry is computed using all observations for which the corresponding pair of covariates is jointly observed.
Usage
estimate_cov_raw(X)
Arguments
X |
Numeric matrix (or object coercible to a matrix) containing covariates.
Rows correspond to observations and columns to variables. Missing values
(NA) are allowed. |
Details
Let X_{ij} denote the j-th covariate for observation i.
For each pair of variables (j, k), the covariance estimate is
\hat{\Sigma}_{jk} = \frac{1}{n_{jk}} \sum_{i : X_{ij}, X_{ik} \ \mathrm{observed}} X_{ij} X_{ik},
where n_{jk} is the number of observations for which both entries are
observed. If no such observations exist, the corresponding covariance entry
is set to NA.
This estimator is symmetric by construction and reduces to the usual sample second-moment matrix when the data contain no missing values.
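The formula above can be written in a few lines of base R. The function below is a hypothetical re-implementation for illustration (the name pairwise_second_moment is not part of the package) and may differ from estimate_cov_raw() in details such as input checking.

```r
# Hypothetical re-implementation for illustration; not the package source.
pairwise_second_moment <- function(X) {
  X <- as.matrix(X)
  obs <- !is.na(X)               # TRUE where an entry is observed
  X0 <- X
  X0[!obs] <- 0                  # zeroed entries drop out of the sums
  sums   <- crossprod(X0)        # sum of X_ij * X_ik over jointly observed rows
  counts <- crossprod(obs * 1)   # n_jk: number of jointly observed rows
  out <- sums / counts
  out[counts == 0] <- NA         # pairs that are never jointly observed
  out
}

set.seed(1)
X <- matrix(rnorm(25 * 5), 25, 5)
X_miss <- X
X_miss[sample(length(X_miss), 10)] <- NA
S <- pairwise_second_moment(X_miss)
```

With complete data, counts equals n in every entry, so the result coincides with crossprod(X) / nrow(X).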
Value
A numeric p x p matrix containing the estimated covariance matrix,
where p = ncol(X). Entries corresponding to variable pairs that are
never jointly observed are NA.
Examples
set.seed(1)
X <- matrix(rnorm(25 * 5), 25, 5)  # draw one value per cell so columns are not recycled copies
X[sample(length(X), 10)] <- NA
Sigma_hat <- estimate_cov_raw(X)
Sigma_hat
Linear regression with missing data
Description
Fits a linear regression model in the presence of missing covariates and/or missing responses using the OSS (Ordinary Semi-Supervised) estimator. This corresponds to Section 2 of Risebrow and Berrett (2026). The method exploits partially observed covariates and, optionally, unlabelled observations to improve estimation efficiency. If sufficient complete cases are present, the weights are estimated from them; otherwise, the weights are estimated using an initial consistent estimate. If sufficient unlabelled data are present, the covariance matrix is estimated exclusively from them; otherwise, the covariance is estimated elementwise.
Usage
oss_estimator(formula, data,
all_weights_one = FALSE,
crossfitting = FALSE)
Arguments
formula |
A model formula specifying the linear regression, e.g. y ~ x1 + x2. |
data |
A data.frame containing the variables in the model. Rows with missing responses are treated as unlabelled observations. |
all_weights_one |
Logical; if TRUE, all missingness-pattern weights are set to one, yielding an unweighted OSS estimator. |
crossfitting |
Logical; if TRUE, a two-fold cross-fitted version of the OSS estimator is used. |
Value
An invisible list with components:
- coef
Numeric vector of estimated regression coefficients.
- sigma2_hat
Estimated noise variance, or NA if not computed.
- weights
Named vector of weights associated with each missingness pattern.
- groups
Data.frame mapping labelled observations to missingness patterns.
- beta_cc
Complete-case coefficient estimates if used, otherwise NULL.
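To make the notions of unlabelled rows and missingness patterns concrete, the base R sketch below splits a small data frame by whether the response is observed and labels each labelled row with its covariate missingness pattern. The string encoding of the patterns and the layout of groups are assumptions for illustration; the groups component returned by oss_estimator() may be organised differently.

```r
dat <- data.frame(
  y  = c(1.0, NA, 2.3, 0.5, 1.7),
  x1 = c(0.2, 1.1, NA, -0.4, NA),
  x2 = c(NA, 0.3, 0.8, 1.5, -0.9)
)

labelled   <- dat[!is.na(dat$y), ]  # rows with an observed response
unlabelled <- dat[is.na(dat$y), ]   # rows with a missing response

# Encode each labelled row's covariate missingness as a string:
# "10" means x1 observed and x2 missing, "11" is a complete case
obs <- !is.na(labelled[, c("x1", "x2")])
pattern <- apply(obs, 1, function(r) paste(as.integer(r), collapse = ""))

groups <- data.frame(row = rownames(labelled), pattern = pattern)
table(groups$pattern)
```

Each distinct pattern would then receive its own weight, which is the role the weights component plays in the returned list.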
References
Risebrow BM, Berrett TB (2026). “Semi-supervised linear regression with missing covariates.” arXiv:2602.13729.
Examples
set.seed(1)  # make the simulated covariates reproducible
dat <- data.frame(
y = c(1.0, NA, 2.3, 0.5),
x1 = rnorm(4),
x2 = rnorm(4)
)
## Without cross-fitting
res <- oss_estimator(y ~ x1 + x2, dat)
## With cross-fitting
res_cf <- oss_estimator(y ~ x1 + x2, dat, crossfitting = TRUE)