BigDataStatMeth provides scalable statistical computing for matrices stored in HDF5 files. The package is designed as a two-level tool: it provides a standard R interface for users working with HDF5-backed matrices, and a reusable C++ infrastructure for developers implementing new block-wise statistical methods.
The R interface is based on `HDF5Matrix` objects and S3 methods, so users can work with familiar R calls such as `dim()`, `[`, `%*%`, `crossprod()`, `scale()`, `cor()`, `svd()`, `prcomp()`, `qr()`, `chol()`, and `solve()`. The C++ infrastructure provides classes and routines for managing HDF5 files, groups, and datasets, together with block-wise numerical methods that can be reused from Rcpp-based code.
```mermaid
flowchart LR
  subgraph R["R interface"]
    A["HDF5Matrix objects"] --> B["S3 generics: scale, crossprod, svd, qr, prcomp ..."]
  end
  subgraph CPP["C++ infrastructure"]
    C["C++ classes: files, groups, datasets"] --> D["Block-wise numerical routines"]
  end
  B --> D
  D --> E["HDF5 storage (on-disk matrices)"]
  C --> E
```
Most users will interact with the R/S3 interface. Developers can build on the C++ headers to extend the package with new HDF5-backed methods while retaining efficient execution through compiled code.
## Getting started with the HDF5Matrix/S3 interface

### Installation

```r
# From CRAN
install.packages("BigDataStatMeth")

# Development version from GitHub
# Install devtools if needed
install.packages("devtools")
devtools::install_github("isglobal-brge/BigDataStatMeth")
```

Dependencies:

- R packages: Matrix, RcppEigen, RSpectra
- System dependencies:
### Quick start

```r
library(BigDataStatMeth)

h5file <- tempfile(fileext = ".h5")
set.seed(1)
X <- matrix(rnorm(500 * 100), nrow = 500, ncol = 100)

# Write an in-memory matrix to HDF5
X_h5 <- hdf5_create_matrix(
  filename = h5file,
  dataset = "data/X",
  data = X,
  overwrite = TRUE
)

dim(X_h5)
colMeans(X_h5)

# Standard R operations on the HDF5-backed matrix
XtX_h5 <- crossprod(X_h5)
X_sc_h5 <- scale(X_h5)

# Decompositions
svd_res <- svd(X_h5, nu = 5, nv = 5, center = TRUE, scale = TRUE)
pca_res <- prcomp(X_h5, center = TRUE, scale. = TRUE, ncomponents = 5)

close(X_h5)
hdf5_close_all()
```

### Supported operations (HDF5Matrix/S3)

| Category | Representative calls |
|---|---|
| Core object handling | `hdf5_create_matrix()`, `hdf5_matrix()`, `dim()`, `nrow()`, `ncol()`, `is_open()`, `close()` |
| HDF5 inspection and I/O | `list_datasets()`, `hdf5_import()`, `hdf5_import_multiple()`, `as.matrix()`, `as.data.frame()` |
| Subsetting and assignment | `X[i, j]`, `X[i, j] <- value` |
| Dimension names | `rownames()`, `colnames()`, `dimnames()` |
| Element-wise arithmetic | `X + Y`, `X - Y`, `X * Y`, `X / Y` |
| Matrix algebra | `%*%`, `crossprod()`, `tcrossprod()`, `cbind()`, `rbind()` |
| Aggregations | `colSums()`, `rowSums()`, `colMeans()`, `rowMeans()`, `colVars()`, `rowVars()`, `colSds()`, `rowSds()`, `colMins()`, `rowMins()`, `colMaxs()`, `rowMaxs()` |
| Scalar summaries | `mean()`, `var()`, `sd()` |
| Normalization and transformations | `scale()`, `sweep()` |
| Correlation | `cor()` |
| Decompositions | `svd()`, `prcomp()`, `eigen()`, `pseudoinverse()` |
| Factorizations and solvers | `qr()`, `chol()`, `solve()` |
| Diagonal operations | `diag()`, `diag<-()`, `diag_op()`, `diag_scale()` |
| Split, reduce, and apply | `split_dataset()`, `split()`, `reduce()`, `apply_function()` |
| Resource management and options | `hdf5matrix_options()`, `show_hdf5matrix_options()`, `hdf5_close_all()` |
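A short sketch of how a few of these generics combine in practice. It assumes `X_h5` is an open HDF5Matrix created as in the quick-start example; all calls used are from the table above.

```r
# Sketch, not run: assumes X_h5 is an open HDF5Matrix (see the quick start)
sub <- X_h5[1:10, 1:5]       # subsetting reads only the requested block
mns <- colMeans(X_h5)        # column means computed block-wise on disk
X_c <- sweep(X_h5, 2, mns)   # center columns without loading X into memory
R   <- cor(X_h5)             # column-column correlation matrix
```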
### Low-level utilities (`bd*`)

Some utilities do not map directly to an existing base R generic and retain the `bd*` prefix. Examples include `bdCreate_hdf5_group()`, `bdmove_hdf5_dataset()`, and `bdWrite_hdf5_dimnames()`. These functions are part of the package API and are documented in their corresponding help pages.
### Options

Common settings for HDF5-backed computations can be configured with `hdf5matrix_options()`. These include parallel execution, the number of threads, the block size, and the HDF5 compression level.
```r
hdf5matrix_options(
  paral = TRUE,
  threads = 4L,
  block_size = 512L,
  compression = 6L
)
```

These settings are especially useful for operations dispatched through standard R generics, where the usual R call does not always expose all low-level execution parameters. Operation-specific parameters can also be passed directly when a method supports them (see `?svd.HDF5Matrix`, `?prcomp.HDF5Matrix`, `?qr.HDF5Matrix`).
## C++ infrastructure

The C++ API is a central part of BigDataStatMeth. The package exposes C++ classes for HDF5 files, groups, and datasets, and implements block-wise routines for matrix algebra, decompositions, and statistical operations. These are the same building blocks used internally by the R/S3 interface.

This design lets developers focus on the statistical or numerical method itself rather than reimplementing HDF5 file handling, block iteration, or data movement.
```cpp
#include <Rcpp.h>
#include "BigDataStatMeth.hpp"

using namespace BigDataStatMeth;

// [[Rcpp::export]]
void custom_analysis(std::string filename, std::string group, std::string dataset) {
    std::unique_ptr<BigDataStatMeth::hdf5Dataset> ds(
        new BigDataStatMeth::hdf5Dataset(filename, group, dataset, false));
    ds->openDataset();

    // Block-wise processing using BigDataStatMeth routines
    // ...

    // ds is automatically closed and released when it goes out of scope
}
```

See *Developing Methods* for complete examples in both R and C++.
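The block-wise idea behind these routines can be illustrated with a few lines of plain C++ (no HDF5 involved; `blockwise_crossprod` is a hypothetical sketch, not a package function): the crossproduct X^T X is accumulated from independent row-block contributions, which is what allows matrices larger than memory to be processed one block at a time.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of the block-wise pattern: C = X^T X accumulated one row block
// at a time, so only a single block ever needs to be resident in memory.
// X is row-major with dimensions n x p; the result C is p x p, row-major.
std::vector<double> blockwise_crossprod(const std::vector<double>& X,
                                        std::size_t n, std::size_t p,
                                        std::size_t block_rows) {
    std::vector<double> C(p * p, 0.0);
    for (std::size_t r0 = 0; r0 < n; r0 += block_rows) {
        // In BigDataStatMeth, this block would be read from the HDF5 dataset.
        std::size_t r1 = std::min(n, r0 + block_rows);
        for (std::size_t r = r0; r < r1; ++r)       // each row adds x_r x_r^T
            for (std::size_t i = 0; i < p; ++i)
                for (std::size_t j = 0; j < p; ++j)
                    C[i * p + j] += X[r * p + i] * X[r * p + j];
    }
    return C;
}
```

Because rows are always visited in the same order, the result does not depend on the block size; the block size only controls the memory/throughput trade-off, which is the knob the package exposes via `hdf5matrix_options()`.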
## Documentation

Comprehensive documentation is available at https://isglobal-brge.github.io/BigDataStatMeth/

```r
# List available vignettes
vignette(package = "BigDataStatMeth")

# View the main vignette
vignette("BigDataStatMeth")
```

## Use cases

BigDataStatMeth is designed for efficiency at scale and is suited for any analytical workflow that involves large matrix operations.
## Resource management

HDF5-backed objects keep file handles open while they are in use. Objects can be closed individually with `close()`, and all open HDF5 handles managed by the package can be closed with `hdf5_close_all()`.

```r
close(X_h5)
hdf5_close_all()
```

After calling `hdf5_close_all()`, HDF5-backed objects that were open should be reopened before being used again. Calling `gc()` may also help trigger R finalizers for objects that are no longer referenced.
## Citation

If you use BigDataStatMeth in your research, please cite:
Pelegri-Siso D, Gonzalez JR (2026). BigDataStatMeth: Statistical Methods
for Big Data Using Block-wise Algorithms and HDF5 Storage.
R package version 2.0.0, https://github.com/isglobal-brge/BigDataStatMeth
BibTeX entry:
```bibtex
@Manual{bigdatastatmeth,
  title  = {BigDataStatMeth: Statistical Methods for Big Data},
  author = {Dolors Pelegri-Siso and Juan R. Gonzalez},
  year   = {2026},
  note   = {R package version 2.0.0},
  url    = {https://github.com/isglobal-brge/BigDataStatMeth},
}
```

## Contributing

Contributions are welcome. Please:

1. Create a feature branch (`git checkout -b feature/new-feature`)
2. Commit your changes (`git commit -m 'Add new feature'`)
3. Push the branch (`git push origin feature/new-feature`)
4. Run `R CMD check` before submitting

## License

MIT License. See the LICENSE file for details.
## Authors

- Dolors Pelegri-Siso, Bioinformatics Research Group in Epidemiology (BRGE), ISGlobal (Barcelona Institute for Global Health)
- Juan R. Gonzalez, Bioinformatics Research Group in Epidemiology (BRGE), ISGlobal (Barcelona Institute for Global Health)

## Acknowledgements

Development of BigDataStatMeth was supported by ISGlobal and the Bioinformatics Research Group in Epidemiology (BRGE).