Quick start with radEmu

Sarah Teichman and Amy Willis

2026-03-20

Setup

First, we will install radEmu, if we haven’t already.

# if (!require("remotes", quietly = TRUE))
#     install.packages("remotes")
# 
# remotes::install_github("statdivlab/radEmu")

Next, we can load radEmu, as well as dplyr.

library(radEmu)
library(dplyr)

In this vignette we’ll explore a dataset published by Wirbel et al. (2019). This is a meta-analysis of case-control studies of colorectal cancer, meaning that Wirbel et al. collected raw sequencing data from studies other researchers conducted and re-analyzed it.

Wirbel et al. published two pieces of data we’ll focus on:

data("wirbel_sample")
# encode control as the baseline level
wirbel_sample$Group <- factor(wirbel_sample$Group, levels = c("CTR","CRC"))
dim(wirbel_sample)
#> [1] 566  14
data("wirbel_otu")
dim(wirbel_otu)
#> [1] 566 845

We can see that we have \(566\) samples and \(845\) mOTUs.

While in general we would fit a model to all mOTUs, we are going to subset to some specific genera for the purposes of this tutorial. Let’s look at Eubacterium, Porphyromonas, Faecalibacteria, and Fusobacterium for now.

mOTU_names <- colnames(wirbel_otu)
chosen_genera <- c("Eubacterium", "Faecalibacterium", "Fusobacterium", "Porphyromonas")
mOTU_name_df <- data.frame(name = mOTU_names) %>% 
  mutate(base_name = stringr::str_remove(mOTU_names, "unknown ") %>%
                      stringr::str_remove("uncultured ")) %>%
  mutate(genus_name = stringr::word(base_name, 1))
restricted_mOTU_names <- mOTU_name_df %>%
  filter(genus_name %in% chosen_genera) %>%
  pull(name)

Again, while we would generally fit a model using all of our samples, for this tutorial we are only going to consider data from a case-control study from China.

ch_study_obs <- which(wirbel_sample$Country %in% c("CHI"))

Next, we want to confirm that all samples have at least one non-zero count across the categories we’ve chosen and that all categories have at least one non-zero count across the samples we’ve chosen.

small_Y <- wirbel_otu[ch_study_obs, restricted_mOTU_names]
sum(rowSums(small_Y) == 0) # no samples have a count sum of 0 
#> [1] 0
sum(colSums(small_Y) == 0) # one category has a count sum of 0
#> [1] 1

category_to_rm <- which(colSums(small_Y) == 0)
small_Y <- small_Y[, -category_to_rm]
sum(colSums(small_Y) == 0)
#> [1] 0

Differential abundance analysis with radEmu

We now are ready to fit a radEmu model with the emuFit() function! This will take 1-2 minutes to run, so we suggest you start running it and then read below about the arguments!

mod <- emuFit(data = wirbel_sample[ch_study_obs, ],
              Y = small_Y,
              formula = ~ Group + Gender, 
              test_kj = data.frame(k = 2, j = 1))

Some of the important arguments for emuFit() are the following:

In the example above we only test the first taxon, because each test for a data set this size takes approximately 30 seconds. Typically you’ll want to test all taxa that you are interested in, so you may need to leave this running for a few hours, or check out the “parallel_radEmu” vignette to see how you could parallelize these score tests if you have access to additional computing resources.

Now that we’ve fit the model and run a robust score test for the first taxon, we can look at the results! The parameter estimates and any test results are in the coef part of the emuFit object.

head(mod$coef)
#>   covariate                                                category
#> 1  GroupCRC Fusobacterium nucleatum s. vincentii [ref_mOTU_v2_0754]
#> 2  GroupCRC  Fusobacterium nucleatum s. animalis [ref_mOTU_v2_0776]
#> 3  GroupCRC Fusobacterium nucleatum s. nucleatum [ref_mOTU_v2_0777]
#> 4  GroupCRC         Faecalibacterium prausnitzii [ref_mOTU_v2_1379]
#> 5  GroupCRC                  Eubacterium siraeum [ref_mOTU_v2_1387]
#> 6  GroupCRC                      Eubacterium sp. [ref_mOTU_v2_1395]
#>   category_num    estimate        se      lower     upper score_stat       pval
#> 1            1  1.49138209 0.8716988 -0.2171162 3.1998804    4.33152 0.03741283
#> 2            2  2.27272171 0.8101394  0.6848777 3.8605657         NA         NA
#> 3            3  3.02032947 1.0143668  1.0322070 5.0084519         NA         NA
#> 4            4 -0.35942098 0.3434753 -1.0326202 0.3137782         NA         NA
#> 5            5  0.03955907 0.4622464 -0.8664272 0.9455454         NA         NA
#> 6            6  1.19138324 0.9106672 -0.5934917 2.9762582         NA         NA

The first taxon in our model is Fusobacterium nucleatum s. vincentii, which has a log fold-difference estimate of \(1.49\). We will interpret this to mean that we expect that Fusobacterium nucleatum s. vincentii is \(exp(1.49) \approx 4.44\) times more abundant in cases of colorectal cancer compared to controls of the same gender, when compared to the typical fold-differences in abundances of the taxa in this analysis.

We could similarly interpret the log fold-differences for each taxon in our analysis.

For Fusobacterium nucleatum s. vincentii we have a robust score test p-value of \(0.04\). This means that we do have enough evidence to reject the null hypothesis that the fold-difference in abundance of Fusobacterium nucleatum s. vincentii between cases and controls is the same as the typical fold-difference in abundance between cases and controls across all taxa in this analysis. When we say ``typical” here we mean approximately the median fold-difference across taxa.

Now you are ready to start using radEmu! We recommend our other vignettes for a deeper look using radEmu, including for phyloseq or TreeSummarizedExperiment objects, with clustered data, and in parallel.