An Introduction to rgm

This vignette is designed to guide users through the process of simulating data and then running an experiment with the RGM package. We will start by demonstrating how to simulate data, which is a crucial preliminary step. Following this, we will showcase how to run the estimation process on the simulated data. The focus will be on interpreting various plots generated from the experiment, providing insights into the performance and accuracy of the RGM package.

Simulating Data

The first step in the analysis is to simulate data, which forms the basis for our subsequent experiment. The RGM package offers functionalities for simulating data from a random graphical model, mimicking real-world network structures that show similarities across a number of environments.

For the simulation of this experiment and using the terminology of human microbiota systems, we consider \(B=13\) distinct body sites, each representing a different environment, and \(p=87\) microbes identified as Operational Taxonomic Units (OTUs). For each environment \(k\), where \(k=1,\ldots,B\), let \(\mathbf{Y}^{(k)} = (Y^{(k)}_1, \ldots, Y^{(k)}_p)\) denote the \(p\)-dimensional random vector of OTU abundances. The relationship among these OTUs within each environment is modeled using the following Gaussian graphical model (the implementation allows for discrete marginal distributions, but we do not consider these in the simulation for simplicity): \[\begin{equation} \mathbf{Y}^{(k)} | G^{(k)} \sim \mathcal{N}_{p}(\mathbf{0},\boldsymbol{\Omega}^{(k)}), \end{equation}\] with \(\boldsymbol{\Omega}^{(k)}\) the precision matrix associated to condition \(k\).

We denote with \(G^{(k)}\) the conditional independence graph for environment \(k\). This is given by the non-zero pattern in \(\boldsymbol{\Omega}^{(k)}\). Then, the collection of graphs \(G = \{G^{(k)}\}_k\) across all environments is assumed to be distributed according to a random graph model. For the simulation, we consider the following latent probit model \[\begin{equation} \label{eq:latentprobit} P({G_{j_1,j_2}}^{(k)}=1~|~G_{j_1,j_2}^{(-k)}, \Theta, w)= \Phi\Big(\alpha_k+{w_{j_1,j_2}}\beta+\mathbf{c}_k^t\sum_{k' \ne k}\mathbf{c}_{k'}1_{\{{G_{j_1,j_2}}^{(k')}=1\}}\Big), \tag{1} \end{equation}\] with environment specific intercepts \(\alpha_k\), one edge covariate \(W\) and a 2D latent space for the environments. This is in general the model that is implemented in the package, but multiple edge covariates can also be considered.

Data from the model above, with \(n=346\) observations for each environment, is simulated by:

# Running the simulation with specific parameters
a <- rgm:::sim.rgm(p=27, B=5, n=146, mcmc_iter = 100, seed=1234)

Estimation

Given the simulated data, the next phase is to apply the RGM model to this data. This step involves estimating the network structure and relationships from the data we generated.


# Fitting the model
res <- rgm:::rgm(a$data, X=a$X, iter=10000)

This section provides an in-depth analysis of the simulation results using various plots. We will interpret the plots to understand the model’s behavior and accuracy.

Results

We proceed with the diagnostics plots


ps = rgm:::post_processing_rgm(simulated_data = a,results = res)

We first observe the convergence of \(\beta\) acriss the MCMC iterations

ps$beta_convergence

RGM Recovery Plot

The simulation study results are based on the analysis of the last 2500 MCMC iterations. We compare the true probit probabilities, as derived from Equation 1 using the true values of the parameters, with those calculated using the mean posterior estimates of the parameters \(\boldsymbol{\alpha}\), \(\beta\), and \(\mathbf{c}\).


ps$rgm_recovery

RGM Recovery Plot

Next, Receiver Operating Characteristic (ROC) curves compare the recovered graphs with the true graphs for each of the 13 environments. These curves are generated by varying thresholds on the inferred posterior edge probabilities. Different colors represent each of the 13 environments.

ps$roc_plot

ROC Plot

The following plot compares the true values of \(\alpha_k\) with the mean posterior estimates. The good agreement between the two shows how the procedure is able to recover the right level of sparsity of the inferred graphs in each environment.

ps$estimation_of_alpha

Estimation of Alpha

In the next plot, we visualize the posterior distribution of \(\beta\) and compare the mean posterior estimate with the true value via vertical lines.

ps$posterior_distribution

Posterior Distribution of Beta

The following heatmap visualizes the posterior edge probabilities for each edge and each environment, with rows and columns re-arranged via hierarchical clustering for a clearer pattern identification. Blue corresponds to probabilities close to 1, while red to probabilities close to zero.

ps$edge_prob

Edge Probability Heatmap

The graph shows the sparsity levels of the graphs as well as the similarities between the environments in terms of network structures