Introduction to the iimi package

Haochen Ning

2023-04-29

Introduction

This vignette aims to give an introduction on how to use the iimi package for plant virus diagnostics and how to visualize the coverage profile for the sample mapping. We also included a tutorial on creating unreliable regions.

First, we will load necessary packages before we start any analysis.

library(iimi)

library(Biostrings)

Note that we need sample mapping results to get started creating coverage profiles and feature-extracted data frame. We used Bowtie 2 to map the samples against the official Virtool virus data base. You can choose from Bowtie 2 or minimap2 since we have tried both and they yield similar result with minimap2 having a slight decrease. We let both software to report all alignments (-a mode for Bowtie 2, --secondary=yes for minimap2).

Converting BAM file(s) into coverage profiles and feature-extracted data frame

First, we convert the BAM file(s) into coverage profiles and feature-extracted data frame.

We will use the coverage profiles to visualize the mapping information. The feature-extracted data frame will be used in the model training and testing process.

Note that both training and testing data need to go through the conversion step. In our example, we stored the conversion for both the testing and training datasets in the same object. You can do the conversion separately for your data.

Important: the example code does not work unless your own path to the folder that stores your BAM files is provided.

  1. State the path to the folder of your BAM files. If you already have coverage profiles in run-length encoding (RLE) format, go to step 2.2.
   path_to_bamfiles <- list.files(
     path = path/to/your/BAM/files/folder,
     pattern = "bam$", full.names = TRUE, 
     include.dirs = TRUE
   )
  1. Create a data frame that contains the coverage profiles.

    1. Convert BAM files to a list of RLEs. You may skip this step if you already have converted them to RLE format.
    cov_info <- convert_bam_to_rle(bam_file = path_to_bamfiles)
    1. Convert the RLE list to a feature-extracted data frame. In this step, you have the option to use the provided mappability profile and nucleotide filtering mode. We recommend to enable the profiling and filtering step as it eliminates false peaks. Examples will be provided in the next section. If you wish to disable this mode, simpling set unreliable_region_enabled = FALSE. If you wish to enable this mode, you do not need extra codes.
    df <- convert_rle_to_df(covs = example_cov)

Visualizing the coverage profiles

Next, we can visualize the coverage profile by using the plot_cov() function.

covs_selected = list()
covs_selected$S1 <-
  example_cov$S1[c("4c559wtw", "2kiu3uzt", "z9hs8khm", "ka4xfvq7")]

oldpar <- par(mfrow = c(1,2))

par(mar = c(1, 2, 1, 1))
layout(matrix(c(1, 1, 2, 5, 5, 6, 3, 3, 4, 7, 7, 8), nrow = 6))
plot_cov(covs = covs_selected)

par(oldpar)

This gives us a general idea of what the potential viruses are.

Predicting the plant sample(s)

To make predictions, use the sample(s) that you wish to detect as the input.

After preparing your test sample, you can choose to test the data using our provided training model or the model you trained using train_iimi().

If you wish to use provided training model:

prediction_default <- predict_iimi(newdata = df)

The detection of your plant sample(s) is finished. The prediction is TRUE if virus infected the sample, FALSE if virus did not infect the sample.

Training your own model

If you would like to train your own model, you can follow the codes below to train a new model with your own data.

Ideally, the number of the samples used to train the model should be bigger than 100. However, since we are only providing a tutorial on how to use the train_iimi function, only eight samples are used to train the model.

First, we need to prepare our training data. We are using a 80/20 split to split the six samples.

# set seed
set.seed(123)

# spliting into 80-20 train and test data set with the ten plant samples
train_names <- sample(levels(as.factor(df$sample_id)),
                      length(unique(df$sample_id)) * 0.8)

# trian data
train_x = df[df$sample_id %in% train_names, ]

train_y = c()

for (ii in 1:nrow(train_x)) {
  train_y = append(train_y, example_diag[train_x$seg_id[ii],
                                         train_x$sample_id[ii]])
}

# test data
test_x = df[df$sample_id %in% train_names == F, ]

test_y = c()

for (ii in 1:nrow(train_x)) {
  test_y = append(test_y, example_diag[train_x$seg_id[ii],
                                       train_x$sample_id[ii]])
}

Then, we plug in the variables into the train_iimi function with the default XGBoost model. You may have other parameters in the

fit <- train_iimi(train_x = train_x, train_y = train_y)

Now, we have a trained model using the toy data.

Then, the process to detect which viruses infect the plant sample(s) is the same as previously described, except we are using a trained model.

prediction_customized <-
  predict_iimi(newdata = test_x,
               trained_model = fit)

The detection of the plant sample(s) is finished. The interpretation is the same as above.

Creating mappability profile and nucleotide filtering data frame

If you would like to create your own mappability profile and high nucleotide content regions, here is a short tutorial.

For the mappability profile:

First, split each of your virus segment into a sliding window series with window size of your choice and with step size 1. The default value for window size is 75.

Then, map one virus segment with each other, until you finish mapping it to all virus segments in the virus database. Also map the virus segment with a host genome of your choice. We chose to use Arabidopsis Thaliana.

After mapping, sort and index the resulted BAM files from the mapping step.

Next, it is time to assemble the mappability profile.

mappability_profile_virus <-
  create_mappability_profile(path/to/bam/files/folder/virus, category = "Unmappable region (virus)")
mappability_profile_host <-
  create_mappability_profile(path/to/bam/files/folder/host, category = "Unmappable region (host)")

For the high nucleotide content regions:

Creating the high nucleotide content regions is much easier than the mappability profile. We only need to use create_high_nucleotide_content() function.

Here is an example:

high_nucleotide_regions <- create_high_nucleotide_content()

The default threshold for GC content is 60% and is 45% for A%. The thresholds are changable.