Baseline models generate synthetic gene expression data from metadata alone. You describe the biological conditions—tissue type, disease state, perturbations, cell type, etc.—and the model generates realistic expression profiles matching those conditions.
This is the most common use case: generating synthetic data for conditions where real data may be scarce or unavailable.
gem-1-bulk: Bulk RNA-seq baseline model
gem-1-sc: Single-cell RNA-seq baseline model
The structure of the query required by the API is specific to each model. Use get_example_query() to get a correctly structured example for your chosen model.
# Get the example query structure for a specific model
example_query <- get_example_query(model_id = "gem-1-bulk")$example_query
# Inspect the query structure
str(example_query)
The query consists of:
sampling_strategy: The prediction mode that controls how expression data is generated ("sample generation" or "mean estimation").
inputs: A list of biological conditions to generate data for.
Each input contains metadata (describing the biological sample) and num_samples (how many samples to generate).
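To make the structure concrete, here is a minimal sketch of a query assembled by hand as a plain R list. The metadata values are illustrative only, and this text does not confirm which fields a given model requires, so get_example_query() remains the reliable source for the structure.

```r
# Hand-built query mirroring the structure described above.
# The metadata values are illustrative placeholders.
manual_query <- list(
  sampling_strategy = "mean estimation",
  inputs = list(
    list(
      metadata = list(
        sex = "female",
        sample_type = "primary tissue",
        tissue_ontology_id = "UBERON:0002371"
      ),
      num_samples = 3
    )
  )
)

# Total number of samples requested across all inputs
total_samples <- sum(vapply(manual_query$inputs, function(x) x$num_samples, numeric(1)))

# Inspect the nested list
str(manual_query)
```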
Once your query is ready, send it to the API to generate gene expression data:
# Create a query for the bulk model
query <- get_example_query(model_id = "gem-1-bulk")$example_query
# Submit and get results
result <- predict_query(query, model_id = "gem-1-bulk")
The result is a list containing two data frames: metadata and expression.
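As a rough sketch of working with the returned list, the example below uses a small mock result instead of a live API call. The layout is an assumption for illustration, not a documented guarantee: it supposes one row per generated sample in both data frames, genes as columns of expression, and the hypothetical column names shown.

```r
# Mock result standing in for the output of predict_query().
# Assumption: rows of `metadata` and `expression` describe the same samples.
mock_result <- list(
  metadata = data.frame(
    sample_id = c("s1", "s2"),  # hypothetical column
    sex = c("male", "female")
  ),
  expression = data.frame(
    GENE_A = c(10, 0),          # hypothetical gene columns
    GENE_B = c(5, 7)
  )
)

# Basic sanity checks before downstream analysis
stopifnot(is.data.frame(mock_result$metadata))
stopifnot(is.data.frame(mock_result$expression))
stopifnot(nrow(mock_result$metadata) == nrow(mock_result$expression))

# Per-sample total counts (the library size of each generated sample)
library_sizes <- rowSums(mock_result$expression)
```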
In addition to metadata, queries support several optional parameters that control the generation process.
sampling_strategy controls the type of prediction the model generates. This parameter is required in all queries.
Available modes:
“sample generation”: The model generates realistic-looking synthetic data that captures measurement error. This mode is useful when you want data that mimics real experimental measurements. (Bulk only)
“mean estimation”: The model creates a distribution capturing biological heterogeneity consistent with the supplied metadata, then returns the mean of that distribution. This mode is useful when you want a stable estimate of expected expression levels. (Bulk and single-cell)
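The mode availability listed above can be checked locally before a query is submitted. The helper below is a hypothetical convenience function, not part of the package; it only encodes the bulk and single-cell restrictions stated in this section.

```r
# Modes supported by each model, as listed above
supported_modes <- list(
  "gem-1-bulk" = c("sample generation", "mean estimation"),
  "gem-1-sc"   = c("mean estimation")
)

# Hypothetical helper: TRUE if the mode is listed for the given model
is_supported_mode <- function(model_id, sampling_strategy) {
  modes <- supported_modes[[model_id]]
  !is.null(modes) && sampling_strategy %in% modes
}

is_supported_mode("gem-1-bulk", "sample generation")
is_supported_mode("gem-1-sc", "sample generation")
```

Running a check like this first avoids a round trip to the API for a combination that this section already rules out.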
# Bulk query with sample generation
bulk_query <- get_example_query(model_id = "gem-1-bulk")$example_query
bulk_query$sampling_strategy <- "sample generation"
# Bulk query with mean estimation
bulk_query_mean <- get_example_query(model_id = "gem-1-bulk")$example_query
bulk_query_mean$sampling_strategy <- "mean estimation"
# Single-cell query (must use mean estimation)
sc_query <- get_example_query(model_id = "gem-1-sc")$example_query
sc_query$sampling_strategy <- "mean estimation" # Required for single-cell
Library size: used when converting predicted log CPM back to raw counts. Higher values scale counts up proportionally.
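The proportional scaling can be illustrated with a small worked example. This is a sketch of the standard counts-per-million relationship, not the API's internal code. It assumes the log transform is log1p (natural log of CPM + 1), which this text does not specify.

```r
# Assumption: log CPM values are log1p(CPM); the API's exact transform may differ.
log_cpm_to_counts <- function(log_cpm, library_size) {
  cpm <- expm1(log_cpm)        # invert log1p to recover CPM
  cpm * library_size / 1e6     # CPM -> counts at the given library size
}

# Three genes with CPM values of 100, 2500 and 0
log_cpm <- log1p(c(100, 2500, 0))

counts_1m <- log_cpm_to_counts(log_cpm, library_size = 1e6)
counts_2m <- log_cpm_to_counts(log_cpm, library_size = 2e6)
```

Under these assumptions, doubling the library size doubles every count, while the relative proportions between genes stay the same.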
Deterministic latents: If TRUE, the model uses the mean of each latent distribution (p(z|metadata)) instead of sampling. This removes randomness from latent sampling and produces deterministic outputs for the same inputs.
Default: FALSE (sampling is enabled)
Random seed: used for reproducibility when stochastic sampling is enabled.
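The difference between sampling with a fixed seed and using the latent mean can be shown with a toy stand-in. This sketch is not the model itself: it replaces p(z|metadata) with a one-dimensional normal distribution, and the function and argument names are hypothetical.

```r
# Toy stand-in for a latent distribution p(z | metadata)
latent_mean <- 1.5
latent_sd <- 0.3

draw_latent <- function(deterministic = FALSE, seed = NULL) {
  if (deterministic) return(latent_mean)  # use the mean, no randomness
  if (!is.null(seed)) set.seed(seed)      # fix the random stream
  rnorm(1, mean = latent_mean, sd = latent_sd)
}

# Same seed, same draw
same_seed <- identical(draw_latent(seed = 42), draw_latent(seed = 42))

# The deterministic path always returns the mean
deterministic_value <- draw_latent(deterministic = TRUE)
```

A seed makes a stochastic draw repeatable, whereas the deterministic option removes the draw altogether.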
You can combine multiple parameters in a single query:
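A minimal sketch of such a combined query follows. The field names total_count, deterministic_latents and seed are assumptions, because the parameter names are not given in this text; check the package reference for the actual names before relying on them. The query is built as a plain list so the example runs without an API call.

```r
# Base query built by hand; metadata values are illustrative placeholders
combined_query <- list(
  sampling_strategy = "mean estimation",
  inputs = list(
    list(
      metadata = list(
        sex = "male",
        sample_type = "primary tissue",
        tissue_ontology_id = "UBERON:0002371"
      ),
      num_samples = 5
    )
  )
)

# Assumption: optional parameters are top-level query fields with these names
combined_query$total_count <- 5e6             # assumed name for the library size
combined_query$deterministic_latents <- TRUE  # assumed name for the latent-mean switch
combined_query$seed <- 42                     # assumed name for the random seed

str(combined_query)
```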
The input metadata is a list of lists. Here is the full list of valid metadata keys:
age_years
cell_line_ontology_id
cell_type_ontology_id
developmental_stage
disease_ontology_id
ethnicity
genotype
race
sample_type (“cell line”, “organoid”, “other”, “primary cells”, “primary tissue”, “xenograft”)
sex (“male”, “female”)
tissue_ontology_id
perturbation_dose
perturbation_ontology_id
perturbation_time
perturbation_type (“coculture”, “compound”, “control”, “crispr”, “genetic”, “infection”, “other”, “overexpression”, “peptide or biologic”, “shrna”, “sirna”)
study (Bioproject ID)
library_selection (e.g., “cDNA”, “polyA”, “Oligo-dT”; see https://ena-docs.readthedocs.io/en/latest/submit/reads/webin-cli.html#permitted-values-for-library-selection)
library_layout (“PAIRED”, “SINGLE”)
platform (“illumina”)
The following are the valid values or expected formats for selected metadata keys:
| Metadata Field | Requirement / Example |
|---|---|
| cell_line_ontology_id | Requires a Cellosaurus ID. |
| cell_type_ontology_id | Requires a CL ID. |
| disease_ontology_id | Requires a MONDO ID. |
| perturbation_ontology_id | Must be a valid Ensembl gene ID (e.g., ENSG00000156127), ChEBI ID (e.g., CHEBI:16681), ChEMBL ID (e.g., CHEMBL1234567), or NCBI Taxonomy ID (e.g., 9606). |
| tissue_ontology_id | Requires a UBERON ID. |
We highly recommend using the EMBL-EBI Ontology Lookup Service to find valid IDs for your metadata.
Each model accepts only a limited range of values for each metadata field. If you provide a value outside that range, the API returns an error.
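Some mistakes can be caught locally before the API rejects a query. The sketch below checks metadata keys against the list above and applies simple prefix patterns to the ontology fields. The helper is hypothetical, and the regular expressions are assumptions based on the conventional ID formats of each ontology; they catch obvious typos only and do not tell you whether a value is in a model's accepted range.

```r
valid_keys <- c(
  "age_years", "cell_line_ontology_id", "cell_type_ontology_id",
  "developmental_stage", "disease_ontology_id", "ethnicity", "genotype",
  "race", "sample_type", "sex", "tissue_ontology_id", "perturbation_dose",
  "perturbation_ontology_id", "perturbation_time", "perturbation_type",
  "study", "library_selection", "library_layout", "platform"
)

# Assumed ID shapes; these are loose format checks, not ontology lookups
id_patterns <- c(
  cell_line_ontology_id    = "^CVCL_",
  cell_type_ontology_id    = "^CL:[0-9]+$",
  disease_ontology_id      = "^MONDO:[0-9]+$",
  tissue_ontology_id       = "^UBERON:[0-9]+$",
  perturbation_ontology_id = "^(ENSG[0-9]+|CHEBI:[0-9]+|CHEMBL[0-9]+|[0-9]+)$"
)

# Hypothetical helper: returns a character vector of problems (empty if none)
check_metadata <- function(metadata) {
  problems <- character(0)
  unknown <- setdiff(names(metadata), valid_keys)
  if (length(unknown) > 0) {
    problems <- c(problems, paste("unknown key:", unknown))
  }
  for (key in intersect(names(metadata), names(id_patterns))) {
    if (!grepl(id_patterns[[key]], metadata[[key]])) {
      problems <- c(problems, paste("unexpected format for", key))
    }
  }
  problems
}

check_metadata(list(sex = "male", tissue_ontology_id = "UBERON:0002371"))
check_metadata(list(tissue = "liver", disease_ontology_id = "DOID:1234"))
```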
You can customize the query inputs to fit your specific research needs:
# Get a base query
query <- get_example_query(model_id = "gem-1-bulk")$example_query
# Adjust number of samples for the first input
query$inputs[[1]]$num_samples <- 10
# Add a new condition
query$inputs[[length(query$inputs) + 1]] <- list(
  metadata = list(
    sex = "male",
    sample_type = "primary tissue",
    tissue_ontology_id = "UBERON:0002371"
  ),
  num_samples = 5
)
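When several conditions differ in only one field, the inputs can be generated programmatically instead of being written out one by one. The sketch below works on a hand-built base query so that it runs without an API call. The variable names are illustrative, and the metadata values reuse those from the example above.

```r
# Base query with no inputs yet; in practice this could come from get_example_query()
base_query <- list(sampling_strategy = "mean estimation", inputs = list())

# One input per value of the field that varies
sexes <- c("male", "female")
new_inputs <- lapply(sexes, function(s) {
  list(
    metadata = list(
      sex = s,
      sample_type = "primary tissue",
      tissue_ontology_id = "UBERON:0002371"
    ),
    num_samples = 5
  )
})

# Append the generated conditions to any existing inputs
base_query$inputs <- c(base_query$inputs, new_inputs)
length(base_query$inputs)
```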