Baseline Models

Creating a Query

The structure of the query required by the API is specific to each model. Use get_example_query() to get a correctly structured example for your chosen model.

# Get the example query structure for a specific model
example_query <- get_example_query(model_id = "gem-1-bulk")$example_query

# Inspect the query structure
str(example_query)

The query consists of:

- sampling_strategy: The prediction mode that controls how expression data is generated:
  - “sample generation”: Generates realistic-looking synthetic data with measurement error (bulk only)
  - “mean estimation”: Provides stable mean estimates of expression levels (bulk and single-cell)
- inputs: A list of biological conditions to generate data for

Each input contains metadata (describing the biological sample) and num_samples (how many samples to generate).
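
For orientation, a query with this shape can be written by hand as a nested R list. This is a minimal sketch: the field names come from the description above, but the metadata values are illustrative, and the structure returned by get_example_query() for your model remains the authoritative reference.

```r
# Hand-built sketch of a query; values are illustrative, not model defaults
manual_query <- list(
  sampling_strategy = "mean estimation",
  inputs = list(
    list(
      metadata = list(
        sex = "female",
        sample_type = "primary tissue",
        tissue_ontology_id = "UBERON:0002371"
      ),
      num_samples = 3
    )
  )
)

# Compare against the structure printed for example_query
str(manual_query)
```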

Making a Prediction

Once your query is ready, send it to the API to generate gene expression data:

# Create a query for the bulk model
query <- get_example_query(model_id = "gem-1-bulk")$example_query

# Submit and get results
result <- predict_query(query, model_id = "gem-1-bulk")

The result is a list containing two data frames: metadata and expression.

Single-Cell Example

# Create a query for the single-cell model
sc_query <- get_example_query(model_id = "gem-1-sc")$example_query

# Submit and get results
sc_result <- predict_query(sc_query, model_id = "gem-1-sc")

Note: Single-cell models only support "mean estimation" mode.

Query Parameters

In addition to metadata, queries accept several parameters that control the generation process. sampling_strategy is required; the others are optional.

sampling_strategy (character, required)

Controls the type of prediction the model generates. This parameter is required in all queries.

Available modes:

“sample generation”: The model generates realistic-looking synthetic data that captures measurement error. This mode is useful when you want data that mimics real experimental measurements. (Bulk only)
“mean estimation”: The model creates a distribution capturing biological heterogeneity consistent with the supplied metadata, then returns the mean of that distribution. This mode is useful when you want a stable estimate of expected expression levels. (Bulk and single-cell)

# Bulk query with sample generation
bulk_query <- get_example_query(model_id = "gem-1-bulk")$example_query
bulk_query$sampling_strategy <- "sample generation"

# Bulk query with mean estimation
bulk_query_mean <- get_example_query(model_id = "gem-1-bulk")$example_query
bulk_query_mean$sampling_strategy <- "mean estimation"

# Single-cell query (must use mean estimation)
sc_query <- get_example_query(model_id = "gem-1-sc")$example_query
sc_query$sampling_strategy <- "mean estimation" # Required for single-cell
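
Because single-cell models accept only "mean estimation", it can help to check a query locally before submitting it. The sketch below is one possible approach: check_sampling_strategy() and its modality argument are hypothetical and not part of the package, and the toy query stands in for a real one so the code runs without the API.

```r
# Hypothetical helper (not part of the package): fail early on an
# unsupported combination of model type and sampling strategy
check_sampling_strategy <- function(query, modality = c("bulk", "single-cell")) {
  modality <- match.arg(modality)
  allowed <- switch(
    modality,
    "bulk" = c("sample generation", "mean estimation"),
    "single-cell" = "mean estimation"
  )
  strategy <- query$sampling_strategy
  if (length(strategy) != 1 || !(strategy %in% allowed)) {
    stop(
      "sampling_strategy for ", modality, " models must be one of: ",
      paste(allowed, collapse = ", ")
    )
  }
  invisible(query)
}

# Stand-in query so the sketch runs without the API
toy_query <- list(sampling_strategy = "sample generation", inputs = list())

# Passes silently for a bulk model
check_sampling_strategy(toy_query, modality = "bulk")

# Raises an error for a single-cell model; capture it as FALSE
single_cell_ok <- tryCatch(
  {
    check_sampling_strategy(toy_query, modality = "single-cell")
    TRUE
  },
  error = function(e) FALSE
)
single_cell_ok
```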

total_count (integer, optional)

Library size used when converting predicted log CPM back to raw counts. Higher values scale counts up proportionally.

Default: 10,000,000 for bulk; 10,000 for single-cell

# Create a query and add custom total_count
query <- get_example_query(model_id = "gem-1-bulk")$example_query
query$total_count <- 5000000
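
To make the proportional scaling concrete, here is a minimal arithmetic sketch. It assumes the usual counts-per-million definition (counts = CPM / 1e6 * library size) and starts from CPM values rather than log CPM, because the exact log transform the API applies is not specified here. The CPM values and the cpm_to_counts() helper are made up for illustration.

```r
# Made-up CPM values for three genes (they sum to one million)
cpm <- c(gene_a = 100, gene_b = 250000, gene_c = 749900)

# Assumed conversion: counts = CPM / 1e6 * total_count
cpm_to_counts <- function(cpm, total_count) cpm / 1e6 * total_count

counts_default <- cpm_to_counts(cpm, total_count = 10000000)
counts_custom  <- cpm_to_counts(cpm, total_count = 5000000)

# Halving total_count halves every count
counts_custom / counts_default
```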

deterministic_latents (logical, optional)

If TRUE, the model uses the mean of each latent distribution (p(z|metadata)) instead of sampling. This removes randomness from latent sampling and produces deterministic outputs for the same inputs.

Default: FALSE (sampling is enabled)

# Create a query and enable deterministic latents
query <- get_example_query(model_id = "gem-1-bulk")$example_query
query$deterministic_latents <- TRUE

seed (integer, optional)

Random seed for reproducibility when using stochastic sampling.

# Create a query with a specific seed
query <- get_example_query(model_id = "gem-1-bulk")$example_query
query$seed <- 42
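
The difference between seeded sampling and deterministic latents can be illustrated with a local toy that does not call the API. It assumes a one-dimensional normal latent with made-up mean and standard deviation; draw_latent() is a hypothetical stand-in and says nothing about how the model itself is implemented.

```r
# Toy stand-in for p(z | metadata): a normal distribution with made-up parameters
latent_mean <- 2.0
latent_sd   <- 0.5

# Hypothetical helper: return the mean when deterministic, otherwise sample
draw_latent <- function(deterministic = FALSE, seed = NULL) {
  if (deterministic) return(latent_mean)
  if (!is.null(seed)) set.seed(seed)
  rnorm(1, mean = latent_mean, sd = latent_sd)
}

# Same seed, same draw
seeded_equal <- identical(draw_latent(seed = 42), draw_latent(seed = 42))

# Deterministic mode returns the mean regardless of the random state
deterministic_equal <- identical(
  draw_latent(deterministic = TRUE),
  draw_latent(deterministic = TRUE)
)

c(seeded = seeded_equal, deterministic = deterministic_equal)
```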

Combining Parameters

You can combine multiple parameters in a single query:

# Create a query and add multiple parameters
query <- get_example_query(model_id = "gem-1-bulk")$example_query
query$total_count <- 8000000
query$deterministic_latents <- TRUE
query$sampling_strategy <- "mean estimation"

results <- predict_query(query, model_id = "gem-1-bulk")
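
If you prefer to set several parameters in one step, base R's utils::modifyList() can merge a list of overrides into a query. The following is a minimal sketch of that idiom, using a hand-built stand-in query with illustrative values so it runs without the API.

```r
# Stand-in for an example query (illustrative values)
base_query <- list(
  sampling_strategy = "sample generation",
  inputs = list(list(metadata = list(sex = "male"), num_samples = 2))
)

# Overrides to merge: existing fields are replaced, new ones are added
overrides <- list(
  total_count = 8000000,
  deterministic_latents = TRUE,
  sampling_strategy = "mean estimation"
)
combined_query <- utils::modifyList(base_query, overrides)

str(combined_query)
```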

Valid Metadata Keys

Each input's metadata is a named list. Here is the full list of valid metadata keys:

Biological

age_years
cell_line_ontology_id
cell_type_ontology_id
developmental_stage
disease_ontology_id
ethnicity
genotype
race
sample_type (“cell line”, “organoid”, “other”, “primary cells”, “primary tissue”, “xenograft”)
sex (“male”, “female”)
tissue_ontology_id

Perturbational

perturbation_dose
perturbation_ontology_id
perturbation_time
perturbation_type (“coculture”, “compound”, “control”, “crispr”, “genetic”, “infection”, “other”, “overexpression”, “peptide or biologic”, “shrna”, “sirna”)

Technical

study (Bioproject ID)
library_selection (e.g., “cDNA”, “polyA”, “Oligo-dT” - see https://ena-docs.readthedocs.io/en/latest/submit/reads/webin-cli.html#permitted-values-for-library-selection)
library_layout (“PAIRED”, “SINGLE”)
platform (“illumina”)
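
A misspelled key is easy to miss, so a local check against the key lists above may be useful before submitting. This is a sketch only: unknown_metadata_keys() is a hypothetical helper, not a package function, and the key vector is copied by hand from the lists above.

```r
# Valid metadata keys, copied from the lists above
valid_metadata_keys <- c(
  "age_years", "cell_line_ontology_id", "cell_type_ontology_id",
  "developmental_stage", "disease_ontology_id", "ethnicity", "genotype",
  "race", "sample_type", "sex", "tissue_ontology_id",
  "perturbation_dose", "perturbation_ontology_id", "perturbation_time",
  "perturbation_type",
  "study", "library_selection", "library_layout", "platform"
)

# Hypothetical helper: return any keys that are not in the valid set
unknown_metadata_keys <- function(metadata) {
  setdiff(names(metadata), valid_metadata_keys)
}

toy_metadata <- list(sex = "male", tissue = "UBERON:0002371")
unknown_metadata_keys(toy_metadata)  # "tissue" (the valid key is tissue_ontology_id)
```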

Valid Metadata Values

The following are the valid values or expected formats for selected metadata keys:

Metadata Field	Requirement / Example
`cell_line_ontology_id`	Requires a Cellosaurus ID.
`cell_type_ontology_id`	Requires a CL ID.
`disease_ontology_id`	Requires a MONDO ID.
`perturbation_ontology_id`	Must be a valid Ensembl gene ID (e.g., `ENSG00000156127`), ChEBI ID (e.g., `CHEBI:16681`), ChEMBL ID (e.g., `CHEMBL1234567`), or NCBI Taxonomy ID (e.g., `9606`).
`tissue_ontology_id`	Requires a UBERON ID.

We highly recommend using the EMBL-EBI Ontology Lookup Service to find valid IDs for your metadata.

Each model accepts only a limited set of values for each metadata field. If you provide a value outside that set, the API returns an error.
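
A quick format check can catch malformed perturbation IDs before a request is sent. The patterns below are inferred from the example IDs in the table and check shape only; they cannot tell whether an ID exists or whether a given model accepts it. looks_like_perturbation_id() is a hypothetical helper.

```r
# Patterns inferred from the example IDs in the table; shape checks only
perturbation_id_patterns <- c(
  ensembl_gene  = "^ENSG[0-9]{11}$",
  chebi         = "^CHEBI:[0-9]+$",
  chembl        = "^CHEMBL[0-9]+$",
  ncbi_taxonomy = "^[0-9]+$"
)

# Hypothetical helper: TRUE if the ID matches any of the patterns
looks_like_perturbation_id <- function(id) {
  any(vapply(perturbation_id_patterns, grepl, logical(1), x = id))
}

ids <- c("ENSG00000156127", "CHEBI:16681", "CHEMBL1234567", "9606", "chebi16681")
shape_ok <- vapply(ids, looks_like_perturbation_id, logical(1))
shape_ok
```

In this sketch the last ID fails because the match is case-sensitive and the colon is missing.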

Modifying Query Inputs

You can customize the query inputs to fit your specific research needs:

# Get a base query
query <- get_example_query(model_id = "gem-1-bulk")$example_query

# Adjust number of samples for the first input
query$inputs[[1]]$num_samples <- 10

# Add a new condition at the end of the inputs list
query$inputs[[length(query$inputs) + 1]] <- list(
  metadata = list(
    sex = "male",
    sample_type = "primary tissue",
    tissue_ontology_id = "UBERON:0002371"
  ),
  num_samples = 5
)
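
After editing inputs, it can be worth confirming how many conditions and samples the query now requests. A minimal sketch, using a hand-built stand-in query with illustrative values so it runs without the API:

```r
# Stand-in query with two conditions (illustrative values)
toy_query <- list(
  sampling_strategy = "mean estimation",
  inputs = list(
    list(metadata = list(sex = "female"), num_samples = 10),
    list(metadata = list(sex = "male"), num_samples = 5)
  )
)

# Number of conditions and samples requested per condition
n_conditions <- length(toy_query$inputs)
samples_per_condition <- vapply(
  toy_query$inputs,
  function(input) input$num_samples,
  numeric(1)
)

n_conditions                 # 2
sum(samples_per_condition)   # 15
```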

Working with Results

# Access metadata and expression matrices
metadata <- result$metadata
expression <- result$expression

# Check dimensions
dim(expression)

# View metadata sample
head(metadata)

You can save the results for later use or export them as CSV files:

# Save results to RDS file
saveRDS(result, "synthesize_results.rds")

# Load previously saved results
result <- readRDS("synthesize_results.rds")

# Export as CSV
write.csv(result$expression, "expression_matrix.csv")
write.csv(result$metadata, "sample_metadata.csv")
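
To confirm that a saved result reloads unchanged, you can compare the original and restored objects. The sketch below uses a made-up stand-in for a prediction result and a temporary file, so it runs without the API and leaves nothing behind.

```r
# Stand-in for a prediction result (made-up values)
toy_result <- list(
  metadata = data.frame(sex = c("male", "female")),
  expression = data.frame(GENE_A = c(10, 0), GENE_B = c(3, 7))
)

# Write to a temporary file, read it back, and compare
rds_path <- tempfile(fileext = ".rds")
saveRDS(toy_result, rds_path)
restored <- readRDS(rds_path)

round_trip_ok <- isTRUE(all.equal(toy_result, restored))
round_trip_ok

# Clean up the temporary file
unlink(rds_path)
```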