Reference Conditioning

Overview

Reference conditioning models generate expression data conditioned on a real reference sample. This allows you to “anchor” to an existing expression profile while applying perturbations or modifications.

This is useful when you want to:

Available Models

Note: These endpoints may require 1-2 minutes of startup time if they have been scaled down. Plan accordingly for interactive use.

library(rsynthbio)

How It Works

Reference conditioning encodes the biological and technical characteristics from a real expression sample, then generates new expression data that:

  1. Preserves the biological/technical latent space of the reference
  2. Applies any perturbation metadata you specify
  3. Returns synthetic expression that reflects the perturbation effect on that specific sample

Creating a Query

Reference conditioning queries require different inputs than baseline models:

# Get the example query structure
example_query <- get_example_query(model_id = "gem-1-bulk_reference-conditioning")$example_query

# Inspect the query structure
str(example_query)

The query structure includes:

  1. inputs: A list where each input contains:

    • counts: The reference expression counts (a named list with a counts vector)
    • metadata: Perturbation-only metadata (see below)
    • num_samples: How many samples to generate
  2. conditioning: Which latent spaces to condition on (typically ["biological", "technical"])

  3. sampling_strategy: "mean estimation" or "sample generation"
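
As a rough sketch of the structure described above, the same fields can be assembled by hand as a nested R list. The placeholder counts vector (five zeros) is purely illustrative: a real query needs a vector matching the model's gene order and length, so starting from get_example_query() remains the safer route. How the list is serialized when submitted is not covered here.

```r
# Illustrative only: a hand-built list mirroring the fields listed above.
# The counts values are placeholders, not real expression data.
placeholder_counts <- c(0, 0, 0, 0, 0)

query_sketch <- list(
  inputs = list(
    list(
      counts = list(counts = placeholder_counts),
      metadata = list(
        perturbation_ontology_id = "ENSG00000141510",
        perturbation_type = "crispr"
      ),
      num_samples = 1
    )
  ),
  conditioning = list("biological", "technical"),
  sampling_strategy = "mean estimation"
)

# Show the top two levels of the nested structure
str(query_sketch, max.level = 2)
```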

Perturbation-Only Metadata

Unlike baseline models, reference conditioning queries only accept perturbation metadata fields:

  • perturbation_ontology_id
  • perturbation_type
  • perturbation_time
  • perturbation_dose

All other biological and technical metadata is inferred from the reference expression.

Example: Simulating a Drug Treatment

Here’s a complete example simulating a drug treatment effect on a reference sample:

# Start with example query structure
query <- get_example_query(model_id = "gem-1-bulk_reference-conditioning")$example_query

# Replace with your actual reference counts
# The counts vector must match the model's expected gene order and length
query$inputs[[1]]$counts <- list(counts = your_reference_counts)

# Specify the perturbation
query$inputs[[1]]$metadata <- list(
  perturbation_ontology_id = "CHEMBL25", # Aspirin (ChEMBL ID)
  perturbation_type = "compound",
  perturbation_time = "24h",
  perturbation_dose = "10uM"
)

query$inputs[[1]]$num_samples <- 3

# Set the sampling strategy
query$sampling_strategy <- "mean estimation"

# Submit the query
result <- predict_query(query, model_id = "gem-1-bulk_reference-conditioning")

Example: CRISPR Knockout Simulation

Simulate the effect of knocking out a specific gene:

query <- get_example_query(model_id = "gem-1-bulk_reference-conditioning")$example_query

# Your reference sample counts
query$inputs[[1]]$counts <- list(counts = control_sample_counts)

# CRISPR knockout of TP53
query$inputs[[1]]$metadata <- list(
  perturbation_ontology_id = "ENSG00000141510", # TP53 Ensembl ID
  perturbation_type = "crispr"
)

query$inputs[[1]]$num_samples <- 5

result <- predict_query(query, model_id = "gem-1-bulk_reference-conditioning")

Query Parameters

conditioning (list, optional)

Controls which latent spaces are conditioned on the reference. Default is ["biological", "technical"].

When both are conditioned, the model preserves both biological identity and technical characteristics from the reference sample.

sampling_strategy (character, required)

Controls the type of prediction. Accepted values are "mean estimation" and "sample generation":

query$sampling_strategy <- "mean estimation"

fixed_total_count (logical, optional)

Controls whether to preserve the reference’s library size:

# Preserve reference library size (default)
query$fixed_total_count <- FALSE

# Or force a specific library size
query$fixed_total_count <- TRUE
query$total_count <- 10000000

total_count (integer, optional)

Library size used when converting predicted log CPM back to raw counts. Only effective when fixed_total_count = TRUE.
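
To illustrate what a library size means in this conversion, the sketch below scales CPM values to counts for a given total_count. It assumes the log CPM values were produced with log1p (natural log of CPM + 1). The exact transform the model uses is not stated here, so treat log_cpm_to_counts as a hypothetical helper and not the service's implementation.

```r
# Hypothetical helper: convert log CPM to counts for a chosen library size.
# ASSUMPTION: log_cpm = log1p(CPM); the real model's transform may differ.
log_cpm_to_counts <- function(log_cpm, total_count) {
  cpm <- expm1(log_cpm)            # invert the assumed log1p transform
  cpm <- cpm / sum(cpm) * 1e6      # renormalize so CPM sums to one million
  round(cpm / 1e6 * total_count)   # scale to the requested library size
}

# Toy values for three made-up genes
toy_log_cpm <- log1p(c(geneA = 500000, geneB = 300000, geneC = 200000))
log_cpm_to_counts(toy_log_cpm, total_count = 10000000)
```

Because each gene is rounded independently, the resulting counts may not sum exactly to total_count for real data.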

deterministic_latents (logical, optional)

If TRUE, the model uses the mean of each latent distribution (p(z|metadata) for perturbation, q(z|x) for conditioned components) instead of sampling. This produces deterministic, reproducible outputs.

query$deterministic_latents <- TRUE

seed (integer, optional)

Random seed for reproducibility.

query$seed <- 42

Valid Perturbation Metadata

Field                     Description / Format
perturbation_ontology_id  Ensembl gene ID (e.g., ENSG00000141510), ChEBI ID, ChEMBL ID, or NCBI Taxonomy ID
perturbation_type         One of: "coculture", "compound", "control", "crispr", "genetic", "infection", "other", "overexpression", "peptide or biologic", "shrna", "sirna"
perturbation_time         Time since perturbation (e.g., "24h", "48h")
perturbation_dose         Dose of perturbation (e.g., "10uM", "1mg/kg")
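
A local check before submitting can catch unsupported fields early. The sketch below is one possible approach: check_perturbation_metadata is a hypothetical helper (not part of rsynthbio), and the allowed values are copied from the list above. It does not validate ontology IDs, times, or doses.

```r
# Field names and perturbation types copied from the list above
valid_fields <- c("perturbation_ontology_id", "perturbation_type",
                  "perturbation_time", "perturbation_dose")
valid_types <- c("coculture", "compound", "control", "crispr", "genetic",
                 "infection", "other", "overexpression",
                 "peptide or biologic", "shrna", "sirna")

# Hypothetical helper: returns a character vector of problems (empty if none)
check_perturbation_metadata <- function(metadata) {
  problems <- character(0)
  unknown <- setdiff(names(metadata), valid_fields)
  if (length(unknown) > 0) {
    problems <- c(problems, paste("Unknown field:", unknown))
  }
  type <- metadata$perturbation_type
  if (!is.null(type) && !(type %in% valid_types)) {
    problems <- c(problems, paste("Invalid perturbation_type:", type))
  }
  problems
}

# cell_type is not a perturbation field, so it is reported
check_perturbation_metadata(list(
  perturbation_ontology_id = "CHEMBL25",
  perturbation_type = "compound",
  cell_type = "hepatocyte"
))
```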

Working with Results

The result structure is similar to baseline models:

# Access metadata and expression matrices
metadata <- result$metadata
expression <- result$expression

# Inspect the dimensions of the expression data and the first metadata rows
dim(expression)
head(metadata)

Differential Expression

When conditioning on both biological and technical latents, you can directly compare the generated expression to your reference to identify perturbation effects:

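# Your reference (input) counts, normalized to counts per million (CPM)
reference_cpm <- your_reference_counts / sum(your_reference_counts) * 1e6

# First generated (perturbed) sample, flattened to a named numeric vector
generated <- unlist(expression[1, ])
generated_cpm <- generated / sum(generated) * 1e6

# Log2 fold change, using a pseudocount of 1 to avoid log2(0)
log2fc <- log2(generated_cpm + 1) - log2(reference_cpm + 1)

# Top 20 genes by absolute change (both up- and down-regulated)
head(log2fc[order(abs(log2fc), decreasing = TRUE)], 20)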
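
When num_samples is greater than one, a single generated row can be noisy. One possible approach, sketched here with toy data standing in for result$expression and the reference vector, is to average CPM across generated samples before computing the fold change. The layout (samples in rows, genes in columns) is an assumption inferred from the expression[1, ] indexing above, and all gene names and values are made up.

```r
# Toy stand-ins: 3 generated samples x 4 genes, plus a reference vector
genes <- c("geneA", "geneB", "geneC", "geneD")
toy_expression <- matrix(
  c(100, 50, 10, 40,
    120, 40, 10, 30,
     80, 60, 20, 40),
  nrow = 3, byrow = TRUE, dimnames = list(NULL, genes)
)
toy_reference <- c(geneA = 50, geneB = 50, geneC = 50, geneD = 50)

# Normalize a vector of counts to counts per million
to_cpm <- function(x) x / sum(x) * 1e6

# apply() over rows returns genes x samples, so transpose back
generated_cpm <- t(apply(toy_expression, 1, to_cpm))

# Average over samples, then compare to the reference
mean_generated_cpm <- colMeans(generated_cpm)
log2fc <- log2(mean_generated_cpm + 1) - log2(to_cpm(toy_reference) + 1)

# Genes ranked by absolute change
log2fc[order(abs(log2fc), decreasing = TRUE)]
```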

Important Notes

Counts Vector Length

The reference counts vector must match the model’s expected number of genes. If the length doesn’t match, the API will return a validation error.

Use get_example_query() to see the expected structure and ensure your counts vector has the correct length.
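
A length check along these lines can be run locally before submitting. This is a minimal sketch: check_counts_length is a hypothetical helper, and it assumes the example query stores a full-length counts vector at inputs[[1]]$counts$counts, as the query structure described earlier suggests. A toy list stands in for the real example query so the snippet runs without API access.

```r
# Hypothetical helper comparing a counts vector to the example query's length
check_counts_length <- function(counts, example_query) {
  expected <- length(example_query$inputs[[1]]$counts$counts)
  if (length(counts) != expected) {
    stop(sprintf("Expected %d genes, got %d", expected, length(counts)))
  }
  invisible(TRUE)
}

# Toy stand-in for get_example_query(...)$example_query (5 genes only)
toy_example_query <- list(
  inputs = list(list(counts = list(counts = rep(0, 5))))
)

# Passes silently; a vector of any other length raises an error
check_counts_length(c(1, 2, 3, 4, 5), toy_example_query)
```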

Gene Order

Ensure your reference counts are in the same gene order expected by the model. The response includes a gene_order field that specifies the expected order.
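
If your counts are stored as a named vector, reordering them to a given gene order is a simple indexing operation. The sketch below uses a hypothetical helper, align_counts, with toy gene names. It assumes you have already obtained the expected order as a character vector (toy_gene_order stands in for it here).

```r
# Hypothetical helper: reorder a named counts vector to match a gene order
align_counts <- function(counts, gene_order) {
  missing_genes <- setdiff(gene_order, names(counts))
  if (length(missing_genes) > 0) {
    stop("Missing genes: ", paste(missing_genes, collapse = ", "))
  }
  # Indexing by name reorders the vector and drops genes not in gene_order
  counts[gene_order]
}

toy_gene_order <- c("geneC", "geneA", "geneB")   # stand-in for the expected order
toy_counts <- c(geneA = 10, geneB = 20, geneC = 30)

aligned <- align_counts(toy_counts, toy_gene_order)
aligned
```

Stopping on missing genes, instead of filling them with zeros, is a deliberate choice in this sketch: silently padding could hide a mismatch between your annotation and the model's gene set.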