Get Started with ukbflow

Welcome to ukbflow

ukbflow is an R package for UK Biobank (UKB) analysis on the Research Analysis Platform (RAP). It covers the midstream-to-downstream pipeline, from phenotype derivation and association analysis to publication-ready figures and genetic risk scoring. It is designed for RAP-native UKB workflows and ships locally simulated data for development and testing.

Installation

pak::pkg_install("evanbio/ukbflow")

A Quick Taste

Load data

library(ukbflow)

df <- ops_toy()   # synthetic UKB-like cohort, no RAP connection needed

# On RAP, replace with:
# auth_login()
# auth_select_project("project-XXXXXXXXXXXX")
# df <- extract_pheno(c(31, 21022, 53, 20116)) |>
#   decode_values() |>
#   decode_names()

Derive a disease phenotype

df <- df |>
  derive_missing() |>                              # recode "Prefer not to answer" → NA
  derive_selfreport(name  = "t2dm",                # T2DM from self-report
                    regex = "diabetes",
                    field = "noncancer") |>
  derive_icd10(name   = "t2dm",                    # T2DM from HES (ICD-10 E11)
               icd10  = "E11",
               source = "hes") |>
  derive_case(name = "t2dm") |>                    # → t2dm_status, t2dm_date
  derive_followup(name         = "t2dm",
                  event_col    = "t2dm_date",
                  baseline_col = "p53_i0",         # assessment centre date
                  censor_date  = as.Date("2022-06-01"))
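
To make the logic of the pipeline above concrete, here is a minimal base-R sketch of the general idea: take the earliest diagnosis date across sources, define case status, and compute follow-up time up to the event or the censoring date. It is not ukbflow's implementation. The toy data frame, its column names (`baseline`, `selfreport_date`, `icd10_date`), and the 365.25-day year are assumptions made for illustration only.

```r
# Toy data with hypothetical column names (not ukbflow output)
toy <- data.frame(
  baseline        = as.Date(c("2008-03-01", "2009-07-15", "2010-01-10")),
  selfreport_date = as.Date(c(NA,           "2012-05-01", NA)),
  icd10_date      = as.Date(c("2015-09-30", "2011-02-20", NA))
)
censor_date <- as.Date("2022-06-01")

# Earliest diagnosis date across sources; stays NA when no source has a record
toy$event_date <- pmin(toy$selfreport_date, toy$icd10_date, na.rm = TRUE)

# Case status: 1 if an event was recorded on or before the censoring date
toy$status <- as.integer(!is.na(toy$event_date) & toy$event_date <= censor_date)

# Follow-up ends at the event or at censoring, whichever comes first
end_date <- pmin(toy$event_date, censor_date, na.rm = TRUE)
toy$followup_years <- as.numeric(
  difftime(end_date, toy$baseline, units = "days")
) / 365.25

toy[, c("event_date", "status", "followup_years")]
```

In this sketch, a participant diagnosed before baseline would get a negative follow-up time, so prevalent cases would need to be handled separately before modelling.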

Run an association model

res <- assoc_coxph(
  data         = df,
  outcome_col  = "t2dm_status",
  time_col     = "t2dm_followup_years",
  exposure_col = "p21001_i0",   # BMI (continuous)
  covariates   = c("p21022",    # age_at_recruitment
                   "p31")       # sex
)
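
For readers who want to see what a model of this shape looks like outside ukbflow, the following sketch fits a comparable Cox model directly with `survival::coxph()` on simulated data and extracts the hazard ratio with its 95% confidence interval. How `assoc_coxph()` works internally is not shown here, so treat this as an illustration under assumptions: the simulated variables (`bmi`, `age`, `sex`), the true effect size, and the 12-year censoring time are all made up.

```r
library(survival)  # recommended package, shipped with standard R installations

set.seed(1)
n <- 500
toy <- data.frame(
  bmi = rnorm(n, mean = 27, sd = 4),
  age = runif(n, min = 40, max = 70),
  sex = factor(sample(c("Female", "Male"), n, replace = TRUE))
)

# Simulate event times whose hazard rises with BMI (true log-HR 0.05 per unit)
event_time  <- rexp(n, rate = 0.02 * exp(0.05 * (toy$bmi - 27)))
censor_time <- 12  # administrative censoring after 12 years (assumed)
toy$followup_years <- pmin(event_time, censor_time)
toy$status         <- as.integer(event_time <= censor_time)

# Cox proportional hazards model: exposure plus covariates
fit <- coxph(Surv(followup_years, status) ~ bmi + age + sex, data = toy)

# Hazard ratio and 95% CI for the exposure, on the exponentiated scale
hr <- unname(exp(coef(fit)["bmi"]))
ci <- unname(exp(confint(fit)["bmi", ]))
data.frame(term = "bmi", HR = hr, CI_lower = ci[1], CI_upper = ci[2])
```

The hazard ratio is the exponentiated model coefficient, and the interval bounds are the exponentiated bounds of the coefficient's confidence interval. These are the three quantities a forest plot displays.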

Plot the results

# Forest plot — see vignette("plot") for full usage
res_df <- as.data.frame(res)
plot_forest(
  data      = res_df,
  est       = res_df$HR,
  lower     = res_df$CI_lower,
  upper     = res_df$CI_upper,
  ci_column = 7L   # res_df has 6 cols before HR; CI graphic goes here
)

# Table 1
plot_tableone(
  data   = as.data.frame(df),
  vars   = c("p21022",     # age_at_recruitment
             "p31",        # sex
             "p21001_i0"), # bmi
  strata = "t2dm_status"
)

Full Function Overview

Module    Key functions                                      Vignette
Auth      auth_login(), auth_select_project()                vignette("auth")
Fetch     fetch_ls(), fetch_file(), fetch_tree()             vignette("fetch")
Extract   extract_pheno(), extract_batch(), extract_ls()     vignette("extract")
Job       job_wait(), job_status(), job_result()             vignette("job")
Decode    decode_values(), decode_names()                    vignette("decode")
Derive    derive_missing(), derive_icd10(), derive_case()    vignette("derive")
Survival  derive_timing(), derive_age(), derive_followup()   vignette("derive-survival")
Assoc     assoc_coxph(), assoc_logistic(), assoc_subgroup()  vignette("assoc")
Plot      plot_forest(), plot_tableone()                     vignette("plot")
GRS       grs_check(), grs_score(), grs_validate()           vignette("grs")
Ops       ops_setup(), ops_toy(), ops_snapshot()             vignette("ops")

End-to-End Case Study

For a complete worked example on a simulated UK Biobank cohort, covering data loading, phenotype derivation, cohort assembly, Cox regression, and publication-ready visualisation, see:

vignette("smoking_lung_cancer"), "Smoking and Lung Cancer Risk: A Complete Analysis Workflow"

Additional Resources

“All models are wrong, but some are publishable.”

— after George Box