UKB phenotype data is stored in a proprietary .dataset
format on the RAP and cannot be read directly. The
extract_* functions provide R interfaces for discovering
approved fields and extracting phenotype data via the DNAnexus
dx extract_dataset and table-exporter
tools.
Two workflows are available:
| Function | Mode | Scale | Output |
|---|---|---|---|
| extract_batch() | Async job | Large / production (typically 50+ fields) | job ID → CSV on RAP cloud |
| extract_pheno() | Synchronous | Small (quick checks) | data.table in memory |
extract_batch() is the recommended
approach for any serious analysis. extract_pheno()
is provided for quick interactive inspection inside the RAP environment
only.
Ensure you are authenticated and have selected your project before extracting; see vignette("auth") for setup.
Before extracting, use extract_ls() to explore what
fields are approved for your project:
# List all approved fields (cached after first call)
extract_ls()
# Search by keyword
extract_ls(pattern = "cancer")
extract_ls(pattern = "p31|p53|p21022")
# Force refresh after switching projects or datasets
extract_ls(refresh = TRUE)

The result is a data.frame with two columns:
| Column | Example |
|---|---|
| field_name | participant.p53_i0 |
| title | Date of attending assessment centre \| Instance 0 |
Fields reflect your project’s approved data only — not all UKB fields are present.
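Because the result is a plain data.frame, it can also be filtered with ordinary base R once it has been retrieved. The sketch below uses a small mock data.frame in place of a real extract_ls() call; the three rows are illustrative and are not output from any actual project.

```r
# Mock stand-in for the data.frame returned by extract_ls()
# (illustrative rows only; a real project will list different fields)
fields <- data.frame(
  field_name = c("participant.p31", "participant.p53_i0", "participant.p21022"),
  title = c("Sex",
            "Date of attending assessment centre | Instance 0",
            "Age at recruitment"),
  stringsAsFactors = FALSE
)

# Keep rows whose title mentions "assessment" (case-insensitive)
hits <- fields[grepl("assessment", fields$title, ignore.case = TRUE), ]

# Recover numeric field IDs from names of the form participant.p<ID>...
is_field <- grepl("^participant\\.p[0-9]+", fields$field_name)
field_ids <- as.integer(
  sub("^participant\\.p([0-9]+).*$", "\\1", fields$field_name[is_field])
)
```

The resulting integer vector has the same shape as the field_id vectors passed to extract_batch() in the examples that follow.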
extract_batch()

For large-scale or production extractions, submit an asynchronous table-exporter job on the RAP cloud:
# Submit extraction job
job_id <- extract_batch(c(31, 53, 21022, 22189))
# Custom output name
job_id <- extract_batch(
  field_id = c(31, 53, 21022, 22189),
  file = "ukb_demographics"
)
# High priority (faster queue, higher cost)
job_id <- extract_batch(
  field_id = c(31, 53, 21022, 22189),
  priority = "high"
)

The job runs asynchronously on the RAP cloud, and the output CSV is saved
to your RAP project. Monitor the job and retrieve its output with the job_
series:
job_status(job_id) # check progress
job_path(job_id) # get cloud file path once complete
job_result(job_id) # read result as data.table (inside RAP only)

extract_batch() automatically selects an appropriate
instance based on the number of columns:
| Columns | Instance |
|---|---|
| ≤ 20 | mem1_ssd1_v2_x4 |
| ≤ 100 | mem1_ssd1_v2_x8 |
| ≤ 500 | mem1_ssd1_v2_x16 |
| > 500 | mem1_ssd1_v2_x36 |
You can override this with the instance_type argument if
needed.
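The thresholds in the table can be written out as a small lookup. This is only a sketch of the documented mapping, not the package's internal code: pick_instance() is a hypothetical helper, and it assumes n_cols is the number of output columns, which can exceed the number of fields when a field has several instances or array entries.

```r
# Hypothetical helper that mirrors the column-count thresholds in the table.
# It is not part of the package; extract_batch() does this selection itself.
pick_instance <- function(n_cols) {
  stopifnot(is.numeric(n_cols), all(n_cols >= 1))
  tiers <- c("mem1_ssd1_v2_x4", "mem1_ssd1_v2_x8",
             "mem1_ssd1_v2_x16", "mem1_ssd1_v2_x36")
  # left.open = TRUE makes each upper bound inclusive:
  # (0, 20], (20, 100], (100, 500], (500, Inf)
  tiers[findInterval(n_cols, c(20, 100, 500), left.open = TRUE) + 1L]
}

chosen <- pick_instance(c(4, 20, 21, 100, 500, 501))
```

Checking the boundary values (20, 100, 500) this way is a quick way to see which tier a planned extraction would fall into before deciding whether to set instance_type by hand.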
extract_pheno()

extract_pheno() is intended for small-scale interactive checks inside the RAP RStudio environment.

extract_pheno() is restricted to the RAP environment and returns data in memory only. For any analysis intended to be saved or reproduced, use extract_batch().
Note: extract_pheno() returns raw coded
values (e.g. 1/0 for Sex, numeric
codes for diseases). Use the decode_* series to convert
codes to human-readable labels.
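To illustrate what decoding means for a single column, the sketch below relabels raw Sex codes with base R's factor(). It assumes the UK Biobank coding for field 31 (0 = Female, 1 = Male) and uses made-up values; in practice the decode_* functions are the intended route, and their arguments are not shown here.

```r
# Made-up raw values standing in for a participant.p31 column
sex_raw <- c(1, 0, 0, 1)

# Assumed coding for field 31: 0 = Female, 1 = Male
sex <- factor(sex_raw, levels = c(0, 1), labels = c("Female", "Male"))

# Counts per label, useful as a quick sanity check after decoding
sex_counts <- table(sex)
```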
Column naming differs between the two extraction methods:
extract_batch() — no prefix:
| Column | Meaning |
|---|---|
| eid | Participant ID |
| p31 | Field 31 (Sex) |
| p53_i0 | Field 53, Instance 0 |
| p20002_i0_a0 | Field 20002, Instance 0, Array 0 |
extract_pheno() —
participant. prefix:
| Column | Meaning |
|---|---|
| participant.eid | Participant ID |
| participant.p31 | Field 31 (Sex) |
| participant.p53_i0 | Field 53, Instance 0 |
| participant.p20002_i0_a0 | Field 20002, Instance 0, Array 0 |
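If results from both methods need to share one naming scheme, the prefix can be stripped and the remaining name split into its parts. The following is a minimal sketch using base R only; parse_ukb_col() is a hypothetical helper, and it assumes column names follow exactly the p<field>_i<instance>_a<array> pattern shown in the tables.

```r
# Column names as returned by extract_pheno(), taken from the table above
pheno_cols <- c("participant.eid", "participant.p31",
                "participant.p53_i0", "participant.p20002_i0_a0")

# Drop the prefix so the names match the extract_batch() style
batch_style <- sub("^participant\\.", "", pheno_cols)

# Hypothetical helper: split a name into field, instance and array numbers
parse_ukb_col <- function(x) {
  pattern <- "^p([0-9]+)(_i([0-9]+))?(_a([0-9]+))?$"
  m <- regmatches(x, regexec(pattern, x))
  # Names that do not match (such as "eid") give zero-length results; pad with NA
  m <- lapply(m, function(g) if (length(g) == 0) rep(NA_character_, 6) else g)
  m <- do.call(rbind, m)
  data.frame(
    column = x,
    field = as.integer(m[, 2]),     # NA for non-field columns
    instance = as.integer(m[, 4]),  # NA when the name has no _i part
    array = as.integer(m[, 6]),     # NA when the name has no _a part
    stringsAsFactors = FALSE
  )
}

parsed <- parse_ukb_col(batch_style)
```

Parsing names rather than hard-coding them keeps downstream code working when a field expands into several instance or array columns.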
?extract_ls, ?extract_pheno, ?extract_batch

vignette("auth"): authentication setup