The fast variant call format (VCF) file header intelligence and audit.
The main workflow is to parse a VCF header and write a standalone HTML report.
A preview of the bundled example HTML report is shown below.
The report includes file metadata, contig summaries, INFO and FORMAT field definitions, and other header entries in a portable HTML layout.
The vcfheader package reads and interprets Variant Call
Format (VCF) header metadata without loading full variant records. It is
designed for fast inspection, validation, and reporting of header
structure in both small and large VCF files.
This vignette shows how to:
vcf_hdr objectThe package examples use small VCF files shipped with the package, so the vignette works offline and is suitable for package checking.
Use read_vcf_header() when you want only the original
header lines.
hdr_lines <- read_vcf_header(simple_vcf)
head(hdr_lines, 5)
#> [1] "##fileformat=VCFv4.3"
#> [2] "##fileDate=20090805"
#> [3] "##source=myImputationProgramV3.1"
#> [4] "##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta"
#> [5] "##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species=\"Homo sapiens\",taxonomy=x>"Use parse_vcf_header() to create a structured object
containing file metadata, contigs, INFO and FORMAT definitions, sample
names, warnings, and errors.
The parsed object is a list-like S3 object with standard components.
names(hdr)
#> [1] "file" "samples" "contigs" "info" "format"
#> [6] "filter" "alt" "sample_meta" "pedigree" "other_kv"
#> [11] "raw" "warnings" "errors"File-level metadata:
file_meta <- hdr$file
file_meta$input_path <- basename(file_meta$input_path)
file_meta
#> $input_path
#> [1] "simple.vcf"
#>
#> $fileformat
#> [1] "VCFv4.3"
#>
#> $fileDate
#> [1] "20090805"
#>
#> $source
#> [1] "myImputationProgramV3.1"
#>
#> $reference
#> [1] "file:///seq/references/1000GenomesPilot-NCBI36.fasta"
#>
#> $phasing
#> [1] "partial"Sample names:
INFO definitions:
hdr$info[, c("ID", "Number", "Type", "Description")]
#> ID Number Type Description
#> 1 NS 1 Integer Number of Samples With Data
#> 2 DP 1 Integer Total Depth
#> 3 AF A Float Allele Frequency
#> 4 AA 1 String Ancestral Allele
#> 5 DB 0 Flag dbSNP membership, build 129
#> 6 H2 0 Flag HapMap2 membershipFORMAT definitions:
hdr$format[, c("ID", "Number", "Type", "Description")]
#> ID Number Type Description
#> 1 GT 1 String Genotype
#> 2 GQ 1 Integer Genotype Quality
#> 3 DP 1 Integer Read Depth
#> 4 HQ 2 Integer Haplotype QualityValidation output:
The parser can add lightweight inferred annotations. For example, some common annotation systems can be guessed from reserved INFO tags.
The structural-variant example includes symbolic ALT definitions and structural INFO fields.
sv_hdr <- parse_vcf_header(sv_vcf)
sv_hdr
#> <vcf_hdr>
#> fileformat: VCFv4.4
#> samples: 1
#> contigs: 1
#> INFO: 9 FORMAT: 1 FILTER: 0 ALT: 6Contigs:
sv_hdr$contigs[, intersect(c("ID", "length", "assembly", "md5"), names(sv_hdr$contigs)), drop = FALSE]
#> ID length assembly md5
#> 1 chrA 1000000 <NA> <NA>ALT definitions:
sv_hdr$alt[, c("ID", "Description")]
#> ID Description
#> 1 INV Inversion
#> 2 INS Insertion
#> 3 DUP Duplication
#> 4 DUP:TANDEM Tandem Duplication
#> 5 DEL Deletion
#> 6 CNV Copy number variable regionINFO fields:
sv_hdr$info[, c("ID", "Number", "Type", "Description")]
#> ID Number Type
#> 1 MATEID A String
#> 2 END 1 Integer
#> 3 CIPOS . Integer
#> 4 SVLEN A Integer
#> 5 CILEN . Integer
#> 6 EVENT A String
#> 7 EVENTTYPE A String
#> 8 SVCLAIM A String
#> 9 IMPRECISE 0 Flag
#> Description
#> 1 ID of mate breakend
#> 2 End position of the longest variant described in this record
#> 3 Confidence interval around POS for symbolic structural variants
#> 4 Length of structural variant
#> 5 Confidence interval for the SVLEN field
#> 6 ID of associated event
#> 7 Type of associated event
#> 8 Claim made by the structural variant call. Valid values are D, J, DJ for abundance, adjacency and both respectively
#> 9 Imprecise structural variationvcfheader() writes a standalone HTML report. This is the
main use case.
hdr <- parse_vcf_header("file.vcf")
out_file <- "file_report.html"
vcfheader(
hdr,
file = out_file
)
file.exists(out_file)
#> [1] TRUE
basename(out_file)
#> [1] "file_report.html"You can also let vcfheader() derive the output path from
the original input file path.
The package also ships a prebuilt example report generated from the
bundled simple.vcf file.
example_report <- system.file(
"extdata",
"simple_vcfheader.html",
package = "vcfheader"
)
basename(example_report)
#> [1] "simple_vcfheader.html"
file.exists(example_report)
#> [1] TRUEThis file can be opened directly in a browser after installation.
vcfheader provides a quick way to inspect VCF header
metadata, validate structure, extract structured definitions, and
produce a portable HTML summary for review or reporting.
For research use only.
VCFheader is a free and open-source tool provided by Switzerland Omics. Use of the software and generated reports is permitted, including in commercial settings, under the MIT License. Attribution to Switzerland Omics and VCFheader should be retained where reasonably practicable. Further information: https://switzerlandomics.ch. Switzerland OmicsĀ® is a registered trade mark. VCF specification references in this report relate to samtools and the broader HTS specifications ecosystem and are distributed under the MIT/Expat License by Genome Research Ltd.Ā Source: https://github.com/samtools/samtools. Further reading on HTS and file format specifications: https://www.htslib.org/doc/#file-formats.