Non-standard genetic code

Introduction

cubar supports the codon usage bias analysis of sequences utilizing non-standard genetic codes, such as those found in mitochondrial or chloroplast protein-coding sequences. To illustrate its application, we demonstrate the calculation of effective number of codons (ENC) for human mitochondrial CDS sequences.

suppressPackageStartupMessages(library(Biostrings))
library(cubar)

Main analysis

First, Load sequences and get the corresponding codon table.

human_mt
#> DNAStringSet object of length 13:
#>      width seq                                              names               
#>  [1]   681 ATGAACGAAAATCTGTTCGCTTC...CTACCTGCACGACAACACATAA MT-ATP6
#>  [2]   346 ATAAACTTCGCCTTAATTTTAAT...AAAGGATTAGACTGAACCGAAT MT-ND3
#>  [3]   956 ATACCCATGGCCAACCTCCTACT...CCAGCATTCCCCCTCAAACCTA MT-ND1
#>  [4]   207 ATGCCCCAACTAAATACTACCGT...TTCATTGCCCCCACAATCCTAG MT-ATP8
#>  [5]  1141 ATGACCCCAATACGCAAAACTAA...AACAAAATACTCAAATGGGCCT MT-CYB
#>  ...   ... ...
#>  [9]  1042 ATTAATCCCCTGGCCCAACCCGT...CCTTTTATACTAATAATCTTAT MT-ND2
#> [10]   525 ATGATGTATGCTTTGTTTCTGTT...TGAGATTGCTCGGGGGAATAGG MT-ND6
#> [11]  1542 ATGTTCGCCGACCGTTGACTATT...ACCCGTATACATAAAATCTAGA MT-CO1
#> [12]   684 ATGGCACATGCAGCGCAAGTAGG...AGGGCCCGTATTTACCCTATAG MT-CO2
#> [13]   784 ATGACCCACCAATCACATGCCTA...TCCATCTATTGATGAGGGTCTT MT-CO3

ctab <- get_codon_table(gcid = '2')
head(ctab)
#>    aa_code amino_acid codon subfam
#> 1:       F        Phe   TTT Phe_TT
#> 2:       F        Phe   TTC Phe_TT
#> 3:       L        Leu   TTA Leu_TT
#> 4:       L        Leu   TTG Leu_TT
#> 5:       S        Ser   TCT Ser_TC
#> 6:       S        Ser   TCC Ser_TC

We do not check CDS length and stop codons as incomplete stop codons are prevalent among MT CDSs.

human_mt_qc <- check_cds(
    human_mt,
    codon_table = ctab,
    check_stop = FALSE,
    rm_stop = FALSE,
    check_len = FALSE,
    start_codons = c('ATG', 'ATA', 'ATT'))

human_mt_qc
#> DNAStringSet object of length 13:
#>      width seq                                              names               
#>  [1]   678 AACGAAAATCTGTTCGCTTCATT...CTACCTGCACGACAACACATAA MT-ATP6
#>  [2]   343 AACTTCGCCTTAATTTTAATAAT...AAAGGATTAGACTGAACCGAAT MT-ND3
#>  [3]   953 CCCATGGCCAACCTCCTACTCCT...CCAGCATTCCCCCTCAAACCTA MT-ND1
#>  [4]   204 CCCCAACTAAATACTACCGTATG...TTCATTGCCCCCACAATCCTAG MT-ATP8
#>  [5]  1138 ACCCCAATACGCAAAACTAACCC...AACAAAATACTCAAATGGGCCT MT-CYB
#>  ...   ... ...
#>  [9]  1039 AATCCCCTGGCCCAACCCGTCAT...CCTTTTATACTAATAATCTTAT MT-ND2
#> [10]   522 ATGTATGCTTTGTTTCTGTTGAG...TGAGATTGCTCGGGGGAATAGG MT-ND6
#> [11]  1539 TTCGCCGACCGTTGACTATTCTC...ACCCGTATACATAAAATCTAGA MT-CO1
#> [12]   681 GCACATGCAGCGCAAGTAGGTCT...AGGGCCCGTATTTACCCTATAG MT-CO2
#> [13]   781 ACCCACCAATCACATGCCTATCA...TCCATCTATTGATGAGGGTCTT MT-CO3

As stop codons are present, now we manually remove them.

len_trim <- width(human_mt_qc) %% 3
len_trim <- ifelse(len_trim == 0, 3, len_trim)
human_mt_qc <- subseq(human_mt_qc, start = 1, end = width(human_mt_qc) - len_trim)

human_mt_qc
#> DNAStringSet object of length 13:
#>      width seq                                              names               
#>  [1]   675 AACGAAAATCTGTTCGCTTCATT...CCTCTACCTGCACGACAACACA MT-ATP6
#>  [2]   342 AACTTCGCCTTAATTTTAATAAT...AAAAGGATTAGACTGAACCGAA MT-ND3
#>  [3]   951 CCCATGGCCAACCTCCTACTCCT...CTCCAGCATTCCCCCTCAAACC MT-ND1
#>  [4]   201 CCCCAACTAAATACTACCGTATG...TCATTCATTGCCCCCACAATCC MT-ATP8
#>  [5]  1137 ACCCCAATACGCAAAACTAACCC...AAACAAAATACTCAAATGGGCC MT-CYB
#>  ...   ... ...
#>  [9]  1038 AATCCCCTGGCCCAACCCGTCAT...CCCTTTTATACTAATAATCTTA MT-ND2
#> [10]   519 ATGTATGCTTTGTTTCTGTTGAG...AATTGAGATTGCTCGGGGGAAT MT-ND6
#> [11]  1536 TTCGCCGACCGTTGACTATTCTC...AGAACCCGTATACATAAAATCT MT-CO1
#> [12]   678 GCACATGCAGCGCAAGTAGGTCT...AATAGGGCCCGTATTTACCCTA MT-CO2
#> [13]   780 ACCCACCAATCACATGCCTATCA...CTCCATCTATTGATGAGGGTCT MT-CO3

Finally, we calculate codon frequencies and ENC.

# calculate codon frequency
mt_cf <- count_codons(human_mt_qc)

# calculate ENC
get_enc(mt_cf, codon_table = ctab)
#>  MT-ATP6   MT-ND3   MT-ND1  MT-ATP8   MT-CYB  MT-ND4L   MT-ND4   MT-ND5 
#> 46.44871 44.93068 42.42264 50.54562 42.97931 47.29975 42.65228 43.53521 
#>   MT-ND2   MT-ND6   MT-CO1   MT-CO2   MT-CO3 
#> 45.18387 45.56454 44.83277 49.19525 47.07683

It is important to note that the check_cds function and stop codon trimming are optional steps, and you can implement your own quality control procedures. However, it is crucial to ensure that your input sequences are suitable for codon usage bias analysis. Failure to do so may lead to ambiguous and misleading results from problematic sequences.