Nucleic Acids

Type definitions

BioSymbols provides two types of nucleic acids:

Type  Meaning
DNA   deoxyribonucleic acid
RNA   ribonucleic acid

Both are 8-bit primitive types and subtypes of NucleicAcid.

julia> sizeof(DNA)
1

julia> DNA <: NucleicAcid
true

The set of nucleotide symbols in BioSymbols.jl covers the IUPAC nucleotides as well as a GAP (-) symbol.

Symbol  Constant           Meaning
'A'     DNA_A / RNA_A      A; Adenine
'C'     DNA_C / RNA_C      C; Cytosine
'G'     DNA_G / RNA_G      G; Guanine
'T'     DNA_T              T; Thymine (DNA only)
'U'     RNA_U              U; Uracil (RNA only)
'M'     DNA_M / RNA_M      A or C
'R'     DNA_R / RNA_R      A or G
'W'     DNA_W / RNA_W      A or T/U
'S'     DNA_S / RNA_S      C or G
'Y'     DNA_Y / RNA_Y      C or T/U
'K'     DNA_K / RNA_K      G or T/U
'V'     DNA_V / RNA_V      A or C or G; not T/U
'H'     DNA_H / RNA_H      A or C or T/U; not G
'D'     DNA_D / RNA_D      A or G or T/U; not C
'B'     DNA_B / RNA_B      C or G or T/U; not A
'N'     DNA_N / RNA_N      A or C or G or T/U
'-'     DNA_Gap / RNA_Gap  Gap (none of the above)

http://www.insdc.org/documents/feature_table.html#7.4.1

These symbols are accessible as constants with a DNA_ or RNA_ prefix:

julia> DNA_A
DNA_A

julia> DNA_T
DNA_T

julia> RNA_U
RNA_U

julia> DNA_Gap
DNA_Gap

julia> typeof(DNA_A)
BioSymbols.DNA

julia> typeof(RNA_A)
BioSymbols.RNA

Symbols can be constructed by converting regular characters:

julia> convert(DNA, 'C')
DNA_C

julia> convert(DNA, 'C') === DNA_C
true

julia> convert(DNA, 'c') === convert(DNA, 'C')  # conversion is not case-sensitive
true
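
The same conversion can be applied to every character of an ordinary string. The following is a minimal sketch that uses only the `convert` and `print` behavior shown on this page; the variable names `symbols` and `text` are illustrative and not part of BioSymbols.

```julia
using BioSymbols

# Convert each character of an ordinary string into a DNA symbol.
# Lowercase input is accepted because the conversion ignores case.
symbols = [convert(DNA, c) for c in "acgtn-"]

# `print` emits the undecorated letter of each symbol, so `join`
# rebuilds the text from the symbols.
text = join(symbols)
println(text)
```

Because each symbol occupies a single byte, a vector built this way stores one byte per nucleotide.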

print and show methods are defined to output the text representation of a symbol:

julia> print(DNA_A)  # un-decorated text
A

julia> show(DNA_A)   # informative text
DNA_A

Bit encoding

Every nucleotide is encoded in the lower 4 bits of a byte. An unambiguous nucleotide has exactly one bit set: 0001 for A, 0010 for C, 0100 for G, and 1000 for T (DNA) or U (RNA). An ambiguous nucleotide is the bitwise OR of the unambiguous nucleotides it can represent. For example, DNA_R (either DNA_A or DNA_G) is encoded as 0101, the bitwise OR of 0001 (DNA_A) and 0100 (DNA_G). The gap symbol is always 0000.
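
The encoding scheme can be modeled without the package. This is a stand-alone sketch, not BioSymbols' own implementation: `BASE_BITS`, `encode`, and `decode` are hypothetical names, and the codes for C (0010) and T (1000) are assumed to occupy the two single-bit positions not used by A and G.

```julia
# Stand-alone model of the 4-bit encoding; these names are hypothetical
# and are not part of BioSymbols.
const BASE_BITS = Dict('A' => 0b0001, 'C' => 0b0010, 'G' => 0b0100, 'T' => 0b1000)

# Encode a set of possible bases as the bitwise OR of their single-bit codes.
# An empty set yields 0000, which corresponds to the gap.
encode(bases::AbstractString) = reduce(|, (BASE_BITS[b] for b in bases); init = 0x00)

# Decode a 4-bit code back into the bases it permits.
decode(code::UInt8) = join(b for b in "ACGT" if BASE_BITS[b] & code != 0)

for (name, bases) in ["R" => "AG", "Y" => "CT", "N" => "ACGT", "-" => ""]
    code = encode(bases)
    # Show only the lower 4 bits of the byte.
    println(name, " = ", bitstring(code)[5:8], " -> {", decode(code), "}")
end
```

In this model an ambiguity code is simply a set of bases stored as a bit mask, which is why set union and intersection map directly onto bitwise OR and AND.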

julia> bitstring(reinterpret(UInt8, DNA_A))
"00000001"

julia> bitstring(reinterpret(UInt8, DNA_G))
"00000100"

julia> bitstring(reinterpret(UInt8, DNA_R))
"00000101"

This bit encoding enables efficient bit operations:

julia> DNA_A | DNA_G  # A or G
DNA_R

julia> DNA_A & DNA_G  # A and G
DNA_Gap

julia> DNA_A | ~DNA_A  # A or not A
DNA_N

julia> DNA_A | DNA_C | DNA_G | DNA_T  # any DNA
DNA_N
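
These operations are enough to answer simple questions about symbols. The sketch below uses only the operators and constants shown above, together with Base's `count_ones`. The helpers `shares_base` and `is_single_base` are hypothetical names written for illustration, and the package may provide its own predicates for the same queries.

```julia
using BioSymbols

# Two symbols are compatible when they share at least one possible base,
# that is, when their bitwise AND is not the gap symbol (0000).
# `shares_base` is a hypothetical helper, not a BioSymbols function.
shares_base(x::DNA, y::DNA) = (x & y) !== DNA_Gap

# A symbol is unambiguous when exactly one of its four bits is set.
# `is_single_base` is likewise hypothetical.
is_single_base(x::DNA) = count_ones(reinterpret(UInt8, x)) == 1

println(shares_base(DNA_R, DNA_A))   # R covers A or G, so it overlaps with A
println(shares_base(DNA_R, DNA_C))   # R does not cover C
println(is_single_base(DNA_G))
println(is_single_base(DNA_N))
```

Both checks compile down to a single AND or population count on one byte, which is the practical benefit of the encoding.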