The BioSequences module provides representations and tools for manipulating nucleotide and amino acid sequences.
Introduction to the sequence data-types
Sequences in BioSequences.jl are more strictly typed than in many other libraries: elements in a sequence are typed as biological symbols rather than as characters or bytes. The sequence types are special-purpose types rather than plain strings, and hence offer functionality that naive string types lack. Though this strictness sacrifices some convenience, it also means you can always rely on a DNA sequence type to store DNA and nothing but DNA, without having to check its contents or deal with lowercase versus uppercase letters.
Strict separation of sequence types also means we are free to choose the most efficient representation. DNA and RNA sequences are encoded using either four bits per base (the default) or two bits per base. This makes them memory efficient and speeds up many common operations and transformations, such as nucleotide composition, reverse complement, and k-mer enumeration.
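To make the two-bits-per-base idea concrete, here is a minimal sketch in plain Julia (no packages). It is not how BioSequences is implemented internally; the helper names `base_code`, `pack2bit`, `unpack2bit` and `revcomp2bit`, and the code assignment A=00, C=01, G=10, T=11, are assumptions chosen for illustration. With this assignment, complementing a base is a bitwise NOT of its two bits, so reverse complement becomes a short loop of shifts and masks.

```julia
# Hypothetical 2-bit code for an unambiguous DNA base (A=0, C=1, G=2, T=3).
function base_code(c::Char)
    c == 'A' && return UInt64(0)
    c == 'C' && return UInt64(1)
    c == 'G' && return UInt64(2)
    c == 'T' && return UInt64(3)
    throw(ArgumentError("not an unambiguous DNA base: $c"))
end

# Pack up to 32 bases into one 64-bit word; the last base lands in the lowest bits.
function pack2bit(s::AbstractString)
    length(s) <= 32 || throw(ArgumentError("at most 32 bases fit in a UInt64"))
    word = zero(UInt64)
    for c in s
        word = (word << 2) | base_code(c)
    end
    return word
end

# Decode the lowest n bases of a packed word back into a String.
function unpack2bit(word::UInt64, n::Integer)
    chars = Vector{Char}(undef, n)
    for i in n:-1:1
        chars[i] = "ACGT"[Int(word & 0x03) + 1]
        word >>= 2
    end
    return join(chars)
end

# Reverse complement: read bases from the low end, complement each one
# (bitwise NOT of its two bits), and append it to the output word.
function revcomp2bit(word::UInt64, n::Integer)
    out = zero(UInt64)
    for _ in 1:n
        out = (out << 2) | (~word & 0x03)
        word >>= 2
    end
    return out
end

word = pack2bit("AACG")
println(unpack2bit(revcomp2bit(word, 4), 4))  # prints CGTT
```

In this sketch four bases occupy a single byte's worth of bits, whereas an ASCII string needs four bytes. A two-bit code has no room for ambiguity symbols such as N, which is one reason a four-bit encoding is a sensible default.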
BioSequences provides three different sequence types: BioSequence, Kmer and ReferenceSequence. Each of these types is a subtype of an abstract type called Sequence and supports various string-like operations such as random access and iteration. Different sequence types have different features. In most situations, the BioSequence type will do, and it is used as the default representation. Sometimes, however, the other types are preferable in terms of memory efficiency and computational performance. The following table summarizes the three types:
|Type||Description||Element type||Mutability||Allocation|
|BioSequence||general-purpose biological sequences||DNA, RNA, amino acids||mutable||heap|
|Kmer||specialized for short nucleotide sequences||DNA, RNA||immutable||stack / register|
|ReferenceSequence||specialized for long reference genomes||DNA||immutable||heap|
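The string-like operations mentioned above (random access and iteration) can be sketched as follows. This assumes the BioSequences package is installed and that the `dna"..."` string literal, the `DNA_A` symbol constant and `reverse_complement` are available. Concrete type names differ between releases of the package, so the example avoids naming them and should be read as a sketch rather than a reference.

```julia
using BioSequences

# Build a DNA sequence from a literal; N is an ambiguity symbol.
seq = dna"ACGTN"

# Random access returns a biological symbol, not a Char.
first_base = seq[1]
println(first_base == DNA_A)

# Iteration works as it does for strings and arrays.
n_adenine = count(nt -> nt == DNA_A, seq)
println(n_adenine)

# A typical transformation that benefits from the compact encoding.
println(reverse_complement(seq))
```

Because elements are typed symbols, a comparison such as `seq[1] == 'A'` against a character is not the intended idiom; compare against symbol constants like `DNA_A` instead.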
Details of these different representations are explained in the following sections: