# Sequence composition

When analyzing sequence data, you will often want to know the composition of your sequences.

For example, for a given sequence, you may want to count how many times each possible k-mer occurs in it. This matters if, for instance, you want to analyze the k-mer spectra of your data. Alternatively, you might have a collection of sequences and want to count how many copies of each unique sequence it contains. This matters if, for instance, the collection is a population sample and you want to compute allele or genotype frequencies for the population.
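As a rough illustration of the first use case, overlapping k-mer counting can be sketched in plain Julia with a `Dict`. The helper `count_kmers` is hypothetical and not part of BioSequences; it works on ASCII strings rather than biological sequence types and is only meant to show what is being counted.

``````julia
# Hypothetical helper (not a BioSequences function): count every
# overlapping k-mer of length `k` in an ASCII string.
function count_kmers(seq::String, k::Integer)
    counts = Dict{String,Int}()
    for i in 1:(length(seq) - k + 1)
        kmer = seq[i:i+k-1]                      # substring starting at position i
        counts[kmer] = get(counts, kmer, 0) + 1  # unseen k-mers start from 0
    end
    return counts
end

counts = count_kmers("ACGTACGT", 4)
``````

Here `"ACGTACGT"` contains five overlapping 4-mers, of which `ACGT` occurs twice and `CGTA`, `GTAC` and `TACG` occur once each.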

Whatever the application, BioSequences provides a method called `composition` and a parametric struct called `Composition` to compute and handle the results of such sequence composition calculations.

Sequence composition.

This is a subtype of `Associative{T,Int}`, and the `getindex` method returns the number of occurrences of a symbol or a k-mer.

``composition(seq | kmer_iter)``

Calculate composition of biological symbols in `seq` or k-mers in `kmer_iter`.

``composition(iter)``

A generalised composition algorithm, which counts how many times each unique item produced by an iterable occurs.

Example

``````
# Example, counting unique sequences.

julia> a = dna"AAAAAAAATTTTTT"
14nt DNA Sequence:
AAAAAAAATTTTTT

julia> b = dna"AAAAAAAATTTTTT"
14nt DNA Sequence:
AAAAAAAATTTTTT

julia> c = a[5:10]
6nt DNA Sequence:
AAAATT

julia> composition([a, b, c])
Vector{BioSequences.BioSequence{BioSequences.DNAAlphabet{4}}} Composition:
AAAATT         => 1
AAAAAAAATTTTTT => 2
``````
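Counts like these are the starting point for the allele or genotype frequencies mentioned in the introduction. The sketch below assumes the counts from the example output have been copied by hand into a plain `Dict` keyed by strings; the names `counts`, `total` and `freqs` are illustrative and it does not touch the `Composition` type itself.

``````julia
# Counts as printed in the example above (sequence => occurrences),
# copied into a plain Dict for illustration.
counts = Dict("AAAATT" => 1, "AAAAAAAATTTTTT" => 2)

# Total number of sequences in the sample.
total = sum(values(counts))

# Relative frequency of each distinct sequence.
freqs = Dict(seq => n / total for (seq, n) in counts)
``````

With three sequences in the sample, `AAAATT` has a frequency of one third and `AAAAAAAATTTTTT` of two thirds.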

For example, to get the nucleotide composition of a sequence:

``````
julia> comp = composition(dna"ACGAG")
DNA Composition:
DNA_A => 2
DNA_G => 2
DNA_C => 1

julia> comp[DNA_A]
2

julia> comp[DNA_T]
0
``````

Composition structs behave like an associative collection such as a `Dict`, with two differences:

1. The `getindex` method for Composition structs is overloaded to return a default value of 0 when the key is not present in the Composition.

2. The `merge!` method for two Composition structs adds counts together, unlike the `merge!` method for other associative containers, which would overwrite the counts.
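For contrast, the behaviour of a plain `Dict` can be sketched using only Julia's Base library. This illustrates the semantics described in the two points above and is not how `Composition` is implemented; the symbol keys and variable names are illustrative.

``````julia
a = Dict(:A => 2, :C => 1)
b = Dict(:A => 3, :G => 4)

# Plain `merge!` overwrites: the value from `b` replaces the one already present.
overwritten = merge!(copy(a), b)

# Passing `+` as a combiner adds the counts for shared keys instead,
# which is the additive behaviour described for Composition.
summed = merge(+, a, b)

# Indexing a plain Dict with a missing key (`a[:T]`) throws a KeyError;
# `get` with a default of 0 mimics the zero default described for Composition.
t_count = get(a, :T, 0)
``````

In this sketch `overwritten[:A]` is 3, `summed[:A]` is 5, and `t_count` is 0.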

`merge!` is used to accumulate composition statistics of multiple sequences:

``````
julia> using BioSequences

julia> # example input; any collection of DNA sequences will do
       seqs = [dna"ACGT", dna"AACC"];

julia> # initialize an empty composition counter
       comp = composition(dna"");

julia> # iterate over sequences and accumulate composition statistics into `comp`
       for seq in seqs
           merge!(comp, composition(seq))
       end

julia> # or functional programming style in one line
       foldl((x, y) -> merge(x, composition(y)), composition(dna""), seqs);
``````

`composition` is also applicable to a k-mer iterator:

``````
julia> comp = composition(each(DNAKmer{4}, dna"ACGT"^100));

julia> comp[DNAKmer("ACGT")]
100

julia> comp[DNAKmer("CGTA")]
99
``````