bioinformatics-course/Notebook.org

* Biology Meets Programming: Bioinformatics for Beginners
** Week 1
*** DNA replication

**** Origin of replication (ori)

Locating an ori is key for gene therapy (e.g. viral vectors),
to introduce a theraupetic gene.

**** Computational approaches to find ori in Vibrio Cholerae

***** Exercise: find Pattern

We'll look for the *DnaA box* sequence, using a sliding window, in that case
we will use this function to find out how many times
a sequence appears in the genome:

#+BEGIN_SRC python
def PatternCount(Text, Pattern):
    count = 0
    for i in range(len(Text)-len(Pattern)+1):
        if Text[i:i+len(Pattern)] == Pattern:
            count = count+1
    return count
#+END_SRC

For the second part, we're going to calculate the frequency map of the sequences
 of length /k/:

#+BEGIN_SRC python
def FrequentWords(Text, k):
    words = []
    freq = FrequencyMap(Text, k)
    m = max(freq.values())
    for key in freq:
        if freq[key] == m:
            words.append(key)
    return words


def FrequencyMap(Text, k):
    freq = {}
    n = len(Text)
    for i in range(n - k + 1):
        Pattern = Text[i:i + k]
        freq[Pattern] = 0
    for i in range(n - k + 1):
        Pattern = Text[i:i + k]
        freq[Pattern] += 1
    return freq
#+END_SRC

***** Exercise: Find the reverse complement of a sequence

We're going to generate the reverse complement of a sequence,
which is the complement of a sequence, read in the same direction (5' -> 3').

#+BEGIN_SRC python
def ReverseComplement(Pattern):
    Pattern = Reverse(Pattern)
    Pattern = Complement(Pattern)
    return Pattern


def Reverse(Pattern):
    reversed = Pattern[::-1]
    return reversed


def Complement(Pattern):
    compl = ""
    complement_letters = {"A": "T", "T": "A", "C": "G", "G": "C"}
    for char in Pattern:
        compl += complement_letters[char]
    return compl
#+END_SRC

After using our function on the /Vibrio Cholerae's/ genome, we realize that some
of the frequent /k-mers/ are reverse complements of other frequent ones.

***** Exercise: Find a subsequence within a sequence

We're going to find the ocurrences of a subsquence inside a sequence,
and save the index of the first letter in the sequence.

#+BEGIN_SRC python
def PatternMatching(Pattern, Genome):
    positions = []
    for i in range(len(Genome)-len(Pattern)+1):
        if Genome[i:i+len(Pattern)] == Pattern:
            positions.append(i)
    return positions
#+END_SRC

We find out that the /9-mers/ with the highest frequency appear in cluster.
There is strong statistical evidence that our subsequences are /DnaA boxes/.


**** Computational approaches to find ori in any bacteria

Now that we're pretty confident about the /DnaA boxes/ sequences that we found,
we are going to check if they are a common pattern in the rest of bacterias.
We're going to find the ocurrences of the sequences in /Thermotoga petrophila/:

#+BEGIN_SRC python
def PatternCount(Text, Pattern):
    count = 0
    for i in range(len(Text)-len(Pattern)+1):
        if Text[i:i+len(Pattern)] == Pattern:
            count = count+1
    return count
#+END_SRC

We observe that there are *no* ocurrences of the sequences found in
/Vibrio Cholerae/. We can conclude that different bacterias have
different /DnaA boxes/.

We have to try another computational approach,
find clusters of /k-mers/ repeated in a small interval.

** Week 2
*** DNA replication (II)

**** Replication process

The /DNA polymerases/ start replicating while the parent strands are unraveling.
On the lagging strand, the DNA polymerase waits until the replication fork
opens around 2000 nucleotides, and because of that it forms Okazaki fragments.
We need 1 primer for the leading strand and 1 primer per Okazaki fragment
for the lagging strand. While the Okazaki fragments are being synthetized,
a /DNA ligase/ starts joining the fragments together.

**** Computational approach to find ori using deamination

As the lagging strand is always waiting for the helicase to go forward, the
lagging strand is mostly in single-stranded configuration,
which is more prone to mutations. One frequent form of mutation is
*deamination*,a process that causes cytosine to convert into thymine.
This means that cytosine is more frequent in half of the genome.

***** Exercise: count the ocurrences of cytosine

We're going to count the ocurrences of the bases in a genome and include them in
a symbol array.

#+BEGIN_SRC python
def SymbolArray(Genome, symbol):
    array = {}
    n = len(Genome)
    ExtendedGenome = Genome + Genome[0:n//2]
    for i in range(n):
        array[i] = PatternCount(ExtendedGenome[i:i+(n//2)], symbol)
    return array


def PatternCount(Text, Pattern):
    count = 0
    for i in range(len(Text)-len(Pattern)+1):
        if Text[i:i+len(Pattern)] == Pattern:
            count = count+1
    return count
#+END_SRC

After executing the program, we realize that the algorithm is too inefficient.

***** Exercise: find a better algorithm for the previous exercise

This time, we are going to evaluate an element /i+1/, using the element /i/.

#+BEGIN_SRC python
def FasterSymbolArray(Genome, symbol):
    array = {}
    n = len(Genome)
    ExtendedGenome = Genome + Genome[0:n//2]
    array[0] = PatternCount(symbol, Genome[0:n//2])
    for i in range(1, n):
        array[i] = array[i-1]
        if ExtendedGenome[i-1] == symbol:
            array[i] = array[i]-1
        if ExtendedGenome[i+(n//2)-1] == symbol:
            array[i] = array[i]+1
    return array


def PatternCount(Text, Pattern):
    count = 0
    for i in range(len(Text)-len(Pattern)+1):
        if Text[i:i+len(Pattern)] == Pattern:
            count = count+1
    return count
#+END_SRC

It's a viable algorithm, with a complexity of /O(n)/ instead of the previous /O(n²)/.
In /Escherichia Coli/ we plotted the result of our program:

#+CAPTION: Symbol array for Cytosine in E. Coli Genome]
[[./Assets/e-coli.png]]

We can conclude that ori is located around position 4000000,
because that's where the Cytosine concentration is the lowest,
which indicates that the region stays single-stranded for the longest time.

**** The Skew Diagram

Usually scientists measure the difference between /G - C/, which is
*higher on the lagging strand* and *lower on the leading strand*.

***** Exercise: Synthetize a Skew Array

We're going to make a Skew Diagram, for that we'll first need a Skew Array. For
that purpose we wrote:

#+BEGIN_SRC python
def SkewArray(Genome):
    Skew = []
    Skew.append(0)
    for i in range(0, len(Genome)):
        if Genome[i] == "G":
            Skew.append(Skew[i] + 1)
        elif Genome[i] == "C":
            Skew.append(Skew[i] - 1)
        else:
            Skew.append(Skew[i])
    return Skew
#+END_SRC

We can see the utility of a Skew Diagram looking at the one from /Escherichia Coli/:

#+CAPTION: Symbol array for Cytosine in E. Coli Genome]
[[./Assets/skew_diagram.png]]

Ori should be located where the skew is at its minimum value.

***** Exercise: Efficient algorithm for locating ori

Now that we know more about ori's skew value, we're going to construct a better
algorithm to find it:

#+BEGIN_SRC python
def MinimumSkew(Genome):
    positions = []
    skew = SkewArray(Genome)
    minimum = min(skew)
    return [i for i in range(0, len(Genome)) if skew[i] == minimum]


def SkewArray(Genome):
    Skew = []
    Skew.append(0)
    for i in range(0, len(Genome)):
        if Genome[i] == "G":
            Skew.append(Skew[i] + 1)
        elif Genome[i] == "C":
            Skew.append(Skew[i] - 1)
        else:
            Skew.append(Skew[i])
    return Skew
#+END_SRC


**** Finding /DnaA boxes/

When we look for /DnaA boxes/ in the minimal skew region,
we can't find highly repeated /9-mers/ in /Escherichia Coli/. But we found
approximate sequences that are similar to our /9-mers/ and only differ in 1 nucleotide.

***** Exercise: Calculate Hamming distance

The Hamming distance is the number of mismatches between 2 strings.

#+BEGIN_SRC python
def HammingDistance(p, q):
    count = 0
    for i in range(0, len(p)):
        if p[i] != q[i]:
            count += 1
    return count
#+END_SRC

***** Exercise: Find approximate patterns

Now that we have our Hamming distance, we use it to find
the approximate sequences:

#+BEGIN_SRC python
def ApproximatePatternMatching(Text, Pattern, d):
    positions = []
    for i in range(len(Text)-len(Pattern)+1):
        if Text[i:i+len(Pattern)] == Pattern:
            positions.append(i)
        elif HammingDistance(Text[i:i+len(Pattern)], Pattern) <= d:
            positions.append(i)
    return positions


def HammingDistance(p, q):
    count = 0
    for i in range(0, len(p)):
        if p[i] != q[i]:
            count += 1
    return count
#+END_SRC

***** Exercise: Count the approximate patterns

The final part is counting the approximate sequences:

#+BEGIN_SRC python
def ApproximatePatternCount(Pattern, Text, d):
    count = 0
    for i in range(len(Text) - len(Pattern) + 1):
        if (
            Text[i : i + len(Pattern)] == Pattern
            or HammingDistance(Text[i : i + len(Pattern)], Pattern) <= d
        ):
            count += 1
    return count


def HammingDistance(p, q):
    count = 0
    for i in range(0, len(p)):
        if p[i] != q[i]:
            count += 1
    return count
#+END_SRC

After trying out our ApproximatePatternCount in the hypothesized ori region,
we find a frequent /k-mer/ with its reverse complement in /Escherichia Coli/.
We've finally found a computational method to find ori that seems correct.

** Week 3

*** The circadian clock

Variation in gene expression permits the cell to keep track of time.

**** Computational approaches to find regulatory motifs

***** Exercise: Find the most common nucleotides in each position

We are going to create a *t x k* Motif Matrix, where *t* is the /k-mer/ string.
In each position, we'll insert the most frequent nucleotide, in upper case,
and the nucleotide in lower case (if there's no popular one).
Our goal is to select the *most* conserved Matrix, i.e. the Matrix with the most
upper case letters. We'll use a *4 x k* Count Matrix, one row for each base.

#+BEGIN_SRC python
def Count(Motifs):
    k = len(Motifs[0])
    count = {'A': [0]*k, 'C': [0]*k, 'G': [0]*k, 'T': [0] * k}
    t = len(Motifs)
    for i in range(t):
        for j in range(k):
            symbol = Motifs[i][j]
            count[symbol][j] += 1
    return count
#+END_SRC

Now that we have a Count Matrix, we will generate a Profile Matrix, which has
 the frequency of the nucleotide instead of the count:

#+BEGIN_SRC python
def Profile(Motifs):
    t = len(Motifs)
    profile = Count(Motifs)
    for key, v in profile.items():
        v[:] = [x / t for x in v]
    return profile


def Count(Motifs):
    k = len(Motifs[0])
    count = {'A': [0]*k, 'C': [0]*k, 'G': [0]*k, 'T': [0] * k}
    t = len(Motifs)
    for i in range(t):
        for j in range(k):
            symbol = Motifs[i][j]
            count[symbol][j] += 1
    return count
#+END_SRC

***** Exercise: Form the most frequent sequence of nucleotides

Finally, we can form a Consensus string, to get a candidate regulatory motif:
#+BEGIN_SRC python
def Consensus(Motifs):
    consensus = ""
    count = Count(Motifs)
    k = len(Motifs[0])
    for j in range(k):
        m = 0
        frequentSymbol = ""
        for symbol in "ACGT":
            if count[symbol][j] > m:
                m = count[symbol][j]
                frequentSymbol = symbol
        consensus += frequentSymbol
    return consensus


def Count(Motifs):
    k = len(Motifs[0])
    count = {'A': [0]*k, 'C': [0]*k, 'G': [0]*k, 'T': [0] * k}
    t = len(Motifs)
    for i in range(t):
        for j in range(k):
            symbol = Motifs[i][j]
            count[symbol][j] += 1
    return count
#+END_SRC

After obtaining the Consensus string, all we need to do is obtain the total
score of our selected /k-mers/:

#+BEGIN_SRC python
def Score(Motifs):
    score = 0
    for i in range(len(Motifs)):
        score += HammingDistance(Motifs[i], Consensus(Motifs))
    return score


def Consensus(Motifs):
    consensus = ""
    count = Count(Motifs)
    k = len(Motifs[0])
    for j in range(k):
        m = 0
        frequentSymbol = ""
        for symbol in "ACGT":
            if count[symbol][j] > m:
                m = count[symbol][j]
                frequentSymbol = symbol
        consensus += frequentSymbol
    return consensus


def Count(Motifs):
    k = len(Motifs[0])
    count = {'A': [0]*k, 'C': [0]*k, 'G': [0]*k, 'T': [0] * k}
    t = len(Motifs)
    for i in range(t):
        for j in range(k):
            symbol = Motifs[i][j]
            count[symbol][j] += 1
    return count


def HammingDistance(p, q):
    count = 0
    for i in range(0, len(p)):
        if p[i] != q[i]:
            count += 1
    return count
#+END_SRC

***** Exercise: Find a set of /k-mers/ that minimize the score


Applying a brute force approach for this task is not viable, we'll use a Greedy
Algorithm. We first have to determine the probability of a sequence:

#+BEGIN_SRC python
def Pr(Text, Profile):
    probability = 1
    k = len(Text)
    for i in range(k):
        probability *= Profile[Text[i]][i]
    return probability
#+END_SRC

We'll use this function to find the most probable /k-mer/ in a sequence:

#+BEGIN_SRC python
def ProfileMostProbableKmer(text, k, profile):
    kmer = ""
    keys = ["A", "C", "G", "T"]
    d = dict(zip(keys,profile))
    prob = -1
    for i in range(len(text)-k+1):
        if (Pr((text[i:i+k]), profile) > prob):
            prob = Pr(text[i:i+k], profile)
            kmer = text[i:i+k]
    return kmer


def Pr(Text, Profile):
    probability = 1
    k = len(Text)
    for i in range(k):
        probability *= Profile[Text[i]][i]
    return probability
#+END_SRC

Now we're finally ready to assemble all the pieces and implement a Greedy Motif
Search Algorithm:

#+BEGIN_SRC python
def GreedyMotifSearch(Dna, k, t):
    BestMotifs = []
    for i in range(0, t):
        BestMotifs.append(Dna[i][0:k])
    n = len(Dna[0])
    for i in range(n-k+1):
        Motifs = []
        Motifs.append(Dna[0][i:i+k])
        for j in range(1, t):
            P = Profile(Motifs[0:j])
            Motifs.append(ProfileMostProbableKmer(Dna[j], k, P))
        if Score(Motifs) < Score(BestMotifs):
            BestMotifs = Motifs
    return BestMotifs

def Score(Motifs):
    score = 0
    for i in range(len(Motifs)):
        score += HammingDistance(Motifs[i], Consensus(Motifs))
    return score


def Consensus(Motifs):
    consensus = ""
    count = Count(Motifs)
    k = len(Motifs[0])
    for j in range(k):
        m = 0
        frequentSymbol = ""
        for symbol in "ACGT":
            if count[symbol][j] > m:
                m = count[symbol][j]
                frequentSymbol = symbol
        consensus += frequentSymbol
    return consensus


def Count(Motifs):
    k = len(Motifs[0])
    count = {'A': [0]*k, 'C': [0]*k, 'G': [0]*k, 'T': [0] * k}
    t = len(Motifs)
    for i in range(t):
        for j in range(k):
            symbol = Motifs[i][j]
            count[symbol][j] += 1
    return count


def HammingDistance(p, q):
    count = 0
    for i in range(0, len(p)):
        if p[i] != q[i]:
            count += 1
    return count


def Profile(Motifs):
    t = len(Motifs)
    profile = Count(Motifs)
    for key, v in profile.items():
        v[:] = [x / t for x in v]
    return profile


def ProfileMostProbableKmer(text, k, profile):
    kmer = ""
    keys = ["A", "C", "G", "T"]
    d = dict(zip(keys,profile))
    prob = -1
    for i in range(len(text)-k+1):
        if (Pr((text[i:i+k]), profile) > prob):
            prob = Pr(text[i:i+k], profile)
            kmer = text[i:i+k]
    return kmer


def Pr(Text, Profile):
    probability = 1
    k = len(Text)
    for i in range(k):
        probability *= Profile[Text[i]][i]
    return probability
#+END_SRC

***** Motifs in tuberculosis

Tuberculosis is an infectious disease, caused by a bacteria called /Mycobacterium
tuberculosis/. The bacteria can stay latent in the host for decades, in hypoxic
environments.
Our Greedy Algorithm can help us identify a motif that might be involved
in the process.

The transcription factor behind this behaviour is *DosR*, we'll identify the
binding sites:

#+BEGIN_SRC python
def GreedyMotifSearch(Dna, k, t):
    BestMotifs = []
    for i in range(0, t):
        BestMotifs.append(Dna[i][0:k])
    n = len(Dna[0])
    for i in range(n - k + 1):
        Motifs = []
        Motifs.append(Dna[0][i : i + k])
        for j in range(1, t):
            P = Profile(Motifs[0:j])
            Motifs.append(ProfileMostProbableKmer(Dna[j], k, P))
        if Score(Motifs) < Score(BestMotifs):
            BestMotifs = Motifs
    return BestMotifs


def Score(Motifs):
    score = 0
    for i in range(len(Motifs)):
        score += HammingDistance(Motifs[i], Consensus(Motifs))
    return score


def Consensus(Motifs):
    consensus = ""
    count = Count(Motifs)
    k = len(Motifs[0])
    for j in range(k):
        m = 0
        frequentSymbol = ""
        for symbol in "ACGT":
            if count[symbol][j] > m:
                m = count[symbol][j]
                frequentSymbol = symbol
        consensus += frequentSymbol
    return consensus


def Count(Motifs):
    k = len(Motifs[0])
    count = {"A": [0] * k, "C": [0] * k, "G": [0] * k, "T": [0] * k}
    t = len(Motifs)
    for i in range(t):
        for j in range(k):
            symbol = Motifs[i][j]
            count[symbol][j] += 1
    return count


def HammingDistance(p, q):
    count = 0
    for i in range(0, len(p)):
        if p[i] != q[i]:
            count += 1
    return count


def Profile(Motifs):
    t = len(Motifs)
    profile = Count(Motifs)
    for key, v in profile.items():
        v[:] = [x / t for x in v]
    return profile


def ProfileMostProbableKmer(text, k, profile):
    kmer = ""
    keys = ["A", "C", "G", "T"]
    d = dict(zip(keys, profile))
    prob = -1
    for i in range(len(text) - k + 1):
        if Pr((text[i : i + k]), profile) > prob:
            prob = Pr(text[i : i + k], profile)
            kmer = text[i : i + k]
    return kmer


def Pr(Text, Profile):
    probability = 1
    k = len(Text)
    for i in range(k):
        probability *= Profile[Text[i]][i]
    return probability


Dna = [
    "GCGCCCCGCCCGGACAGCCATGCGCTAACCCTGGCTTCGATGGCGCCGGCTCAGTTAGGGCCGGAAGTCCCCAATGTGGCAGACCTTTCGCCCCTGGCGGACGAATGACCCCAGTGGCCGGGACTTCAGGCCCTATCGGAGGGCTCCGGCGCGGTGGTCGGATTTGTCTGTGGAGGTTACACCCCAATCGCAAGGATGCATTATGACCAGCGAGCTGAGCCTGGTCGCCACTGGAAAGGGGAGCAACATC",
    "CCGATCGGCATCACTATCGGTCCTGCGGCCGCCCATAGCGCTATATCCGGCTGGTGAAATCAATTGACAACCTTCGACTTTGAGGTGGCCTACGGCGAGGACAAGCCAGGCAAGCCAGCTGCCTCAACGCGCGCCAGTACGGGTCCATCGACCCGCGGCCCACGGGTCAAACGACCCTAGTGTTCGCTACGACGTGGTCGTACCTTCGGCAGCAGATCAGCAATAGCACCCCGACTCGAGGAGGATCCCG",
    "ACCGTCGATGTGCCCGGTCGCGCCGCGTCCACCTCGGTCATCGACCCCACGATGAGGACGCCATCGGCCGCGACCAAGCCCCGTGAAACTCTGACGGCGTGCTGGCCGGGCTGCGGCACCTGATCACCTTAGGGCACTTGGGCCACCACAACGGGCCGCCGGTCTCGACAGTGGCCACCACCACACAGGTGACTTCCGGCGGGACGTAAGTCCCTAACGCGTCGTTCCGCACGCGGTTAGCTTTGCTGCC",
    "GGGTCAGGTATATTTATCGCACACTTGGGCACATGACACACAAGCGCCAGAATCCCGGACCGAACCGAGCACCGTGGGTGGGCAGCCTCCATACAGCGATGACCTGATCGATCATCGGCCAGGGCGCCGGGCTTCCAACCGTGGCCGTCTCAGTACCCAGCCTCATTGACCCTTCGACGCATCCACTGCGCGTAAGTCGGCTCAACCCTTTCAAACCGCTGGATTACCGACCGCAGAAAGGGGGCAGGAC",
    "GTAGGTCAAACCGGGTGTACATACCCGCTCAATCGCCCAGCACTTCGGGCAGATCACCGGGTTTCCCCGGTATCACCAATACTGCCACCAAACACAGCAGGCGGGAAGGGGCGAAAGTCCCTTATCCGACAATAAAACTTCGCTTGTTCGACGCCCGGTTCACCCGATATGCACGGCGCCCAGCCATTCGTGACCGACGTCCCCAGCCCCAAGGCCGAACGACCCTAGGAGCCACGAGCAATTCACAGCG",
    "CCGCTGGCGACGCTGTTCGCCGGCAGCGTGCGTGACGACTTCGAGCTGCCCGACTACACCTGGTGACCACCGCCGACGGGCACCTCTCCGCCAGGTAGGCACGGTTTGTCGCCGGCAATGTGACCTTTGGGCGCGGTCTTGAGGACCTTCGGCCCCACCCACGAGGCCGCCGCCGGCCGATCGTATGACGTGCAATGTACGCCATAGGGTGCGTGTTACGGCGATTACCTGAAGGCGGCGGTGGTCCGGA",
    "GGCCAACTGCACCGCGCTCTTGATGACATCGGTGGTCACCATGGTGTCCGGCATGATCAACCTCCGCTGTTCGATATCACCCCGATCTTTCTGAACGGCGGTTGGCAGACAACAGGGTCAATGGTCCCCAAGTGGATCACCGACGGGCGCGGACAAATGGCCCGCGCTTCGGGGACTTCTGTCCCTAGCCCTGGCCACGATGGGCTGGTCGGATCAAAGGCATCCGTTTCCATCGATTAGGAGGCATCAA",
    "GTACATGTCCAGAGCGAGCCTCAGCTTCTGCGCAGCGACGGAAACTGCCACACTCAAAGCCTACTGGGCGCACGTGTGGCAACGAGTCGATCCACACGAAATGCCGCCGTTGGGCCGCGGACTAGCCGAATTTTCCGGGTGGTGACACAGCCCACATTTGGCATGGGACTTTCGGCCCTGTCCGCGTCCGTGTCGGCCAGACAAGCTTTGGGCATTGGCCACAATCGGGCCACAATCGAAAGCCGAGCAG",
    "GGCAGCTGTCGGCAACTGTAAGCCATTTCTGGGACTTTGCTGTGAAAAGCTGGGCGATGGTTGTGGACCTGGACGAGCCACCCGTGCGATAGGTGAGATTCATTCTCGCCCTGACGGGTTGCGTCTGTCATCGGTCGATAAGGACTAACGGCCCTCAGGTGGGGACCAACGCCCCTGGGAGATAGCGGTCCCCGCCAGTAACGTACCGCTGAACCGACGGGATGTATCCGCCCCAGCGAAGGAGACGGCG",
    "TCAGCACCATGACCGCCTGGCCACCAATCGCCCGTAACAAGCGGGACGTCCGCGACGACGCGTGCGCTAGCGCCGTGGCGGTGACAACGACCAGATATGGTCCGAGCACGCGGGCGAACCTCGTGTTCTGGCCTCGGCCAGTTGTGTAGAGCTCATCGCTGTCATCGAGCGATATCCGACCACTGATCCAAGTCGGGGGCTCTGGGGACCGAAGTCCCCGGGCTCGGAGCTATCGGACCTCACGATCACC",
]

t = len(Dna)
k = 15

Motifs = GreedyMotifSearch(Dna, k, t)

print(Motifs)
print(Score(Motifs))
#+END_SRC

#+RESULTS:
: ['GTTAGGGCCGGAAGT', 'CCGATCGGCATCACT', 'ACCGTCGATGTGCCC', 'GGGTCAGGTATATTT', 'GTGACCGACGTCCCC', 'CTGTTCGCCGGCAGC', 'CTGTTCGATATCACC', 'GTACATGTCCAGAGC', 'GCGATAGGTGAGATT', 'CTCATCGCTGTCATC']
: 64

Our algorithm is pretty fast, but it's not optimal, and that's just a
characteristic of Greedy Algorithms, they trade optimality for speed.

** Vocabulary
- k-mer: subsquences of length /k/ in a biological sequence
- Frequency map: sequence --> frequency of the sequence