713 lines
21 KiB
Org Mode
713 lines
21 KiB
Org Mode
* Biology Meets Programming: Bioinformatics for Beginners
|
|
** Week 1
|
|
*** DNA replication
|
|
|
|
**** Origin of replication (ori)
|
|
|
|
Locating an ori is key for gene therapy (e.g. viral vectors),
|
|
to introduce a theraupetic gene.
|
|
|
|
**** Computational approaches to find ori in Vibrio Cholerae
|
|
|
|
***** Exercise: find Pattern
|
|
|
|
We'll look for the *DnaA box* sequence, using a sliding window, in that case
|
|
we will use this function to find out how many times
|
|
a sequence appears in the genome:
|
|
|
|
#+BEGIN_SRC python
|
|
def PatternCount(Text, Pattern):
|
|
count = 0
|
|
for i in range(len(Text)-len(Pattern)+1):
|
|
if Text[i:i+len(Pattern)] == Pattern:
|
|
count = count+1
|
|
return count
|
|
#+END_SRC
|
|
|
|
For the second part, we're going to calculate the frequency map of the sequences
|
|
of length /k/:
|
|
|
|
#+BEGIN_SRC python
|
|
def FrequentWords(Text, k):
|
|
words = []
|
|
freq = FrequencyMap(Text, k)
|
|
m = max(freq.values())
|
|
for key in freq:
|
|
if freq[key] == m:
|
|
words.append(key)
|
|
return words
|
|
|
|
|
|
def FrequencyMap(Text, k):
|
|
freq = {}
|
|
n = len(Text)
|
|
for i in range(n - k + 1):
|
|
Pattern = Text[i:i + k]
|
|
freq[Pattern] = 0
|
|
for i in range(n - k + 1):
|
|
Pattern = Text[i:i + k]
|
|
freq[Pattern] += 1
|
|
return freq
|
|
#+END_SRC
|
|
|
|
***** Exercise: Find the reverse complement of a sequence
|
|
|
|
We're going to generate the reverse complement of a sequence,
|
|
which is the complement of a sequence, read in the same direction (5' -> 3').
|
|
|
|
#+BEGIN_SRC python
|
|
def ReverseComplement(Pattern):
|
|
Pattern = Reverse(Pattern)
|
|
Pattern = Complement(Pattern)
|
|
return Pattern
|
|
|
|
|
|
def Reverse(Pattern):
|
|
reversed = Pattern[::-1]
|
|
return reversed
|
|
|
|
|
|
def Complement(Pattern):
|
|
compl = ""
|
|
complement_letters = {"A": "T", "T": "A", "C": "G", "G": "C"}
|
|
for char in Pattern:
|
|
compl += complement_letters[char]
|
|
return compl
|
|
#+END_SRC
|
|
|
|
After using our function on the /Vibrio Cholerae's/ genome, we realize that some
|
|
of the frequent /k-mers/ are reverse complements of other frequent ones.
|
|
|
|
***** Exercise: Find a subsequence within a sequence
|
|
|
|
We're going to find the ocurrences of a subsquence inside a sequence,
|
|
and save the index of the first letter in the sequence.
|
|
|
|
#+BEGIN_SRC python
|
|
def PatternMatching(Pattern, Genome):
|
|
positions = []
|
|
for i in range(len(Genome)-len(Pattern)+1):
|
|
if Genome[i:i+len(Pattern)] == Pattern:
|
|
positions.append(i)
|
|
return positions
|
|
#+END_SRC
|
|
|
|
We find out that the /9-mers/ with the highest frequency appear in cluster.
|
|
There is strong statistical evidence that our subsequences are /DnaA boxes/.
|
|
|
|
|
|
**** Computational approaches to find ori in any bacteria
|
|
|
|
Now that we're pretty confident about the /DnaA boxes/ sequences that we found,
|
|
we are going to check if they are a common pattern in the rest of bacterias.
|
|
We're going to find the ocurrences of the sequences in /Thermotoga petrophila/:
|
|
|
|
#+BEGIN_SRC python
|
|
def PatternCount(Text, Pattern):
|
|
count = 0
|
|
for i in range(len(Text)-len(Pattern)+1):
|
|
if Text[i:i+len(Pattern)] == Pattern:
|
|
count = count+1
|
|
return count
|
|
#+END_SRC
|
|
|
|
We observe that there are *no* ocurrences of the sequences found in
|
|
/Vibrio Cholerae/. We can conclude that different bacterias have
|
|
different /DnaA boxes/.
|
|
|
|
We have to try another computational approach,
|
|
find clusters of /k-mers/ repeated in a small interval.
|
|
|
|
** Week 2
|
|
*** DNA replication (II)
|
|
|
|
**** Replication process
|
|
|
|
The /DNA polymerases/ start replicating while the parent strands are unraveling.
|
|
On the lagging strand, the DNA polymerase waits until the replication fork
|
|
opens around 2000 nucleotides, and because of that it forms Okazaki fragments.
|
|
We need 1 primer for the leading strand and 1 primer per Okazaki fragment
|
|
for the lagging strand. While the Okazaki fragments are being synthetized,
|
|
a /DNA ligase/ starts joining the fragments together.
|
|
|
|
**** Computational approach to find ori using deamination
|
|
|
|
As the lagging strand is always waiting for the helicase to go forward, the
|
|
lagging strand is mostly in single-stranded configuration,
|
|
which is more prone to mutations. One frequent form of mutation is
|
|
*deamination*,a process that causes cytosine to convert into thymine.
|
|
This means that cytosine is more frequent in half of the genome.
|
|
|
|
***** Exercise: count the ocurrences of cytosine
|
|
|
|
We're going to count the ocurrences of the bases in a genome and include them in
|
|
a symbol array.
|
|
|
|
#+BEGIN_SRC python
|
|
def SymbolArray(Genome, symbol):
|
|
array = {}
|
|
n = len(Genome)
|
|
ExtendedGenome = Genome + Genome[0:n//2]
|
|
for i in range(n):
|
|
array[i] = PatternCount(ExtendedGenome[i:i+(n//2)], symbol)
|
|
return array
|
|
|
|
|
|
def PatternCount(Text, Pattern):
|
|
count = 0
|
|
for i in range(len(Text)-len(Pattern)+1):
|
|
if Text[i:i+len(Pattern)] == Pattern:
|
|
count = count+1
|
|
return count
|
|
#+END_SRC
|
|
|
|
After executing the program, we realize that the algorithm is too inefficient.
|
|
|
|
***** Exercise: find a better algorithm for the previous exercise
|
|
|
|
This time, we are going to evaluate an element /i+1/, using the element /i/.
|
|
|
|
#+BEGIN_SRC python
|
|
def FasterSymbolArray(Genome, symbol):
|
|
array = {}
|
|
n = len(Genome)
|
|
ExtendedGenome = Genome + Genome[0:n//2]
|
|
array[0] = PatternCount(symbol, Genome[0:n//2])
|
|
for i in range(1, n):
|
|
array[i] = array[i-1]
|
|
if ExtendedGenome[i-1] == symbol:
|
|
array[i] = array[i]-1
|
|
if ExtendedGenome[i+(n//2)-1] == symbol:
|
|
array[i] = array[i]+1
|
|
return array
|
|
|
|
|
|
def PatternCount(Text, Pattern):
|
|
count = 0
|
|
for i in range(len(Text)-len(Pattern)+1):
|
|
if Text[i:i+len(Pattern)] == Pattern:
|
|
count = count+1
|
|
return count
|
|
#+END_SRC
|
|
|
|
It's a viable algorithm, with a complexity of /O(n)/ instead of the previous /O(n²)/.
|
|
In /Escherichia Coli/ we plotted the result of our program:
|
|
|
|
#+CAPTION: Symbol array for Cytosine in E. Coli Genome]
|
|
[[./Assets/e-coli.png]]
|
|
|
|
We can conclude that ori is located around position 4000000,
|
|
because that's where the Cytosine concentration is the lowest,
|
|
which indicates that the region stays single-stranded for the longest time.
|
|
|
|
**** The Skew Diagram
|
|
|
|
Usually scientists measure the difference between /G - C/, which is
|
|
*higher on the lagging strand* and *lower on the leading strand*.
|
|
|
|
***** Exercise: Synthetize a Skew Array
|
|
|
|
We're going to make a Skew Diagram, for that we'll first need a Skew Array. For
|
|
that purpose we wrote:
|
|
|
|
#+BEGIN_SRC python
|
|
def SkewArray(Genome):
|
|
Skew = []
|
|
Skew.append(0)
|
|
for i in range(0, len(Genome)):
|
|
if Genome[i] == "G":
|
|
Skew.append(Skew[i] + 1)
|
|
elif Genome[i] == "C":
|
|
Skew.append(Skew[i] - 1)
|
|
else:
|
|
Skew.append(Skew[i])
|
|
return Skew
|
|
#+END_SRC
|
|
|
|
We can see the utility of a Skew Diagram looking at the one from /Escherichia Coli/:
|
|
|
|
#+CAPTION: Symbol array for Cytosine in E. Coli Genome]
|
|
[[./Assets/skew_diagram.png]]
|
|
|
|
Ori should be located where the skew is at its minimum value.
|
|
|
|
***** Exercise: Efficient algorithm for locating ori
|
|
|
|
Now that we know more about ori's skew value, we're going to construct a better
|
|
algorithm to find it:
|
|
|
|
#+BEGIN_SRC python
|
|
def MinimumSkew(Genome):
|
|
positions = []
|
|
skew = SkewArray(Genome)
|
|
minimum = min(skew)
|
|
return [i for i in range(0, len(Genome)) if skew[i] == minimum]
|
|
|
|
|
|
def SkewArray(Genome):
|
|
Skew = []
|
|
Skew.append(0)
|
|
for i in range(0, len(Genome)):
|
|
if Genome[i] == "G":
|
|
Skew.append(Skew[i] + 1)
|
|
elif Genome[i] == "C":
|
|
Skew.append(Skew[i] - 1)
|
|
else:
|
|
Skew.append(Skew[i])
|
|
return Skew
|
|
#+END_SRC
|
|
|
|
|
|
**** Finding /DnaA boxes/
|
|
|
|
When we look for /DnaA boxes/ in the minimal skew region,
|
|
we can't find highly repeated /9-mers/ in /Escherichia Coli/. But we found
|
|
approximate sequences that are similar to our /9-mers/ and only differ in 1 nucleotide.
|
|
|
|
***** Exercise: Calculate Hamming distance
|
|
|
|
The Hamming distance is the number of mismatches between 2 strings.
|
|
|
|
#+BEGIN_SRC python
|
|
def HammingDistance(p, q):
|
|
count = 0
|
|
for i in range(0, len(p)):
|
|
if p[i] != q[i]:
|
|
count += 1
|
|
return count
|
|
#+END_SRC
|
|
|
|
***** Exercise: Find approximate patterns
|
|
|
|
Now that we have our Hamming distance, we use it to find
|
|
the approximate sequences:
|
|
|
|
#+BEGIN_SRC python
|
|
def ApproximatePatternMatching(Text, Pattern, d):
|
|
positions = []
|
|
for i in range(len(Text)-len(Pattern)+1):
|
|
if Text[i:i+len(Pattern)] == Pattern:
|
|
positions.append(i)
|
|
elif HammingDistance(Text[i:i+len(Pattern)], Pattern) <= d:
|
|
positions.append(i)
|
|
return positions
|
|
|
|
|
|
def HammingDistance(p, q):
|
|
count = 0
|
|
for i in range(0, len(p)):
|
|
if p[i] != q[i]:
|
|
count += 1
|
|
return count
|
|
#+END_SRC
|
|
|
|
***** Exercise: Count the approximate patterns
|
|
|
|
The final part is counting the approximate sequences:
|
|
|
|
#+BEGIN_SRC python
|
|
def ApproximatePatternCount(Pattern, Text, d):
|
|
count = 0
|
|
for i in range(len(Text) - len(Pattern) + 1):
|
|
if (
|
|
Text[i : i + len(Pattern)] == Pattern
|
|
or HammingDistance(Text[i : i + len(Pattern)], Pattern) <= d
|
|
):
|
|
count += 1
|
|
return count
|
|
|
|
|
|
def HammingDistance(p, q):
|
|
count = 0
|
|
for i in range(0, len(p)):
|
|
if p[i] != q[i]:
|
|
count += 1
|
|
return count
|
|
#+END_SRC
|
|
|
|
After trying out our ApproximatePatternCount in the hypothesized ori region,
|
|
we find a frequent /k-mer/ with its reverse complement in /Escherichia Coli/.
|
|
We've finally found a computational method to find ori that seems correct.
|
|
|
|
** Week 3
|
|
|
|
*** The circadian clock
|
|
|
|
Variation in gene expression permits the cell to keep track of time.
|
|
|
|
**** Computational approaches to find regulatory motifs
|
|
|
|
***** Exercise: Find the most common nucleotides in each position
|
|
|
|
We are going to create a *t x k* Motif Matrix, where *t* is the /k-mer/ string.
|
|
In each position, we'll insert the most frequent nucleotide, in upper case,
|
|
and the nucleotide in lower case (if there's no popular one).
|
|
Our goal is to select the *most* conserved Matrix, i.e. the Matrix with the most
|
|
upper case letters. We'll use a *4 x k* Count Matrix, one row for each base.
|
|
|
|
#+BEGIN_SRC python
|
|
def Count(Motifs):
|
|
k = len(Motifs[0])
|
|
count = {'A': [0]*k, 'C': [0]*k, 'G': [0]*k, 'T': [0] * k}
|
|
t = len(Motifs)
|
|
for i in range(t):
|
|
for j in range(k):
|
|
symbol = Motifs[i][j]
|
|
count[symbol][j] += 1
|
|
return count
|
|
#+END_SRC
|
|
|
|
Now that we have a Count Matrix, we will generate a Profile Matrix, which has
|
|
the frequency of the nucleotide instead of the count:
|
|
|
|
#+BEGIN_SRC python
|
|
def Profile(Motifs):
|
|
t = len(Motifs)
|
|
profile = Count(Motifs)
|
|
for key, v in profile.items():
|
|
v[:] = [x / t for x in v]
|
|
return profile
|
|
|
|
|
|
def Count(Motifs):
|
|
k = len(Motifs[0])
|
|
count = {'A': [0]*k, 'C': [0]*k, 'G': [0]*k, 'T': [0] * k}
|
|
t = len(Motifs)
|
|
for i in range(t):
|
|
for j in range(k):
|
|
symbol = Motifs[i][j]
|
|
count[symbol][j] += 1
|
|
return count
|
|
#+END_SRC
|
|
|
|
***** Exercise: Form the most frequent sequence of nucleotides
|
|
|
|
Finally, we can form a Consensus string, to get a candidate regulatory motif:
|
|
#+BEGIN_SRC python
|
|
def Consensus(Motifs):
|
|
consensus = ""
|
|
count = Count(Motifs)
|
|
k = len(Motifs[0])
|
|
for j in range(k):
|
|
m = 0
|
|
frequentSymbol = ""
|
|
for symbol in "ACGT":
|
|
if count[symbol][j] > m:
|
|
m = count[symbol][j]
|
|
frequentSymbol = symbol
|
|
consensus += frequentSymbol
|
|
return consensus
|
|
|
|
|
|
def Count(Motifs):
|
|
k = len(Motifs[0])
|
|
count = {'A': [0]*k, 'C': [0]*k, 'G': [0]*k, 'T': [0] * k}
|
|
t = len(Motifs)
|
|
for i in range(t):
|
|
for j in range(k):
|
|
symbol = Motifs[i][j]
|
|
count[symbol][j] += 1
|
|
return count
|
|
#+END_SRC
|
|
|
|
After obtaining the Consensus string, all we need to do is obtain the total
|
|
score of our selected /k-mers/:
|
|
|
|
#+BEGIN_SRC python
|
|
def Score(Motifs):
|
|
score = 0
|
|
for i in range(len(Motifs)):
|
|
score += HammingDistance(Motifs[i], Consensus(Motifs))
|
|
return score
|
|
|
|
|
|
def Consensus(Motifs):
|
|
consensus = ""
|
|
count = Count(Motifs)
|
|
k = len(Motifs[0])
|
|
for j in range(k):
|
|
m = 0
|
|
frequentSymbol = ""
|
|
for symbol in "ACGT":
|
|
if count[symbol][j] > m:
|
|
m = count[symbol][j]
|
|
frequentSymbol = symbol
|
|
consensus += frequentSymbol
|
|
return consensus
|
|
|
|
|
|
def Count(Motifs):
|
|
k = len(Motifs[0])
|
|
count = {'A': [0]*k, 'C': [0]*k, 'G': [0]*k, 'T': [0] * k}
|
|
t = len(Motifs)
|
|
for i in range(t):
|
|
for j in range(k):
|
|
symbol = Motifs[i][j]
|
|
count[symbol][j] += 1
|
|
return count
|
|
|
|
|
|
def HammingDistance(p, q):
|
|
count = 0
|
|
for i in range(0, len(p)):
|
|
if p[i] != q[i]:
|
|
count += 1
|
|
return count
|
|
#+END_SRC
|
|
|
|
***** Exercise: Find a set of /k-mers/ that minimize the score
|
|
|
|
|
|
Applying a brute force approach for this task is not viable, we'll use a Greedy
|
|
Algorithm. We first have to determine the probability of a sequence:
|
|
|
|
#+BEGIN_SRC python
|
|
def Pr(Text, Profile):
|
|
probability = 1
|
|
k = len(Text)
|
|
for i in range(k):
|
|
probability *= Profile[Text[i]][i]
|
|
return probability
|
|
#+END_SRC
|
|
|
|
We'll use this function to find the most probable /k-mer/ in a sequence:
|
|
|
|
#+BEGIN_SRC python
|
|
def ProfileMostProbableKmer(text, k, profile):
|
|
kmer = ""
|
|
keys = ["A", "C", "G", "T"]
|
|
d = dict(zip(keys,profile))
|
|
prob = -1
|
|
for i in range(len(text)-k+1):
|
|
if (Pr((text[i:i+k]), profile) > prob):
|
|
prob = Pr(text[i:i+k], profile)
|
|
kmer = text[i:i+k]
|
|
return kmer
|
|
|
|
|
|
def Pr(Text, Profile):
|
|
probability = 1
|
|
k = len(Text)
|
|
for i in range(k):
|
|
probability *= Profile[Text[i]][i]
|
|
return probability
|
|
#+END_SRC
|
|
|
|
Now we're finally ready to assemble all the pieces and implement a Greedy Motif
|
|
Search Algorithm:
|
|
|
|
#+BEGIN_SRC python
|
|
def GreedyMotifSearch(Dna, k, t):
|
|
BestMotifs = []
|
|
for i in range(0, t):
|
|
BestMotifs.append(Dna[i][0:k])
|
|
n = len(Dna[0])
|
|
for i in range(n-k+1):
|
|
Motifs = []
|
|
Motifs.append(Dna[0][i:i+k])
|
|
for j in range(1, t):
|
|
P = Profile(Motifs[0:j])
|
|
Motifs.append(ProfileMostProbableKmer(Dna[j], k, P))
|
|
if Score(Motifs) < Score(BestMotifs):
|
|
BestMotifs = Motifs
|
|
return BestMotifs
|
|
|
|
def Score(Motifs):
|
|
score = 0
|
|
for i in range(len(Motifs)):
|
|
score += HammingDistance(Motifs[i], Consensus(Motifs))
|
|
return score
|
|
|
|
|
|
def Consensus(Motifs):
|
|
consensus = ""
|
|
count = Count(Motifs)
|
|
k = len(Motifs[0])
|
|
for j in range(k):
|
|
m = 0
|
|
frequentSymbol = ""
|
|
for symbol in "ACGT":
|
|
if count[symbol][j] > m:
|
|
m = count[symbol][j]
|
|
frequentSymbol = symbol
|
|
consensus += frequentSymbol
|
|
return consensus
|
|
|
|
|
|
def Count(Motifs):
|
|
k = len(Motifs[0])
|
|
count = {'A': [0]*k, 'C': [0]*k, 'G': [0]*k, 'T': [0] * k}
|
|
t = len(Motifs)
|
|
for i in range(t):
|
|
for j in range(k):
|
|
symbol = Motifs[i][j]
|
|
count[symbol][j] += 1
|
|
return count
|
|
|
|
|
|
def HammingDistance(p, q):
|
|
count = 0
|
|
for i in range(0, len(p)):
|
|
if p[i] != q[i]:
|
|
count += 1
|
|
return count
|
|
|
|
|
|
def Profile(Motifs):
|
|
t = len(Motifs)
|
|
profile = Count(Motifs)
|
|
for key, v in profile.items():
|
|
v[:] = [x / t for x in v]
|
|
return profile
|
|
|
|
|
|
def ProfileMostProbableKmer(text, k, profile):
|
|
kmer = ""
|
|
keys = ["A", "C", "G", "T"]
|
|
d = dict(zip(keys,profile))
|
|
prob = -1
|
|
for i in range(len(text)-k+1):
|
|
if (Pr((text[i:i+k]), profile) > prob):
|
|
prob = Pr(text[i:i+k], profile)
|
|
kmer = text[i:i+k]
|
|
return kmer
|
|
|
|
|
|
def Pr(Text, Profile):
|
|
probability = 1
|
|
k = len(Text)
|
|
for i in range(k):
|
|
probability *= Profile[Text[i]][i]
|
|
return probability
|
|
#+END_SRC
|
|
|
|
***** Motifs in tuberculosis
|
|
|
|
Tuberculosis is an infectious disease, caused by a bacteria called /Mycobacterium
|
|
tuberculosis/. The bacteria can stay latent in the host for decades, in hypoxic
|
|
environments.
|
|
Our Greedy Algorithm can help us identify a motif that might be involved
|
|
in the process.
|
|
|
|
The transcription factor behind this behaviour is *DosR*, we'll identify the
|
|
binding sites:
|
|
|
|
#+BEGIN_SRC python
|
|
def GreedyMotifSearch(Dna, k, t):
|
|
BestMotifs = []
|
|
for i in range(0, t):
|
|
BestMotifs.append(Dna[i][0:k])
|
|
n = len(Dna[0])
|
|
for i in range(n - k + 1):
|
|
Motifs = []
|
|
Motifs.append(Dna[0][i : i + k])
|
|
for j in range(1, t):
|
|
P = Profile(Motifs[0:j])
|
|
Motifs.append(ProfileMostProbableKmer(Dna[j], k, P))
|
|
if Score(Motifs) < Score(BestMotifs):
|
|
BestMotifs = Motifs
|
|
return BestMotifs
|
|
|
|
|
|
def Score(Motifs):
|
|
score = 0
|
|
for i in range(len(Motifs)):
|
|
score += HammingDistance(Motifs[i], Consensus(Motifs))
|
|
return score
|
|
|
|
|
|
def Consensus(Motifs):
|
|
consensus = ""
|
|
count = Count(Motifs)
|
|
k = len(Motifs[0])
|
|
for j in range(k):
|
|
m = 0
|
|
frequentSymbol = ""
|
|
for symbol in "ACGT":
|
|
if count[symbol][j] > m:
|
|
m = count[symbol][j]
|
|
frequentSymbol = symbol
|
|
consensus += frequentSymbol
|
|
return consensus
|
|
|
|
|
|
def Count(Motifs):
|
|
k = len(Motifs[0])
|
|
count = {"A": [0] * k, "C": [0] * k, "G": [0] * k, "T": [0] * k}
|
|
t = len(Motifs)
|
|
for i in range(t):
|
|
for j in range(k):
|
|
symbol = Motifs[i][j]
|
|
count[symbol][j] += 1
|
|
return count
|
|
|
|
|
|
def HammingDistance(p, q):
|
|
count = 0
|
|
for i in range(0, len(p)):
|
|
if p[i] != q[i]:
|
|
count += 1
|
|
return count
|
|
|
|
|
|
def Profile(Motifs):
|
|
t = len(Motifs)
|
|
profile = Count(Motifs)
|
|
for key, v in profile.items():
|
|
v[:] = [x / t for x in v]
|
|
return profile
|
|
|
|
|
|
def ProfileMostProbableKmer(text, k, profile):
|
|
kmer = ""
|
|
keys = ["A", "C", "G", "T"]
|
|
d = dict(zip(keys, profile))
|
|
prob = -1
|
|
for i in range(len(text) - k + 1):
|
|
if Pr((text[i : i + k]), profile) > prob:
|
|
prob = Pr(text[i : i + k], profile)
|
|
kmer = text[i : i + k]
|
|
return kmer
|
|
|
|
|
|
def Pr(Text, Profile):
|
|
probability = 1
|
|
k = len(Text)
|
|
for i in range(k):
|
|
probability *= Profile[Text[i]][i]
|
|
return probability
|
|
|
|
|
|
Dna = [
|
|
"GCGCCCCGCCCGGACAGCCATGCGCTAACCCTGGCTTCGATGGCGCCGGCTCAGTTAGGGCCGGAAGTCCCCAATGTGGCAGACCTTTCGCCCCTGGCGGACGAATGACCCCAGTGGCCGGGACTTCAGGCCCTATCGGAGGGCTCCGGCGCGGTGGTCGGATTTGTCTGTGGAGGTTACACCCCAATCGCAAGGATGCATTATGACCAGCGAGCTGAGCCTGGTCGCCACTGGAAAGGGGAGCAACATC",
|
|
"CCGATCGGCATCACTATCGGTCCTGCGGCCGCCCATAGCGCTATATCCGGCTGGTGAAATCAATTGACAACCTTCGACTTTGAGGTGGCCTACGGCGAGGACAAGCCAGGCAAGCCAGCTGCCTCAACGCGCGCCAGTACGGGTCCATCGACCCGCGGCCCACGGGTCAAACGACCCTAGTGTTCGCTACGACGTGGTCGTACCTTCGGCAGCAGATCAGCAATAGCACCCCGACTCGAGGAGGATCCCG",
|
|
"ACCGTCGATGTGCCCGGTCGCGCCGCGTCCACCTCGGTCATCGACCCCACGATGAGGACGCCATCGGCCGCGACCAAGCCCCGTGAAACTCTGACGGCGTGCTGGCCGGGCTGCGGCACCTGATCACCTTAGGGCACTTGGGCCACCACAACGGGCCGCCGGTCTCGACAGTGGCCACCACCACACAGGTGACTTCCGGCGGGACGTAAGTCCCTAACGCGTCGTTCCGCACGCGGTTAGCTTTGCTGCC",
|
|
"GGGTCAGGTATATTTATCGCACACTTGGGCACATGACACACAAGCGCCAGAATCCCGGACCGAACCGAGCACCGTGGGTGGGCAGCCTCCATACAGCGATGACCTGATCGATCATCGGCCAGGGCGCCGGGCTTCCAACCGTGGCCGTCTCAGTACCCAGCCTCATTGACCCTTCGACGCATCCACTGCGCGTAAGTCGGCTCAACCCTTTCAAACCGCTGGATTACCGACCGCAGAAAGGGGGCAGGAC",
|
|
"GTAGGTCAAACCGGGTGTACATACCCGCTCAATCGCCCAGCACTTCGGGCAGATCACCGGGTTTCCCCGGTATCACCAATACTGCCACCAAACACAGCAGGCGGGAAGGGGCGAAAGTCCCTTATCCGACAATAAAACTTCGCTTGTTCGACGCCCGGTTCACCCGATATGCACGGCGCCCAGCCATTCGTGACCGACGTCCCCAGCCCCAAGGCCGAACGACCCTAGGAGCCACGAGCAATTCACAGCG",
|
|
"CCGCTGGCGACGCTGTTCGCCGGCAGCGTGCGTGACGACTTCGAGCTGCCCGACTACACCTGGTGACCACCGCCGACGGGCACCTCTCCGCCAGGTAGGCACGGTTTGTCGCCGGCAATGTGACCTTTGGGCGCGGTCTTGAGGACCTTCGGCCCCACCCACGAGGCCGCCGCCGGCCGATCGTATGACGTGCAATGTACGCCATAGGGTGCGTGTTACGGCGATTACCTGAAGGCGGCGGTGGTCCGGA",
|
|
"GGCCAACTGCACCGCGCTCTTGATGACATCGGTGGTCACCATGGTGTCCGGCATGATCAACCTCCGCTGTTCGATATCACCCCGATCTTTCTGAACGGCGGTTGGCAGACAACAGGGTCAATGGTCCCCAAGTGGATCACCGACGGGCGCGGACAAATGGCCCGCGCTTCGGGGACTTCTGTCCCTAGCCCTGGCCACGATGGGCTGGTCGGATCAAAGGCATCCGTTTCCATCGATTAGGAGGCATCAA",
|
|
"GTACATGTCCAGAGCGAGCCTCAGCTTCTGCGCAGCGACGGAAACTGCCACACTCAAAGCCTACTGGGCGCACGTGTGGCAACGAGTCGATCCACACGAAATGCCGCCGTTGGGCCGCGGACTAGCCGAATTTTCCGGGTGGTGACACAGCCCACATTTGGCATGGGACTTTCGGCCCTGTCCGCGTCCGTGTCGGCCAGACAAGCTTTGGGCATTGGCCACAATCGGGCCACAATCGAAAGCCGAGCAG",
|
|
"GGCAGCTGTCGGCAACTGTAAGCCATTTCTGGGACTTTGCTGTGAAAAGCTGGGCGATGGTTGTGGACCTGGACGAGCCACCCGTGCGATAGGTGAGATTCATTCTCGCCCTGACGGGTTGCGTCTGTCATCGGTCGATAAGGACTAACGGCCCTCAGGTGGGGACCAACGCCCCTGGGAGATAGCGGTCCCCGCCAGTAACGTACCGCTGAACCGACGGGATGTATCCGCCCCAGCGAAGGAGACGGCG",
|
|
"TCAGCACCATGACCGCCTGGCCACCAATCGCCCGTAACAAGCGGGACGTCCGCGACGACGCGTGCGCTAGCGCCGTGGCGGTGACAACGACCAGATATGGTCCGAGCACGCGGGCGAACCTCGTGTTCTGGCCTCGGCCAGTTGTGTAGAGCTCATCGCTGTCATCGAGCGATATCCGACCACTGATCCAAGTCGGGGGCTCTGGGGACCGAAGTCCCCGGGCTCGGAGCTATCGGACCTCACGATCACC",
|
|
]
|
|
|
|
t = len(Dna)
|
|
k = 15
|
|
|
|
Motifs = GreedyMotifSearch(Dna, k, t)
|
|
|
|
print(Motifs)
|
|
print(Score(Motifs))
|
|
#+END_SRC
|
|
|
|
#+RESULTS:
|
|
: ['GTTAGGGCCGGAAGT', 'CCGATCGGCATCACT', 'ACCGTCGATGTGCCC', 'GGGTCAGGTATATTT', 'GTGACCGACGTCCCC', 'CTGTTCGCCGGCAGC', 'CTGTTCGATATCACC', 'GTACATGTCCAGAGC', 'GCGATAGGTGAGATT', 'CTCATCGCTGTCATC']
|
|
: 64
|
|
|
|
Our algorithm is pretty fast, but it's not optimal, and that's just a
|
|
characteristic of Greedy Algorithms, they trade optimality for speed.
|
|
|
|
** Vocabulary
|
|
- k-mer: subsquences of length /k/ in a biological sequence
|
|
- Frequency map: sequence --> frequency of the sequence
|