Refactor after proof reading
parent
08e92db1eb
commit
98cc6ebdb9
122
Notebook.org
@ -1,19 +1,19 @@
|
|||||||
* Biology Meets Programming: Bioinformatics for Beginners
|
* Biology Meets Programming: Bioinformatics for Beginners
|
||||||
|
|
||||||
** Week 1
|
** Week 1
|
||||||
|
|
||||||
*** DNA replication
|
*** DNA replication
|
||||||
|
|
||||||
**** Origin of replication (ori)
|
**** Origin of replication (ori)
|
||||||
|
|
||||||
Locating an ori is key for gene therapy (e.g. viral vectors), to introduce a therapeutic gene.
|
Locating an ori is key for gene therapy (e.g. viral vectors),
|
||||||
|
to introduce a therapeutic gene.
|
||||||
|
|
||||||
**** Computational approaches to find ori in Vibrio cholerae
|
**** Computational approaches to find ori in Vibrio cholerae
|
||||||
|
|
||||||
***** Exercise: find Pattern
|
***** Exercise: find Pattern
|
||||||
|
|
||||||
We'll look for the *DnaA box* sequence, using a sliding window, in that case we will use this function to find out how many times
|
We'll look for the *DnaA box* sequence using a sliding window. The
|
||||||
does a sequence appear in the genome:
|
following function counts how many times
|
||||||
|
a sequence appears in the genome:
|
||||||
|
|
||||||
#+BEGIN_SRC python
|
#+BEGIN_SRC python
|
||||||
def PatternCount(Text, Pattern):
|
def PatternCount(Text, Pattern):
|
||||||
@ -25,7 +25,7 @@ def PatternCount(Text, Pattern):
|
|||||||
#+END_SRC
|
#+END_SRC
|
||||||
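As a rough illustration of the sliding-window idea, the sketch below moves a window of the pattern's length across the text and counts exact matches, overlapping ones included. The name `pattern_count` is illustrative and not taken from the notebook, whose own function body is elided in this diff.

```python
def pattern_count(text, pattern):
    """Count occurrences of pattern in text, overlapping ones included."""
    k = len(pattern)
    count = 0
    # Slide a window of length k over every valid start position.
    for i in range(len(text) - k + 1):
        if text[i:i + k] == pattern:
            count += 1
    return count


# "GCG" starts at positions 0 and 2, so overlapping matches are counted.
print(pattern_count("GCGCG", "GCG"))
```

A window-by-window comparison is used here instead of `str.count`, because `str.count` skips overlapping matches.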
|
|
||||||
For the second part, we're going to calculate the frequency map of the sequences
|
For the second part, we're going to calculate the frequency map of the sequences
|
||||||
of length /k/, for that purpose we'll use:
|
of length /k/:
|
||||||
|
|
||||||
#+BEGIN_SRC python
|
#+BEGIN_SRC python
|
||||||
def FrequentWords(Text, k):
|
def FrequentWords(Text, k):
|
||||||
@ -52,8 +52,8 @@ def FrequencyMap(Text, k):
|
|||||||
|
|
||||||
***** Exercise: Find the reverse complement of a sequence
|
***** Exercise: Find the reverse complement of a sequence
|
||||||
|
|
||||||
We're going to generate the reverse complement of a sequence, which is the complement of a sequence, read in the same direction (5' -> 3').
|
We're going to generate the reverse complement of a sequence,
|
||||||
In this case, we're going to use:
|
which is the complementary strand of the sequence, read in the same direction (5' -> 3').
|
||||||
|
|
||||||
#+BEGIN_SRC python
|
#+BEGIN_SRC python
|
||||||
def ReverseComplement(Pattern):
|
def ReverseComplement(Pattern):
|
||||||
@ -75,12 +75,13 @@ def Complement(Pattern):
|
|||||||
    return compl
|
    return compl
|
||||||
#+END_SRC
|
#+END_SRC
|
||||||
|
|
||||||
After using our function on the /Vibrio cholerae/ genome, we realize that some of the frequent /k-mers/ are reverse complements of other frequent ones.
|
After using our function on the /Vibrio cholerae/ genome, we realize that some
|
||||||
|
of the frequent /k-mers/ are reverse complements of other frequent ones.
|
||||||
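The reverse complement described in this exercise can be sketched in a few lines. This is a minimal version that assumes upper-case input containing only A, C, G and T; the names `COMPLEMENT` and `reverse_complement` are illustrative, not the notebook's own.

```python
# Watson-Crick base pairing (assumes upper-case A, C, G, T only).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(pattern):
    """Complement every base, then reverse so the result reads 5' -> 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(pattern))


print(reverse_complement("AAAACCCGGT"))  # ACCGGGTTTT
```

Applying the function twice gives back the original sequence, which is a handy sanity check.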
|
|
||||||
***** Exercise: Find a subsequence within a sequence
|
***** Exercise: Find a subsequence within a sequence
|
||||||
|
|
||||||
We're going to find the occurrences of a subsequence inside a sequence, and save the index of the first letter in the sequence.
|
We're going to find the occurrences of a subsequence inside a sequence,
|
||||||
This time, we'll use:
|
and save the index of the first letter in the sequence.
|
||||||
|
|
||||||
#+BEGIN_SRC python
|
#+BEGIN_SRC python
|
||||||
def PatternMatching(Pattern, Genome):
|
def PatternMatching(Pattern, Genome):
|
||||||
@ -91,15 +92,15 @@ def PatternMatching(Pattern, Genome):
|
|||||||
    return positions
|
    return positions
|
||||||
#+END_SRC
|
#+END_SRC
|
||||||
|
|
||||||
After using our function on the /Vibrio cholerae/ genome, we find out that the /9-mers/ with the highest frequency appear in clusters.
|
We find out that the /9-mers/ with the highest frequency appear in clusters.
|
||||||
This is strong statistical evidence that our subsequences are /DnaA boxes/.
|
There is strong statistical evidence that our subsequences are /DnaA boxes/.
|
||||||
|
|
||||||
|
|
||||||
**** Computational approaches to find ori in any bacteria
|
**** Computational approaches to find ori in any bacteria
|
||||||
|
|
||||||
Now that we're pretty confident about the /DnaA boxes/ sequences that we found, we are going to check if they are a common pattern in other bacteria.
|
Now that we're pretty confident about the /DnaA boxes/ sequences that we found,
|
||||||
We're going to find the occurrences of the sequences in /Thermotoga petrophila/
|
we are going to check if they are a common pattern in other bacteria.
|
||||||
with:
|
We're going to find the occurrences of the sequences in /Thermotoga petrophila/:
|
||||||
|
|
||||||
#+BEGIN_SRC python
|
#+BEGIN_SRC python
|
||||||
def PatternCount(Text, Pattern):
|
def PatternCount(Text, Pattern):
|
||||||
@ -110,31 +111,37 @@ def PatternCount(Text, Pattern):
|
|||||||
    return count
|
    return count
|
||||||
#+END_SRC
|
#+END_SRC
|
||||||
|
|
||||||
After the execution, we observe that there are *no* occurrences of the sequences found in /Vibrio cholerae/.
|
We observe that there are *no* occurrences of the sequences found in
|
||||||
We can conclude that different bacteria have different /DnaA boxes/.
|
/Vibrio cholerae/. We can conclude that different bacteria have
|
||||||
|
different /DnaA boxes/.
|
||||||
|
|
||||||
We have to try another computational approach then, find clusters of /k-mers/ repeated in a small interval.
|
We have to try another computational approach:
|
||||||
|
finding clusters of /k-mers/ repeated in a small interval.
|
||||||
|
|
||||||
** Week 2
|
** Week 2
|
||||||
|
|
||||||
*** DNA replication (II)
|
*** DNA replication (II)
|
||||||
|
|
||||||
**** Replication process
|
**** Replication process
|
||||||
|
|
||||||
The /DNA polymerases/ start replicating while the parent strands are unraveling.
|
The /DNA polymerases/ start replicating while the parent strands are unraveling.
|
||||||
On the lagging strand, the DNA polymerase waits until the replication fork opens around 2000 nucleotides, and because of that it forms Okazaki fragments.
|
On the lagging strand, the DNA polymerase waits until the replication fork
|
||||||
We need 1 primer for the leading strand and 1 primer per Okazaki fragment for the lagging strand.
|
opens around 2000 nucleotides, and because of that it forms Okazaki fragments.
|
||||||
While the Okazaki fragments are being synthesized, a /DNA ligase/ starts joining the fragments together.
|
We need 1 primer for the leading strand and 1 primer per Okazaki fragment
|
||||||
|
for the lagging strand. While the Okazaki fragments are being synthesized,
|
||||||
|
a /DNA ligase/ starts joining the fragments together.
|
||||||
|
|
||||||
**** Computational approach to find ori using deamination
|
**** Computational approach to find ori using deamination
|
||||||
|
|
||||||
As the lagging strand is always waiting for the helicase to go forward, the lagging strand is mostly in single-stranded configuration, which is more prone to mutations.
|
As the lagging strand is always waiting for the helicase to go forward, the
|
||||||
One frequent form of mutation is *deamination*, a process that causes cytosine to convert into thymine. This means that cytosine is more frequent in half of the genome.
|
lagging strand is mostly in single-stranded configuration,
|
||||||
|
which is more prone to mutations. One frequent form of mutation is
|
||||||
|
*deamination*, a process that causes cytosine to convert into thymine.
|
||||||
|
This means that cytosine is more frequent in half of the genome.
|
||||||
|
|
||||||
***** Exercise: count the occurrences of cytosine
|
***** Exercise: count the occurrences of cytosine
|
||||||
|
|
||||||
We're going to count the occurrences of the bases in a genome and include them in
|
We're going to count the occurrences of the bases in a genome and include them in
|
||||||
a symbol array, for that purpose we'll use:
|
a symbol array.
|
||||||
|
|
||||||
#+BEGIN_SRC python
|
#+BEGIN_SRC python
|
||||||
def SymbolArray(Genome, symbol):
|
def SymbolArray(Genome, symbol):
|
||||||
@ -159,7 +166,6 @@ After executing the program, we realize that the algorithm is too inefficient.
|
|||||||
***** Exercise: find a better algorithm for the previous exercise
|
***** Exercise: find a better algorithm for the previous exercise
|
||||||
|
|
||||||
This time, we are going to evaluate an element /i+1/, using the element /i/.
|
This time, we are going to evaluate an element /i+1/, using the element /i/.
|
||||||
We'll use the following algorithm:
|
|
||||||
|
|
||||||
#+BEGIN_SRC python
|
#+BEGIN_SRC python
|
||||||
def FasterSymbolArray(Genome, symbol):
|
def FasterSymbolArray(Genome, symbol):
|
||||||
@ -184,18 +190,20 @@ def PatternCount(Text, Pattern):
|
|||||||
    return count
|
    return count
|
||||||
#+END_SRC
|
#+END_SRC
|
||||||
|
|
||||||
After executing the program we see that it's a viable algorithm, with a complexity of /O(n)/ instead of the previous /O(n²)/.
|
It's a viable algorithm, with a complexity of /O(n)/ instead of the previous /O(n²)/.
|
||||||
In /Escherichia coli/ we plotted the result of our program:
|
In /Escherichia coli/ we plotted the result of our program:
|
||||||
|
|
||||||
#+CAPTION: Symbol array for Cytosine in E. coli Genome
|
#+CAPTION: Symbol array for Cytosine in E. coli Genome
|
||||||
[[./Assets/e-coli.png]]
|
[[./Assets/e-coli.png]]
|
||||||
|
|
||||||
From that graph, we conclude that ori is located around position 4000000, because that's where the Cytosine concentration is the lowest,
|
We can conclude that ori is located around position 4000000,
|
||||||
|
because that's where the Cytosine concentration is the lowest,
|
||||||
which indicates that the region stays single-stranded for the longest time.
|
which indicates that the region stays single-stranded for the longest time.
|
||||||
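The idea of deriving element /i+1/ from element /i/ can be sketched as follows. This is a minimal version under a stated assumption: the symbol array stores, for each start position, how many times the symbol appears in a window of half the genome length, with the genome treated as circular. The name `faster_symbol_array` is illustrative and the notebook's own implementation is elided in this diff.

```python
def faster_symbol_array(genome, symbol):
    """For each start position, count symbol in a half-genome window.

    Assumption: the genome is circular, so the window may wrap around.
    """
    n = len(genome)
    half = n // 2
    if n == 0:
        return []
    # Append the first half so wrapped windows are plain slices.
    extended = genome + genome[:half]
    array = [0] * n
    # Only the first window is counted from scratch.
    array[0] = extended[:half].count(symbol)
    for i in range(1, n):
        array[i] = array[i - 1]
        # The base leaving the window on the left.
        if extended[i - 1] == symbol:
            array[i] -= 1
        # The base entering the window on the right.
        if extended[i + half - 1] == symbol:
            array[i] += 1
    return array


print(faster_symbol_array("AAAAGGGG", "A"))  # [4, 3, 2, 1, 0, 1, 2, 3]
```

Each step only inspects the base that leaves the window and the base that enters it, which is where the /O(n)/ behaviour comes from.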
|
|
||||||
**** The Skew Diagram
|
**** The Skew Diagram
|
||||||
|
|
||||||
Usually scientists measure the difference between /G - C/, which is *higher on the lagging strand* and *lower on the leading strand*.
|
Usually scientists measure the difference between /G - C/, which is
|
||||||
|
*higher on the lagging strand* and *lower on the leading strand*.
|
||||||
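A running /G - C/ difference of this kind can be sketched as below. The names `skew_array` and `minimum_skew` are illustrative, and the notebook's own =SkewArray= body is elided in this diff. The sketch starts at 0 and adds 1 for every G and subtracts 1 for every C; positions where the skew is lowest are then candidates for the region where ori might be.

```python
def skew_array(genome):
    """Running G - C difference; entry i covers the first i bases."""
    skew = [0]
    for base in genome:
        if base == "G":
            step = 1
        elif base == "C":
            step = -1
        else:
            step = 0
        skew.append(skew[-1] + step)
    return skew


def minimum_skew(genome):
    """All positions where the skew reaches its minimum."""
    skew = skew_array(genome)
    lowest = min(skew)
    return [i for i, value in enumerate(skew) if value == lowest]


print(skew_array("CATGGGCATCGGCCATACGCC"))
print(minimum_skew("CATGGGCATCGGCCATACGCC"))  # [21]
```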
|
|
||||||
***** Exercise: Synthesize a Skew Array
|
***** Exercise: Synthesize a Skew Array
|
||||||
|
|
||||||
@ -252,17 +260,27 @@ def SkewArray(Genome):
|
|||||||
|
|
||||||
**** Finding /DnaA boxes/
|
**** Finding /DnaA boxes/
|
||||||
|
|
||||||
When we look for /DnaA boxes/ in the minimal skew region, we can't find highly repeated /9-mers/ in /Escherichia coli/.
|
When we look for /DnaA boxes/ in the minimal skew region,
|
||||||
But we find approximate sequences that are similar to our /9-mers/ and only differ in 1 nucleotide.
|
we can't find highly repeated /9-mers/ in /Escherichia coli/. But we find
|
||||||
|
approximate sequences that are similar to our /9-mers/ and only differ in 1 nucleotide.
|
||||||
|
|
||||||
***** Exercise: Calculate Hamming distance
|
***** Exercise: Calculate Hamming distance
|
||||||
|
|
||||||
The Hamming distance is the number of mismatches between 2 strings, we'll solve this problem in [[./Code/HammingDistance][HammingDistance]]
|
The Hamming distance is the number of positions at which 2 strings of equal length differ.
|
||||||
|
|
||||||
|
#+BEGIN_SRC python
|
||||||
|
def HammingDistance(p, q):
|
||||||
|
    count = 0
|
||||||
|
    for i in range(0, len(p)):
|
||||||
|
        if p[i] != q[i]:
|
||||||
|
            count += 1
|
||||||
|
    return count
|
||||||
|
#+END_SRC
|
||||||
|
|
||||||
***** Exercise: Find approximate patterns
|
***** Exercise: Find approximate patterns
|
||||||
|
|
||||||
Now that we have our Hamming distance, we have to find the approximate
|
Now that we have our Hamming distance, we use it to find
|
||||||
sequences:
|
the approximate sequences:
|
||||||
|
|
||||||
#+BEGIN_SRC python
|
#+BEGIN_SRC python
|
||||||
def ApproximatePatternMatching(Text, Pattern, d):
|
def ApproximatePatternMatching(Text, Pattern, d):
|
||||||
@ -283,7 +301,6 @@ def HammingDistance(p, q):
|
|||||||
    return count
|
    return count
|
||||||
#+END_SRC
|
#+END_SRC
|
||||||
|
|
||||||
|
|
||||||
***** Exercise: Count the approximate patterns
|
***** Exercise: Count the approximate patterns
|
||||||
|
|
||||||
The final part is counting the approximate sequences:
|
The final part is counting the approximate sequences:
|
||||||
@ -291,8 +308,11 @@ The final part is counting the approximate sequences:
|
|||||||
#+BEGIN_SRC python
|
#+BEGIN_SRC python
|
||||||
def ApproximatePatternCount(Pattern, Text, d):
|
def ApproximatePatternCount(Pattern, Text, d):
|
||||||
    count = 0
|
    count = 0
|
||||||
    for i in range(len(Text)-len(Pattern)+1):
|
    for i in range(len(Text) - len(Pattern) + 1):
|
||||||
        if Text[i:i+len(Pattern)] == Pattern or HammingDistance(Text[i:i+len(Pattern)], Pattern) <= d:
|
        if (
|
||||||
|
            Text[i : i + len(Pattern)] == Pattern
|
||||||
|
            or HammingDistance(Text[i : i + len(Pattern)], Pattern) <= d
|
||||||
|
        ):
|
||||||
            count += 1
|
            count += 1
|
||||||
    return count
|
    return count
|
||||||
|
|
||||||
@ -305,8 +325,8 @@ def HammingDistance(p, q):
|
|||||||
    return count
|
    return count
|
||||||
#+END_SRC
|
#+END_SRC
|
||||||
|
|
||||||
|
After trying out our ApproximatePatternCount in the hypothesized ori region,
|
||||||
After trying out our ApproximatePatternCount in the hypothesized ori region, we find a frequent /k-mer/ with its reverse complement in /Escherichia coli/.
|
we find a frequent /k-mer/ with its reverse complement in /Escherichia coli/.
|
||||||
We've finally found a computational method to find ori that seems correct.
|
We've finally found a computational method to find ori that seems correct.
|
||||||
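A small worked example may help to see what counting approximate occurrences means in practice. The sketch below is a compact, self-contained variant; the names `hamming_distance` and `approximate_pattern_count` and the toy strings are illustrative and are not taken from the notebook's data.

```python
def hamming_distance(p, q):
    """Number of positions where two equal-length strings differ."""
    return sum(a != b for a, b in zip(p, q))


def approximate_pattern_count(pattern, text, d):
    """Count windows of text within Hamming distance d of pattern."""
    k = len(pattern)
    return sum(
        hamming_distance(text[i:i + k], pattern) <= d
        for i in range(len(text) - k + 1)
    )


# Windows TAGA, GAGC, CAGA and GAGG are within 2 mismatches of GAGG.
print(approximate_pattern_count("GAGG", "TTTAGAGCCTTCAGAGG", 2))  # 4
```

An exact match has distance 0, so for any /d/ >= 0 it is already covered by the distance test and needs no separate comparison.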
|
|
||||||
** Week 3
|
** Week 3
|
||||||
@ -319,11 +339,11 @@ Variation in gene expression permits the cell to keep track of time.
|
|||||||
|
|
||||||
***** Exercise: Find the most common nucleotides in each position
|
***** Exercise: Find the most common nucleotides in each position
|
||||||
|
|
||||||
We are going to create a *t x k* Motif Matrix, where *t* is the number of /k-mers/. In each position, we'll insert the most frequent nucleotide, in upper case,
|
We are going to create a *t x k* Motif Matrix, where *t* is the number of /k-mers/.
|
||||||
|
In each position, we'll insert the most frequent nucleotide, in upper case,
|
||||||
and the nucleotide in lower case (if there's no popular one).
|
and the nucleotide in lower case (if there's no popular one).
|
||||||
Our goal is to select the *most* conserved Matrix, i.e. the Matrix with the most upper case letters.
|
Our goal is to select the *most* conserved Matrix, i.e. the Matrix with the most
|
||||||
We'll use a *4 x k* Count Matrix, one row for each base. We'll first generate
|
upper case letters. We'll use a *4 x k* Count Matrix, one row for each base.
|
||||||
the Matrix:
|
|
||||||
|
|
||||||
#+BEGIN_SRC python
|
#+BEGIN_SRC python
|
||||||
def Count(Motifs):
|
def Count(Motifs):
|
||||||
@ -390,7 +410,7 @@ def Count(Motifs):
|
|||||||
    return count
|
    return count
|
||||||
#+END_SRC
|
#+END_SRC
|
||||||
|
|
||||||
After obtaining the Consensus string, all we need to do is obtains the total
|
After obtaining the Consensus string, all we need to do is obtain the total
|
||||||
score of our selected /k-mers/:
|
score of our selected /k-mers/:
|
||||||
|
|
||||||
#+BEGIN_SRC python
|
#+BEGIN_SRC python
|
||||||
@ -438,8 +458,8 @@ def HammingDistance(p, q):
|
|||||||
***** Exercise: Find a set of /k-mers/ that minimize the score
|
***** Exercise: Find a set of /k-mers/ that minimize the score
|
||||||
|
|
||||||
|
|
||||||
Applying a brute force approach for this task is not viable, we'll use a Greedy Algorithm. For that, we'll first determine the probability
|
Applying a brute force approach for this task is not viable, so we'll use a Greedy
|
||||||
of a sequence, we'll use:
|
Algorithm. We first have to determine the probability of a sequence:
|
||||||
|
|
||||||
#+BEGIN_SRC python
|
#+BEGIN_SRC python
|
||||||
def Pr(Text, Profile):
|
def Pr(Text, Profile):
|
||||||
@ -563,15 +583,16 @@ def Pr(Text, Profile):
|
|||||||
|
|
||||||
***** Motifs in tuberculosis
|
***** Motifs in tuberculosis
|
||||||
|
|
||||||
Tuberculosis is an infectious disease, cause by a bacteria called /Mycobacterium
|
Tuberculosis is an infectious disease, caused by a bacterium called /Mycobacterium
|
||||||
tuberculosis/. The bacterium can stay latent in the host for decades, in hypoxic
|
tuberculosis/. The bacterium can stay latent in the host for decades, in hypoxic
|
||||||
environments.
|
environments.
|
||||||
Our Greedy Algorithm can help us identify a motif that might be involved in the process.
|
Our Greedy Algorithm can help us identify a motif that might be involved
|
||||||
|
in the process.
|
||||||
|
|
||||||
The transcription factor behind this behaviour is *DosR*. We'll identify the
|
The transcription factor behind this behaviour is *DosR*. We'll identify the
|
||||||
binding sites:
|
binding sites:
|
||||||
|
|
||||||
#+BEGIN_SRC python :results output
|
#+BEGIN_SRC python
|
||||||
def GreedyMotifSearch(Dna, k, t):
|
def GreedyMotifSearch(Dna, k, t):
|
||||||
    BestMotifs = []
|
    BestMotifs = []
|
||||||
    for i in range(0, t):
|
    for i in range(0, t):
|
||||||
@ -686,7 +707,6 @@ print(Score(Motifs))
|
|||||||
Our algorithm is pretty fast, but it's not optimal, and that's just a
|
Our algorithm is pretty fast, but it's not optimal, and that's just a
|
||||||
characteristic of Greedy Algorithms: they trade optimality for speed.
|
characteristic of Greedy Algorithms: they trade optimality for speed.
|
||||||
|
|
||||||
|
|
||||||
** Vocabulary
|
** Vocabulary
|
||||||
- k-mer: subsequence of length /k/ in a biological sequence
|
- k-mer: subsequence of length /k/ in a biological sequence
|
||||||
- Frequency map: sequence --> frequency of the sequence
|
- Frequency map: sequence --> frequency of the sequence
|
||||||
|