Refactor after proof reading

coolneng 2020-01-26 18:20:25 +01:00
parent 08e92db1eb
commit 98cc6ebdb9
Signed by: coolneng
GPG Key ID: 9893DA236405AF57


@ -1,19 +1,19 @@
* Biology Meets Programming: Bioinformatics for Beginners * Biology Meets Programming: Bioinformatics for Beginners
** Week 1 ** Week 1
*** DNA replication *** DNA replication
**** Origin of replication (ori) **** Origin of replication (ori)
Locating an ori is key for gene therapy (e.g. viral vectors), to introduce a therapeutic gene. Locating an ori is key for gene therapy (e.g. viral vectors),
to introduce a therapeutic gene.
**** Computational approaches to find ori in Vibrio Cholerae **** Computational approaches to find ori in Vibrio Cholerae
***** Exercise: find Pattern ***** Exercise: find Pattern
We'll look for the *DnaA box* sequence, using a sliding window, in that case we will use this function to find out how many times We'll look for the *DnaA box* sequence, using a sliding window, in that case
does a sequence appear in the genome: we will use this function to find out how many times
a sequence appears in the genome:
#+BEGIN_SRC python #+BEGIN_SRC python
def PatternCount(Text, Pattern): def PatternCount(Text, Pattern):
@ -25,7 +25,7 @@ def PatternCount(Text, Pattern):
#+END_SRC #+END_SRC
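The diff elides the body of the counting function, so here is a minimal sketch of the sliding-window idea described above. The name =pattern_count= and its argument names are illustrative assumptions, not necessarily what the repository uses.

```python
def pattern_count(text, pattern):
    """Count how many times pattern occurs in text, overlaps included."""
    k = len(pattern)
    count = 0
    # Slide a window of length k over every valid start position.
    for i in range(len(text) - k + 1):
        if text[i:i + k] == pattern:
            count += 1
    return count
```

Overlapping matches are counted separately, which is why =str.count= is not used here: it skips overlapping occurrences.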
For the second part, we're going to calculate the frequency map of the sequences For the second part, we're going to calculate the frequency map of the sequences
of length /k/, for that purpose we'll use: of length /k/:
#+BEGIN_SRC python #+BEGIN_SRC python
def FrequentWords(Text, k): def FrequentWords(Text, k):
@ -52,8 +52,8 @@ def FrequencyMap(Text, k):
***** Exercise: Find the reverse complement of a sequence ***** Exercise: Find the reverse complement of a sequence
We're going to generate the reverse complement of a sequence, which is the complement of a sequence, read in the same direction (5' -> 3'). We're going to generate the reverse complement of a sequence,
In this case, we're going to use: which is the complement of a sequence, read in the same direction (5' -> 3').
#+BEGIN_SRC python #+BEGIN_SRC python
def ReverseComplement(Pattern): def ReverseComplement(Pattern):
@ -75,12 +75,13 @@ def Complement(Pattern):
return compl return compl
#+END_SRC #+END_SRC
After using our function on the /Vibrio Cholerae's/ genome, we realize that some of the frequent /k-mers/ are reverse complements of other frequent ones. After using our function on the /Vibrio Cholerae's/ genome, we realize that some
of the frequent /k-mers/ are reverse complements of other frequent ones.
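To check this observation by hand, one could combine a frequency map with a reverse complement helper. The sketch below is an assumption about how that might look; the function names and the =COMPLEMENT= table are illustrative and only handle upper case A, C, G and T.

```python
# Illustrative lookup table (assumes upper case DNA letters only).
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(pattern):
    """Complement each base and read the result in the 5' -> 3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(pattern))


def frequent_words(text, k):
    """Return the most frequent k-mers of text, sorted alphabetically."""
    freq = {}
    for i in range(len(text) - k + 1):
        kmer = text[i:i + k]
        freq[kmer] = freq.get(kmer, 0) + 1
    if not freq:
        return []
    top = max(freq.values())
    return sorted(kmer for kmer, n in freq.items() if n == top)


def reverse_complement_pairs(text, k):
    """Frequent k-mers whose reverse complement is also frequent."""
    words = set(frequent_words(text, k))
    return sorted(w for w in words if reverse_complement(w) in words)
```

For instance, =reverse_complement("ATGATCAAG")= gives ="CTTGATCAT"=.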
***** Exercise: Find a subsequence within a sequence ***** Exercise: Find a subsequence within a sequence
We're going to find the occurrences of a subsequence inside a sequence, and save the index of the first letter in the sequence. We're going to find the occurrences of a subsequence inside a sequence,
This time, we'll use: and save the index of the first letter in the sequence.
#+BEGIN_SRC python #+BEGIN_SRC python
def PatternMatching(Pattern, Genome): def PatternMatching(Pattern, Genome):
@ -91,15 +92,15 @@ def PatternMatching(Pattern, Genome):
return positions return positions
#+END_SRC #+END_SRC
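Since the function body is not visible in this diff, the following is a small sketch of the position search described above, under the assumption that positions are 0-based. The name =pattern_positions= is illustrative.

```python
def pattern_positions(pattern, genome):
    """Return the 0-based start index of every occurrence of pattern."""
    k = len(pattern)
    return [
        i
        for i in range(len(genome) - k + 1)
        if genome[i:i + k] == pattern
    ]
```

Calling =pattern_positions("ATAT", "GATATATGCATATACTT")= returns =[1, 3, 9]=; positions that sit close together are what the notes refer to as a cluster.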
After using our function on the /Vibrio Cholerae's/ genome, we find out that the /9-mers/ with the highest frequency appear in cluster. We find out that the /9-mers/ with the highest frequency appear in cluster.
This is strong statistical evidence that our subsequences are /DnaA boxes/. There is strong statistical evidence that our subsequences are /DnaA boxes/.
**** Computational approaches to find ori in any bacteria **** Computational approaches to find ori in any bacteria
Now that we're pretty confident about the /DnaA boxes/ sequences that we found, we are going to check if they are a common pattern in the rest of bacteria. Now that we're pretty confident about the /DnaA boxes/ sequences that we found,
We're going to find the occurrences of the sequences in /Thermotoga petrophila/ we are going to check if they are a common pattern in the rest of bacteria.
with: We're going to find the occurrences of the sequences in /Thermotoga petrophila/:
#+BEGIN_SRC python #+BEGIN_SRC python
def PatternCount(Text, Pattern): def PatternCount(Text, Pattern):
@ -110,31 +111,37 @@ def PatternCount(Text, Pattern):
return count return count
#+END_SRC #+END_SRC
After the execution, we observe that there are *no* occurrences of the sequences found in /Vibrio Cholerae/. We observe that there are *no* occurrences of the sequences found in
We can conclude that different bacteria have different /DnaA boxes/. /Vibrio Cholerae/. We can conclude that different bacteria have
different /DnaA boxes/.
We have to try another computational approach then, find clusters of /k-mers/ repeated in a small interval. We have to try another computational approach,
find clusters of /k-mers/ repeated in a small interval.
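The notes do not include code for this cluster search, so the sketch below is only one possible reading of it: report every k-mer that appears at least =t= times inside some window of a given length. The function name and the parameters =window= and =t= are assumptions made for illustration, and the loop is deliberately naive (it recounts each window from scratch).

```python
def find_clumps(genome, k, window, t):
    """Return k-mers that occur at least t times within some window."""
    clumps = set()
    for start in range(len(genome) - window + 1):
        region = genome[start:start + window]
        counts = {}
        # Count every k-mer inside the current window.
        for i in range(window - k + 1):
            kmer = region[i:i + k]
            counts[kmer] = counts.get(kmer, 0) + 1
        clumps.update(kmer for kmer, n in counts.items() if n >= t)
    return sorted(clumps)
```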
** Week 2 ** Week 2
*** DNA replication (II) *** DNA replication (II)
**** Replication process **** Replication process
The /DNA polymerases/ start replicating while the parent strands are unraveling. The /DNA polymerases/ start replicating while the parent strands are unraveling.
On the lagging strand, the DNA polymerase waits until the replication fork opens around 2000 nucleotides, and because of that it forms Okazaki fragments. On the lagging strand, the DNA polymerase waits until the replication fork
We need 1 primer for the leading strand and 1 primer per Okazaki fragment for the lagging strand. opens around 2000 nucleotides, and because of that it forms Okazaki fragments.
While the Okazaki fragments are being synthesized, a /DNA ligase/ starts joining the fragments together. We need 1 primer for the leading strand and 1 primer per Okazaki fragment
for the lagging strand. While the Okazaki fragments are being synthesized,
a /DNA ligase/ starts joining the fragments together.
**** Computational approach to find ori using deamination **** Computational approach to find ori using deamination
As the lagging strand is always waiting for the helicase to go forward, the lagging strand is mostly in single-stranded configuration, which is more prone to mutations. As the lagging strand is always waiting for the helicase to go forward, the
One frequent form of mutation is *deamination*, a process that causes cytosine to convert into thymine. This means that cytosine is more frequent in half of the genome. lagging strand is mostly in single-stranded configuration,
which is more prone to mutations. One frequent form of mutation is
*deamination*, a process that causes cytosine to convert into thymine.
This means that cytosine is more frequent in half of the genome.
***** Exercise: count the occurrences of cytosine ***** Exercise: count the occurrences of cytosine
We're going to count the occurrences of the bases in a genome and include them in We're going to count the occurrences of the bases in a genome and include them in
a symbol array, for that purpose we'll use: a symbol array.
#+BEGIN_SRC python #+BEGIN_SRC python
def SymbolArray(Genome, symbol): def SymbolArray(Genome, symbol):
@ -159,7 +166,6 @@ After executing the program, we realize that the algorithm is too inefficient.
***** Exercise: find a better algorithm for the previous exercise ***** Exercise: find a better algorithm for the previous exercise
This time, we are going to evaluate an element /i+1/, using the element /i/. This time, we are going to evaluate an element /i+1/, using the element /i/.
We'll use the following algorithm:
#+BEGIN_SRC python #+BEGIN_SRC python
def FasterSymbolArray(Genome, symbol): def FasterSymbolArray(Genome, symbol):
@ -184,18 +190,20 @@ def PatternCount(Text, Pattern):
return count return count
#+END_SRC #+END_SRC
After executing the program we see that it's a viable algorithm, with a complexity of /O(n)/ instead of the previous /O(n²)/. It's a viable algorithm, with a complexity of /O(n)/ instead of the previous /O(n²)/.
In /Escherichia Coli/ we plotted the result of our program: In /Escherichia Coli/ we plotted the result of our program:
#+CAPTION: Symbol array for Cytosine in E. Coli Genome] #+CAPTION: Symbol array for Cytosine in E. Coli Genome]
[[./Assets/e-coli.png]] [[./Assets/e-coli.png]]
From that graph, we conclude that ori is located around position 4000000, because that's where the Cytosine concentration is the lowest, We can conclude that ori is located around position 4000000,
because that's where the Cytosine concentration is the lowest,
which indicates that the region stays single-stranded for the longest time. which indicates that the region stays single-stranded for the longest time.
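The body of the faster algorithm is elided in this diff. A minimal sketch of the "element /i+1/ from element /i/" idea follows, assuming a circular genome and a window of half the genome length; the name =faster_symbol_array= and the list return type are illustrative choices.

```python
def faster_symbol_array(genome, symbol):
    """Count symbol in each half-genome window, in O(n) total time."""
    n = len(genome)
    half = n // 2
    # Treat the genome as circular by appending its first half.
    extended = genome + genome[:half]
    counts = [extended[:half].count(symbol)]
    for i in range(1, n):
        current = counts[-1]
        # The base that just left the window.
        if extended[i - 1] == symbol:
            current -= 1
        # The base that just entered the window.
        if extended[i + half - 1] == symbol:
            current += 1
        counts.append(current)
    return counts
```

Only the first window is counted in full; every later entry costs two comparisons, which is where the linear running time comes from.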
**** The Skew Diagram **** The Skew Diagram
Usually scientists measure the difference between /G - C/, which is *higher on the lagging strand* and *lower on the leading strand*. Usually scientists measure the difference between /G - C/, which is
*higher on the lagging strand* and *lower on the leading strand*.
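As a rough illustration of the /G - C/ difference, the sketch below keeps a running total that goes up by one on G and down by one on C. The helper names are assumptions, and =minimum_skew_positions= is only a convenience for locating the minimal skew region mentioned later in these notes.

```python
def skew_array(genome):
    """Running G - C difference; entry i covers the first i bases."""
    skew = [0]
    for base in genome:
        if base == "G":
            step = 1
        elif base == "C":
            step = -1
        else:
            step = 0
        skew.append(skew[-1] + step)
    return skew


def minimum_skew_positions(genome):
    """Indices at which the skew reaches its lowest value."""
    skew = skew_array(genome)
    lowest = min(skew)
    return [i for i, value in enumerate(skew) if value == lowest]
```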
***** Exercise: Synthesize a Skew Array ***** Exercise: Synthesize a Skew Array
@ -252,17 +260,27 @@ def SkewArray(Genome):
**** Finding /DnaA boxes/ **** Finding /DnaA boxes/
When we look for /DnaA boxes/ in the minimal skew region, we can't find highly repeated /9-mers/ in /Escherichia Coli/. When we look for /DnaA boxes/ in the minimal skew region,
But we find approximate sequences that are similar to our /9-mers/ and only differ in 1 nucleotide. we can't find highly repeated /9-mers/ in /Escherichia Coli/. But we found
approximate sequences that are similar to our /9-mers/ and only differ in 1 nucleotide.
***** Exercise: Calculate Hamming distance ***** Exercise: Calculate Hamming distance
The Hamming distance is the number of mismatches between 2 strings, we'll solve this problem in [[./Code/HammingDistance][HammingDistance]] The Hamming distance is the number of mismatches between 2 strings.
#+BEGIN_SRC python
def HammingDistance(p, q):
    count = 0
    for i in range(0, len(p)):
        if p[i] != q[i]:
            count += 1
    return count
#+END_SRC
***** Exercise: Find approximate patterns ***** Exercise: Find approximate patterns
Now that we have our Hamming distance, we have to find the approximate Now that we have our Hamming distance, we use it to find
sequences: the approximate sequences:
#+BEGIN_SRC python #+BEGIN_SRC python
def ApproximatePatternMatching(Text, Pattern, d): def ApproximatePatternMatching(Text, Pattern, d):
@ -283,7 +301,6 @@ def HammingDistance(p, q):
return count return count
#+END_SRC #+END_SRC
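Because the matching function is cut off by the hunk boundary, here is a compact sketch of approximate matching built on the Hamming distance. The snake_case names are illustrative, and the code assumes every compared window has the same length as the pattern.

```python
def hamming_distance(p, q):
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for a, b in zip(p, q) if a != b)


def approximate_positions(text, pattern, d):
    """Start indices of windows within Hamming distance d of pattern."""
    k = len(pattern)
    return [
        i
        for i in range(len(text) - k + 1)
        if hamming_distance(text[i:i + k], pattern) <= d
    ]


def approximate_count(text, pattern, d):
    """How many windows of text match pattern with at most d mismatches."""
    return len(approximate_positions(text, pattern, d))
```

An exact match has distance 0, so for any =d >= 0= a separate equality check is not needed.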
***** Exercise: Count the approximate patterns ***** Exercise: Count the approximate patterns
The final part is counting the approximate sequences: The final part is counting the approximate sequences:
@ -291,8 +308,11 @@ The final part is counting the approximate sequences:
#+BEGIN_SRC python #+BEGIN_SRC python
def ApproximatePatternCount(Pattern, Text, d): def ApproximatePatternCount(Pattern, Text, d):
count = 0 count = 0
for i in range(len(Text)-len(Pattern)+1): for i in range(len(Text) - len(Pattern) + 1):
if Text[i:i+len(Pattern)] == Pattern or HammingDistance(Text[i:i+len(Pattern)], Pattern) <= d: if (
Text[i : i + len(Pattern)] == Pattern
or HammingDistance(Text[i : i + len(Pattern)], Pattern) <= d
):
count += 1 count += 1
return count return count
@ -305,8 +325,8 @@ def HammingDistance(p, q):
return count return count
#+END_SRC #+END_SRC
After trying out our ApproximatePatternCount in the hypothesized ori region,
After trying out our ApproximatePatternCount in the hypothesized ori region, we find a frequent /k-mer/ with its reverse complement in /Escherichia Coli/. we find a frequent /k-mer/ with its reverse complement in /Escherichia Coli/.
We've finally found a computational method to find ori that seems correct. We've finally found a computational method to find ori that seems correct.
** Week 3 ** Week 3
@ -319,11 +339,11 @@ Variation in gene expression permits the cell to keep track of time.
***** Exercise: Find the most common nucleotides in each position ***** Exercise: Find the most common nucleotides in each position
We are going to create a *t x k* Motif Matrix, where *t* is the /k-mer/ string. In each position, we'll insert the most frequent nucleotide, in upper case, We are going to create a *t x k* Motif Matrix, where *t* is the /k-mer/ string.
In each position, we'll insert the most frequent nucleotide, in upper case,
and the nucleotide in lower case (if there's no popular one). and the nucleotide in lower case (if there's no popular one).
Our goal is to select the *most* conserved Matrix, i.e. the Matrix with the most upper case letters. Our goal is to select the *most* conserved Matrix, i.e. the Matrix with the most
We'll use a *4 x k* Count Matrix, one row for each base. We'll first generate upper case letters. We'll use a *4 x k* Count Matrix, one row for each base.
the Matrix:
#+BEGIN_SRC python #+BEGIN_SRC python
def Count(Motifs): def Count(Motifs):
@ -390,7 +410,7 @@ def Count(Motifs):
return count return count
#+END_SRC #+END_SRC
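The count matrix, consensus string and score fit together roughly as in the sketch below. It is an illustration under stated assumptions: the motifs are equal-length upper case strings, ties in a column go to the first base in the order A, C, G, T, and the function names are not taken from the repository.

```python
def count_matrix(motifs):
    """4 x k matrix: how often each base appears in each column."""
    k = len(motifs[0])
    counts = {base: [0] * k for base in "ACGT"}
    for motif in motifs:
        for j, base in enumerate(motif):
            counts[base][j] += 1
    return counts


def consensus(motifs):
    """Most frequent base per column (ties go to the earlier of ACGT)."""
    counts = count_matrix(motifs)
    k = len(motifs[0])
    return "".join(
        max("ACGT", key=lambda base: counts[base][j]) for j in range(k)
    )


def score(motifs):
    """Total number of mismatches between the motifs and their consensus."""
    cons = consensus(motifs)
    return sum(
        1 for motif in motifs for a, b in zip(motif, cons) if a != b
    )
```

A lower score means a more conserved set of /k-mers/.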
After obtaining the Consensus string, all we need to do is obtains the total After obtaining the Consensus string, all we need to do is obtain the total
score of our selected /k-mers/: score of our selected /k-mers/:
#+BEGIN_SRC python #+BEGIN_SRC python
@ -438,8 +458,8 @@ def HammingDistance(p, q):
***** Exercise: Find a set of /k-mers/ that minimize the score ***** Exercise: Find a set of /k-mers/ that minimize the score
Applying a brute force approach for this task is not viable, we'll use a Greedy Algorithm. For that, we'll first determine the probability Applying a brute force approach for this task is not viable, we'll use a Greedy
of a sequence, we'll use: Algorithm. We first have to determine the probability of a sequence:
#+BEGIN_SRC python #+BEGIN_SRC python
def Pr(Text, Profile): def Pr(Text, Profile):
@ -563,15 +583,16 @@ def Pr(Text, Profile):
***** Motifs in tuberculosis ***** Motifs in tuberculosis
Tuberculosis is an infectious disease, cause by a bacteria called /Mycobacterium Tuberculosis is an infectious disease, caused by a bacteria called /Mycobacterium
tuberculosis/. The bacteria can stay latent in the host for decades, in hypoxic tuberculosis/. The bacteria can stay latent in the host for decades, in hypoxic
environments. environments.
Our Greedy Algorithm can help us identify a motif that might be involved in the process. Our Greedy Algorithm can help us identify a motif that might be involved
in the process.
The transcription factor behind this behaviour is *DosR*, we'll identify the The transcription factor behind this behaviour is *DosR*, we'll identify the
binding sites: binding sites:
#+BEGIN_SRC python :results output #+BEGIN_SRC python
def GreedyMotifSearch(Dna, k, t): def GreedyMotifSearch(Dna, k, t):
BestMotifs = [] BestMotifs = []
for i in range(0, t): for i in range(0, t):
@ -686,7 +707,6 @@ print(Score(Motifs))
Our algorithm is pretty fast, but it's not optimal, and that's just a Our algorithm is pretty fast, but it's not optimal, and that's just a
characteristic of Greedy Algorithms, they trade optimality for speed. characteristic of Greedy Algorithms, they trade optimality for speed.
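The greedy choice rests on the probability of a /k-mer/ under a profile. Since that function is cut off in this diff, the following is a hedged sketch: it assumes the profile is a dictionary mapping each base to a list of per-column probabilities, and both function names are illustrative.

```python
def profile_probability(text, profile):
    """Product of the per-column probabilities of each base in text."""
    probability = 1.0
    for j, base in enumerate(text):
        probability *= profile[base][j]
    return probability


def most_probable_kmer(text, k, profile):
    """First k-mer of text with the highest probability under profile."""
    best_kmer = ""
    best_probability = -1.0  # ensures the first k-mer is always accepted
    for i in range(len(text) - k + 1):
        kmer = text[i:i + k]
        probability = profile_probability(kmer, profile)
        if probability > best_probability:
            best_kmer = kmer
            best_probability = probability
    return best_kmer
```

Picking the locally most probable /k-mer/ at each step is what makes the search fast, and also why it can miss the globally best set of motifs.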
** Vocabulary ** Vocabulary
- k-mer: subsequences of length /k/ in a biological sequence - k-mer: subsequences of length /k/ in a biological sequence
- Frequency map: sequence --> frequency of the sequence - Frequency map: sequence --> frequency of the sequence