Refactor after proof reading

2020-01-26 18:20:25 +01:00
commit 98cc6ebdb9
parent 08e92db1eb
1 changed files with 72 additions and 52 deletions
--- a/Notebook.org
+++ b/Notebook.org
@@ -1,19 +1,19 @@
 * Biology Meets Programming: Bioinformatics for Beginners
-
 ** Week 1
-  
 *** DNA replication

 **** Origin of replication (ori)

-Locating an ori is key for gene therapy (e.g. viral vectors), to introduce a theraupetic gene.
+Locating an ori is key for gene therapy (e.g. viral vectors),
+to introduce a therapeutic gene.
     
 **** Computational approaches to find ori in Vibrio cholerae
     
 ***** Exercise: find Pattern
     
-We'll look for the *DnaA box* sequence, using a sliding window, in that case we will use this function to find out how many times
-does a sequence appear in the genome:
+We'll look for the *DnaA box* sequence using a sliding window.
+The following function counts how many times
+a sequence appears in the genome:

 #+BEGIN_SRC python
 def PatternCount(Text, Pattern):
@@ -25,7 +25,7 @@ def PatternCount(Text, Pattern):
 #+END_SRC
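
The diff hides the body of PatternCount, so here is a minimal sketch of the sliding-window count described above. The name =pattern_count= and its exact behavior (overlapping matches are counted) are assumptions for illustration, not the notebook's code.

```python
def pattern_count(text, pattern):
    """Count occurrences of pattern in text, overlapping ones included."""
    k = len(pattern)
    count = 0
    # Slide a window of length k over every valid start position.
    for i in range(len(text) - k + 1):
        if text[i:i + k] == pattern:
            count += 1
    return count


if __name__ == "__main__":
    # "GCG" starts at positions 0 and 2 of "GCGCG".
    print(pattern_count("GCGCG", "GCG"))
```

Counting overlaps matters here because repeated DnaA boxes may share nucleotides; =str.count= would skip overlapping matches.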

 For the second part, we're going to calculate the frequency map of the sequences
- of length /k/, for that purpose we'll use:
+ of length /k/:

 #+BEGIN_SRC python
 def FrequentWords(Text, k):
@@ -52,8 +52,8 @@ def FrequencyMap(Text, k):

 ***** Exercise: Find the reverse complement of a sequence
     
-We're going to generate the reverse complement of a sequence, which is the complement of a sequence, read in the same direction (5' -> 3').
-In this case, we're going to use:
+We're going to generate the reverse complement of a sequence,
+which is the complementary strand read in its own 5' -> 3' direction.

 #+BEGIN_SRC python
 def ReverseComplement(Pattern):
@@ -75,12 +75,13 @@ def Complement(Pattern):
    return compl
 #+END_SRC
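
As an illustration of the definition above, a compact sketch that complements each base and reverses the order in one pass. The names =COMPLEMENT= and =reverse_complement= are hypothetical, and the sketch assumes the input contains only upper-case A, C, G and T.

```python
# Watson-Crick pairing for upper-case DNA bases (assumed alphabet).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(pattern):
    """Return the complementary strand, written 5' -> 3'."""
    # Walk the pattern backwards and swap each base for its partner.
    return "".join(COMPLEMENT[base] for base in reversed(pattern))


if __name__ == "__main__":
    print(reverse_complement("AAAACCCGGT"))
```

Applying the function twice should give back the original sequence, which is a handy sanity check.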

-After using our function on the /Vibrio Cholerae's/ genome, we realize that some of the frequent /k-mers/ are reverse complements of other frequent ones.
+After using our function on the /Vibrio cholerae/ genome, we realize that some
+of the frequent /k-mers/ are reverse complements of other frequent ones.
     
 ***** Exercise: Find a subsequence within a sequence
      
-We're going to find the ocurrences of a subsquence inside a sequence, and save the index of the first letter in the sequence.
-This time, we'll use:
+We're going to find the occurrences of a subsequence inside a sequence,
+and save the index where each occurrence starts.

 #+BEGIN_SRC python
 def PatternMatching(Pattern, Genome):
@@ -91,15 +92,15 @@ def PatternMatching(Pattern, Genome):
    return positions
 #+END_SRC
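
A short sketch of the position search, under the assumption that positions are 0-based and overlapping matches are reported. =pattern_positions= is an illustrative name, not the notebook's function.

```python
def pattern_positions(pattern, genome):
    """Return the 0-based start index of every occurrence of pattern."""
    k = len(pattern)
    # Same sliding window as the counting exercise, but keep the indices.
    return [
        i
        for i in range(len(genome) - k + 1)
        if genome[i:i + k] == pattern
    ]


if __name__ == "__main__":
    print(pattern_positions("ATAT", "GATATATGCATATACTT"))
```

Keeping the indices instead of a count is what makes it possible to see whether the matches cluster in one region.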

-After using our function on the /Vibrio Cholerae's/ genome, we find out that the /9-mers/ with the highest frequency appear in cluster.
-This is strong statistical evidence that our subsequences are /DnaA boxes/.
+We find out that the /9-mers/ with the highest frequency appear in clusters.
+This is strong statistical evidence that our subsequences are /DnaA boxes/.

      
 **** Computational approaches to find ori in any bacteria
     
-Now that we're pretty confident about the /DnaA boxes/ sequences that we found, we are going to check if they are a common pattern in the rest of bacterias.
-We're going to find the ocurrences of the sequences in /Thermotoga petrophila/
-with:
+Now that we're pretty confident about the /DnaA box/ sequences that we found,
+we are going to check whether they are a common pattern in other bacteria.
+We're going to find the occurrences of the sequences in /Thermotoga petrophila/:

 #+BEGIN_SRC python
 def PatternCount(Text, Pattern):
@@ -110,31 +111,37 @@ def PatternCount(Text, Pattern):
    return count
 #+END_SRC

-After the execution, we observe that there are *no* ocurrences of the sequences found in /Vibrio Cholerae/.
-We can conclude that different bacterias have different /DnaA boxes/.
+We observe that there are *no* occurrences of the sequences found in
+/Vibrio cholerae/. We can conclude that different bacteria have
+different /DnaA boxes/.

-We have to try another computational approach then, find clusters of /k-mers/ repeated in a small interval.
+We have to try another computational approach:
+finding clusters of /k-mers/ repeated in a small interval.

 ** Week 2
-   
 *** DNA replication (II)

 **** Replication process

 The /DNA polymerases/ start replicating while the parent strands are unraveling.
-On the lagging strand, the DNA polymerase waits until the replication fork opens around 2000 nucleotides, and because of that it forms Okazaki fragments.
-We need 1 primer for the leading strand and 1 primer per Okazaki fragment for the lagging strand.
-While the Okazaki fragments are being synthetized, a /DNA ligase/ starts joining the fragments together.
+On the lagging strand, the DNA polymerase waits until the replication fork
+has opened around 2000 nucleotides, which is why it forms Okazaki fragments.
+We need 1 primer for the leading strand and 1 primer per Okazaki fragment
+for the lagging strand. While the Okazaki fragments are being synthesized,
+a /DNA ligase/ starts joining the fragments together.

 **** Computational approach to find ori using deamination
     
-As the lagging strand is always waiting for the helicase to go forward, the lagging strand is mostly in single-stranded configuration, which is more prone to mutations.
-One frequent form of mutation is *deamination*, a process that causes cytosine to convert into thymine. This means that cytosine is more frequent in half of the genome.
+As the lagging strand is always waiting for the helicase to go forward, the
+lagging strand is mostly in single-stranded configuration,
+which is more prone to mutations. One frequent form of mutation is
+*deamination*, a process that causes cytosine to convert into thymine.
+This means that cytosine is more frequent in half of the genome.

 ***** Exercise: count the occurrences of cytosine
      
 We're going to count the occurrences of the bases in a genome and include them in
-a symbol array, for that purpose we'll use:
+a symbol array.

 #+BEGIN_SRC python
 def SymbolArray(Genome, symbol):
@@ -159,7 +166,6 @@ After executing the program, we realize that the algorithm is too inefficient.
 ***** Exercise: find a better algorithm for the previous exercise

 This time, we are going to compute element /i+1/ of the symbol array from element /i/.
-We'll use the following algorithm:

 #+BEGIN_SRC python
 def FasterSymbolArray(Genome, symbol):
@@ -184,18 +190,20 @@ def PatternCount(Text, Pattern):
    return count
 #+END_SRC
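
The incremental idea can be sketched as follows. This is an assumption-laden illustration, not the notebook's FasterSymbolArray: it treats the genome as circular, uses a window of half the genome length, and returns a list instead of whatever structure the original uses. The name =faster_symbol_array= is hypothetical.

```python
def faster_symbol_array(genome, symbol):
    """Count symbol in each half-genome window of a circular genome in O(n)."""
    n = len(genome)
    if n == 0:
        return []
    half = n // 2
    # Append the first half so that windows can wrap around the end.
    extended = genome + genome[:half]
    counts = [0] * n
    # Only the first window is counted from scratch.
    counts[0] = extended[:half].count(symbol)
    for i in range(1, n):
        counts[i] = counts[i - 1]
        # The base that just left the window.
        if extended[i - 1] == symbol:
            counts[i] -= 1
        # The base that just entered the window.
        if extended[i + half - 1] == symbol:
            counts[i] += 1
    return counts


if __name__ == "__main__":
    print(faster_symbol_array("AAAAGGGG", "A"))
```

Each step touches only two bases, so the whole array costs one pass over the genome instead of one window scan per position.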

-After executing the program we see that it's a viable algorithm, with a complexity of /O(n)/ instead of the previous /O(n²)/.
+FasterSymbolArray is a viable algorithm, with a complexity of /O(n)/ instead of the previous /O(n²)/.
 In /Escherichia coli/ we plotted the result of our program:

 #+CAPTION: Symbol array for cytosine in the E. coli genome
 [[./Assets/e-coli.png]]

-From that graph, we conclude that ori is located around position 4000000, because that's where the Cytosine concentration is the lowest,
+We can conclude that ori is located around position 4000000,
+because that's where the Cytosine concentration is the lowest,
 which indicates that the region stays single-stranded for the longest time.
      
 **** The Skew Diagram

-Usually scientists measure the difference between /G - C/, which is *higher on the lagging strand* and *lower on the leading strand*.
+Usually scientists measure the difference /G - C/, which is
+*higher on the lagging strand* and *lower on the leading strand*.
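
A minimal sketch of how such a running /G - C/ difference could be computed, assuming the skew starts at 0 and that each G adds 1 and each C subtracts 1. The names =skew_array= and =minimum_skew_positions= are illustrative and are not taken from the notebook.

```python
def skew_array(genome):
    """Running G - C difference; entry i covers the first i bases."""
    skew = [0]
    for base in genome:
        if base == "G":
            skew.append(skew[-1] + 1)
        elif base == "C":
            skew.append(skew[-1] - 1)
        else:
            # A and T leave the skew unchanged.
            skew.append(skew[-1])
    return skew


def minimum_skew_positions(genome):
    """Return every index where the skew reaches its minimum."""
    skew = skew_array(genome)
    lowest = min(skew)
    return [i for i, value in enumerate(skew) if value == lowest]


if __name__ == "__main__":
    print(skew_array("GGCA"))
    print(minimum_skew_positions("CCGG"))
```

The second helper returns the candidate positions for the minimal skew region that the notebook searches for /DnaA boxes/ later on.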
     
 ***** Exercise: Synthesize a Skew Array
      
@@ -252,17 +260,27 @@ def SkewArray(Genome):

 **** Finding /DnaA boxes/
     
-When we look for /DnaA boxes/ in the minimal skew region, we can't find highly repeated /9-mers/ in /Escherichia Coli/.
-But we find approximate sequences that are similar to our /9-mers/ and only differ in 1 nucleotide.
+When we look for /DnaA boxes/ in the minimal skew region,
+we can't find highly repeated /9-mers/ in /Escherichia coli/. However, we do find
+approximate sequences that are similar to our /9-mers/ and differ in only 1 nucleotide.

 ***** Exercise: Calculate Hamming distance
      
-The Hamming distance is the number of mismatches between 2 strings, we'll solve this problem in [[./Code/HammingDistance][HammingDistance]]
+The Hamming distance is the number of mismatches between 2 strings of equal length.
+
+#+BEGIN_SRC python
+def HammingDistance(p, q):
+    count = 0
+    for i in range(0, len(p)):
+        if p[i] != q[i]:
+            count += 1
+    return count
+#+END_SRC

 ***** Exercise: Find approximate patterns
      
-Now that we have our Hamming distance, we have to find the approximate
-sequences:
+Now that we have our Hamming distance, we use it to find
+the approximate sequences:

 #+BEGIN_SRC python
 def ApproximatePatternMatching(Text, Pattern, d):
@@ -283,7 +301,6 @@ def HammingDistance(p, q):
    return count
 #+END_SRC
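
Since the diff hides the body of ApproximatePatternMatching, here is a hedged sketch of the idea: keep every window whose Hamming distance to the pattern is at most /d/. The names =hamming_distance= and =approximate_pattern_positions= are hypothetical, and positions are assumed to be 0-based.

```python
def hamming_distance(p, q):
    """Number of mismatched positions between two equal-length strings."""
    return sum(a != b for a, b in zip(p, q))


def approximate_pattern_positions(text, pattern, d):
    """Start indices of windows within Hamming distance d of pattern."""
    k = len(pattern)
    return [
        i
        for i in range(len(text) - k + 1)
        if hamming_distance(text[i:i + k], pattern) <= d
    ]


if __name__ == "__main__":
    print(approximate_pattern_positions("ACGTACGA", "ACGT", 1))
```

With =d = 0= this reduces to exact pattern matching, so the exact search is a special case of the approximate one.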

-
 ***** Exercise: Count the approximate patterns

 The final part is counting the approximate sequences:
@@ -291,8 +308,11 @@ The final part is counting the approximate sequences:
 #+BEGIN_SRC python
 def ApproximatePatternCount(Pattern, Text, d):
    count = 0
-    for i in range(len(Text)-len(Pattern)+1):
-        if Text[i:i+len(Pattern)] == Pattern or HammingDistance(Text[i:i+len(Pattern)], Pattern) <= d:
+    for i in range(len(Text) - len(Pattern) + 1):
+        if (
+            Text[i : i + len(Pattern)] == Pattern
+            or HammingDistance(Text[i : i + len(Pattern)], Pattern) <= d
+        ):
            count += 1
    return count

@@ -305,8 +325,8 @@ def HammingDistance(p, q):
    return count
 #+END_SRC

-
-After trying out our ApproximatePatternCount in the hypothesized ori region, we find a frequent /k-mer/ with its reverse complement in /Escherichia Coli/.
+After trying out our ApproximatePatternCount in the hypothesized ori region,
+we find a frequent /k-mer/ with its reverse complement in /Escherichia coli/.
 We've finally found a computational method to find ori that seems correct.

 ** Week 3
@@ -319,11 +339,11 @@ Variation in gene expression permits the cell to keep track of time.

 ***** Exercise: Find the most common nucleotides in each position

-We are going to create a *t x k* Motif Matrix, where *t* is the /k-mer/ string. In each position, we'll insert the most frequent nucleotide, in upper case,
+We are going to create a *t x k* Motif Matrix, where *t* is the number of /k-mer/ strings.
+In each column, we'll write the most frequent nucleotide in upper case
 and the remaining nucleotides in lower case.
-Our goal is to select the *most* conserved Matrix, i.e. the Matrix with the most upper case letters.
-We'll use a *4 x k* Count Matrix, one row for each base. We'll first generate
-the Matrix:
+Our goal is to select the *most* conserved Matrix, i.e. the Matrix with the most
+upper case letters. We'll use a *4 x k* Count Matrix, one row for each base.

 #+BEGIN_SRC python
 def Count(Motifs):
@@ -390,7 +410,7 @@ def Count(Motifs):
    return count
 #+END_SRC
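
For illustration, a small sketch of a *4 x k* count matrix and the consensus string derived from it. It assumes all motifs have the same length and use only A, C, G and T; ties are broken in the order A, C, G, T. The names =count_matrix= and =consensus= are hypothetical and are not the notebook's Count function.

```python
def count_matrix(motifs):
    """Map each base to its per-column counts across the motifs."""
    k = len(motifs[0])
    count = {base: [0] * k for base in "ACGT"}
    for motif in motifs:
        for j, base in enumerate(motif):
            count[base][j] += 1
    return count


def consensus(motifs):
    """Pick the most frequent base in every column."""
    count = count_matrix(motifs)
    k = len(motifs[0])
    # max() keeps the first base on ties, so the order "ACGT" decides.
    return "".join(
        max("ACGT", key=lambda base: count[base][j]) for j in range(k)
    )


if __name__ == "__main__":
    example = ["ACGT", "ACGA", "TCGT"]
    print(count_matrix(example)["A"])
    print(consensus(example))
```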

-After obtaining the Consensus string, all we need to do is obtains the total
+After obtaining the Consensus string, all we need to do is obtain the total
 score of our selected /k-mers/:

 #+BEGIN_SRC python
@@ -438,8 +458,8 @@ def HammingDistance(p, q):
 ***** Exercise: Find a set of /k-mers/ that minimize the score


-Applying a brute force approach for this task is not viable, we'll use a Greedy Algorithm. For that, we'll first determine the probability
-of a sequence, we'll use:
+Applying a brute force approach for this task is not viable, so we'll use a Greedy
+Algorithm. We first have to determine the probability of a sequence given a profile:
    
 #+BEGIN_SRC python
 def Pr(Text, Profile):
@@ -563,15 +583,16 @@ def Pr(Text, Profile):

 ***** Motifs in tuberculosis

-Tuberculosis is an infectious disease, cause by a bacteria called /Mycobacterium
+Tuberculosis is an infectious disease, caused by a bacterium called /Mycobacterium
 tuberculosis/. The bacterium can stay latent in the host for decades, in hypoxic
 environments.
-Our Greedy Algorithm can help us identify a motif that might be involved in the process.
+Our Greedy Algorithm can help us identify a motif that might be involved
+in the process.

 The transcription factor behind this behaviour is *DosR*; we'll identify its
 binding sites:

-#+BEGIN_SRC python  :results output
+#+BEGIN_SRC python
 def GreedyMotifSearch(Dna, k, t):
    BestMotifs = []
    for i in range(0, t):
@@ -686,7 +707,6 @@ print(Score(Motifs))
 Our algorithm is pretty fast, but it's not optimal. That is a
 characteristic of Greedy Algorithms: they trade optimality for speed.
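
To make the trade-off concrete, here is a compact sketch of a greedy motif search without pseudocounts. It is an illustration under stated assumptions, not the notebook's GreedyMotifSearch: all helper names are hypothetical, the number of strings is taken from =len(dna)= instead of a separate /t/ argument, and exact fractions are used so that ties are broken by position and not by rounding.

```python
from fractions import Fraction


def profile_from(motifs):
    """Per-column base frequencies of equal-length motifs."""
    t, k = len(motifs), len(motifs[0])
    return {
        base: [Fraction(sum(m[j] == base for m in motifs), t) for j in range(k)]
        for base in "ACGT"
    }


def probability(text, profile):
    """Probability that the profile generates text."""
    p = Fraction(1)
    for j, base in enumerate(text):
        p *= profile[base][j]
    return p


def most_probable_kmer(text, k, profile):
    """First k-mer of text with the highest probability under profile."""
    best, best_p = text[:k], Fraction(-1)
    for i in range(len(text) - k + 1):
        p = probability(text[i:i + k], profile)
        if p > best_p:
            best, best_p = text[i:i + k], p
    return best


def score(motifs):
    """Number of bases that disagree with the column majority."""
    total = 0
    for j in range(len(motifs[0])):
        column = [m[j] for m in motifs]
        total += len(column) - max(column.count(b) for b in "ACGT")
    return total


def greedy_motif_search(dna, k):
    """Seed with each k-mer of the first string, extend one string at a time."""
    best = [s[:k] for s in dna]
    for i in range(len(dna[0]) - k + 1):
        motifs = [dna[0][i:i + k]]
        for s in dna[1:]:
            # Commit to the locally best k-mer; never revisit the choice.
            motifs.append(most_probable_kmer(s, k, profile_from(motifs)))
        if score(motifs) < score(best):
            best = motifs
    return best


if __name__ == "__main__":
    print(greedy_motif_search(["ACGT", "TACG", "CGAC"], 3))
```

Because each k-mer is chosen once and never reconsidered, the search stays fast but can miss the motif set with the globally lowest score.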

-
 ** Vocabulary
 - k-mer: subsequence of length /k/ in a biological sequence
 - Frequency map: sequence --> frequency of the sequence