Refactor after proof reading

2020-01-26 18:20:25 +01:00 · 2020-01-26 18:20:25 +01:00 · 98cc6ebdb9
commit 98cc6ebdb9
parent 08e92db1eb
1 changed files with 72 additions and 52 deletions
--- a/Notebook.org
+++ b/Notebook.org
@ -1,19 +1,19 @@
 * Biology Meets Programming: Bioinformatics for Beginners
 ** Week 1
 *** DNA replication
 **** Origin of replication (ori)
-Locating an ori is key for gene therapy (e.g. viral vectors), to introduce a theraupetic gene.
+Locating an ori is key for gene therapy (e.g. viral vectors),
 to introduce a therapeutic gene.
 **** Computational approaches to find ori in Vibrio cholerae
 ***** Exercise: find Pattern
-We'll look for the *DnaA box* sequence, using a sliding window, in that case we will use this function to find out how many times
+We'll look for the *DnaA box* sequence using a sliding window.
-does a sequence appear in the genome:
+This function counts how many times
 a sequence appears in the genome:
 #+BEGIN_SRC python
 def PatternCount(Text, Pattern):
@ -25,7 +25,7 @@ def PatternCount(Text, Pattern):
 #+END_SRC
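The body of PatternCount is not shown in this diff. A minimal sketch of the sliding-window count, assuming overlapping matches are counted and using the hypothetical name =pattern_count=, might look like this:

```python
def pattern_count(text, pattern):
    """Count occurrences of pattern in text, overlaps included (assumed)."""
    k = len(pattern)
    count = 0
    # Slide a window of length k over every valid start position.
    for i in range(len(text) - k + 1):
        if text[i:i + k] == pattern:
            count += 1
    return count


print(pattern_count("GCGCG", "GCG"))
```

Python's built-in =str.count= skips overlapping matches, which is why the sketch walks the window explicitly.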
 For the second part, we're going to calculate the frequency map of the sequences
- of length /k/, for that purpose we'll use:
+ of length /k/:
 #+BEGIN_SRC python
 def FrequentWords(Text, k):
@ -52,8 +52,8 @@ def FrequencyMap(Text, k):
 ***** Exercise: Find the reverse complement of a sequence
-We're going to generate the reverse complement of a sequence, which is the complement of a sequence, read in the same direction (5' -> 3').
+We're going to generate the reverse complement of a sequence,
-In this case, we're going to use:
+which is the complement of the sequence, read in reverse so that it also runs 5' -> 3'.
 #+BEGIN_SRC python
 def ReverseComplement(Pattern):
@ -75,12 +75,13 @@ def Complement(Pattern):
    return compl
 #+END_SRC
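The full bodies of ReverseComplement and Complement are elided in this diff. As a sketch only, assuming an upper-case A/C/G/T alphabet and the hypothetical name =reverse_complement=, the same idea fits in one function:

```python
# Watson-Crick base pairing for an upper-case DNA alphabet (assumption).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(pattern):
    """Complement every base, then read the result back to front."""
    return "".join(COMPLEMENT[base] for base in reversed(pattern))


print(reverse_complement("AAAACCCGGT"))
```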
-After using our function on the /Vibrio Cholerae's/ genome, we realize that some of the frequent /k-mers/ are reverse complements of other frequent ones.
+After using our function on the /Vibrio cholerae/ genome, we realize that some
 of the frequent /k-mers/ are reverse complements of other frequent ones.
 ***** Exercise: Find a subsequence within a sequence
-We're going to find the ocurrences of a subsquence inside a sequence, and save the index of the first letter in the sequence.
+We're going to find the occurrences of a subsequence inside a sequence,
-This time, we'll use:
+and save the index of the first letter of each match.
 #+BEGIN_SRC python
 def PatternMatching(Pattern, Genome):
@ -91,15 +92,15 @@ def PatternMatching(Pattern, Genome):
    return positions
 #+END_SRC
-After using our function on the /Vibrio Cholerae's/ genome, we find out that the /9-mers/ with the highest frequency appear in cluster.
+We find out that the /9-mers/ with the highest frequency appear in clusters.
-This is strong statistical evidence that our subsequences are /DnaA boxes/.
+There is strong statistical evidence that our subsequences are /DnaA boxes/.
 **** Computational approaches to find ori in any bacteria
-Now that we're pretty confident about the /DnaA boxes/ sequences that we found, we are going to check if they are a common pattern in the rest of bacterias.
+Now that we're pretty confident about the /DnaA box/ sequences that we found,
-We're going to find the ocurrences of the sequences in /Thermotoga petrophila/
+we are going to check if they are a common pattern in other bacteria.
-with:
+We're going to find the occurrences of the sequences in /Thermotoga petrophila/:
 #+BEGIN_SRC python
 def PatternCount(Text, Pattern):
@ -110,31 +111,37 @@ def PatternCount(Text, Pattern):
    return count
 #+END_SRC
-After the execution, we observe that there are *no* ocurrences of the sequences found in /Vibrio Cholerae/.
+We observe that there are *no* occurrences of the sequences found in
-We can conclude that different bacterias have different /DnaA boxes/.
+/Vibrio cholerae/. We can conclude that different bacteria have
 different /DnaA boxes/.
-We have to try another computational approach then, find clusters of /k-mers/ repeated in a small interval.
+We have to try another computational approach:
 find clusters of /k-mers/ repeated in a small interval.
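The notebook does not show code for this clustering idea, so the following is only a sketch under assumptions: the name =find_clumps= and the parameters =window= and =min_count= are hypothetical, and the approach is the straightforward (slow) one that recounts every window.

```python
from collections import Counter


def find_clumps(genome, k, window, min_count):
    """Return the k-mers that occur at least min_count times
    inside some stretch of `window` consecutive bases."""
    clumps = set()
    for start in range(len(genome) - window + 1):
        # Count every k-mer that fits entirely inside this window.
        counts = Counter(
            genome[i:i + k] for i in range(start, start + window - k + 1)
        )
        clumps.update(kmer for kmer, n in counts.items() if n >= min_count)
    return clumps


print(sorted(find_clumps("ATATATGGGGGGGGCC", k=2, window=6, min_count=3)))
```

Recounting each window from scratch keeps the sketch short; on a real genome one would update the counts incrementally as the window slides.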
 ** Week 2
 *** DNA replication (II)
 **** Replication process
 The /DNA polymerases/ start replicating while the parent strands are unraveling.
-On the lagging strand, the DNA polymerase waits until the replication fork opens around 2000 nucleotides, and because of that it forms Okazaki fragments.
+On the lagging strand, the DNA polymerase waits until the replication fork
-We need 1 primer for the leading strand and 1 primer per Okazaki fragment for the lagging strand.
+opens around 2000 nucleotides, and because of that it forms Okazaki fragments.
-While the Okazaki fragments are being synthetized, a /DNA ligase/ starts joining the fragments together.
+We need 1 primer for the leading strand and 1 primer per Okazaki fragment
 for the lagging strand. While the Okazaki fragments are being synthesized,
 a /DNA ligase/ starts joining the fragments together.
 **** Computational approach to find ori using deamination
-As the lagging strand is always waiting for the helicase to go forward, the lagging strand is mostly in single-stranded configuration, which is more prone to mutations.
+As the lagging strand is always waiting for the helicase to go forward, the
-One frequent form of mutation is *deamination*, a process that causes cytosine to convert into thymine. This means that cytosine is more frequent in half of the genome.
+lagging strand is mostly in single-stranded configuration,
 which is more prone to mutations. One frequent form of mutation is
 *deamination*, a process that causes cytosine to convert into thymine.
 This means that cytosine is more frequent in half of the genome.
 ***** Exercise: count the occurrences of cytosine
 We're going to count the occurrences of the bases in a genome and include them in
-a symbol array, for that purpose we'll use:
+a symbol array.
 #+BEGIN_SRC python
 def SymbolArray(Genome, symbol):
@ -159,7 +166,6 @@ After executing the program, we realize that the algorithm is too inefficient.
 ***** Exercise: find a better algorithm for the previous exercise
 This time, we are going to evaluate an element /i+1/, using the element /i/.
 We'll use the following algorithm:
 #+BEGIN_SRC python
 def FasterSymbolArray(Genome, symbol):
@ -184,18 +190,20 @@ def PatternCount(Text, Pattern):
    return count
 #+END_SRC
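The body of FasterSymbolArray is elided in this diff. A sketch of the incremental update described above, assuming a circular genome and a window of half the genome length (the name =faster_symbol_array= and the list return type are assumptions):

```python
def faster_symbol_array(genome, symbol):
    """counts[i] = occurrences of symbol in the half-genome window starting at i."""
    n = len(genome)
    half = n // 2
    # Append the first half so windows can wrap around the circular genome.
    extended = genome + genome[:half]
    counts = [extended[:half].count(symbol)]
    for i in range(1, n):
        current = counts[-1]
        # The base that just left the window.
        if extended[i - 1] == symbol:
            current -= 1
        # The base that just entered the window.
        if extended[i + half - 1] == symbol:
            current += 1
        counts.append(current)
    return counts


print(faster_symbol_array("AAAAGGGG", "A"))
```

Only the first window is counted in full; every later entry costs two comparisons, which is where the linear running time comes from.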
-After executing the program we see that it's a viable algorithm, with a complexity of /O(n)/ instead of the previous /O(n²)/.
+It's a viable algorithm, with a complexity of /O(n)/ instead of the previous /O(n²)/.
 In /Escherichia coli/ we plotted the result of our program:
 #+CAPTION: Symbol array for cytosine in the E. coli genome
 [[./Assets/e-coli.png]]
-From that graph, we conclude that ori is located around position 4000000, because that's where the Cytosine concentration is the lowest,
+We can conclude that ori is located around position 4000000,
 because that's where the Cytosine concentration is the lowest,
 which indicates that the region stays single-stranded for the longest time.
 **** The Skew Diagram
-Usually scientists measure the difference between /G - C/, which is *higher on the lagging strand* and *lower on the leading strand*.
+Usually scientists measure the difference /G - C/, which is
 *higher on the lagging strand* and *lower on the leading strand*.
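As a sketch of how this running /G - C/ difference can be computed (the names =skew_array= and =minimum_skew_positions= are hypothetical, and the list-based return value is an assumption):

```python
def skew_array(genome):
    """skew[i] = (#G - #C) over the first i bases of genome."""
    skew = [0]
    for base in genome:
        if base == "G":
            skew.append(skew[-1] + 1)
        elif base == "C":
            skew.append(skew[-1] - 1)
        else:
            # A and T leave the skew unchanged.
            skew.append(skew[-1])
    return skew


def minimum_skew_positions(genome):
    """Positions where the skew reaches its minimum (candidate ori locations)."""
    skew = skew_array(genome)
    lowest = min(skew)
    return [i for i, value in enumerate(skew) if value == lowest]


print(skew_array("CATGGGCATCGGCCATACGCC"))
print(minimum_skew_positions("CATGGGCATCGGCCATACGCC"))
```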
 ***** Exercise: Synthesize a Skew Array
@ -252,17 +260,27 @@ def SkewArray(Genome):
 **** Finding /DnaA boxes/
-When we look for /DnaA boxes/ in the minimal skew region, we can't find highly repeated /9-mers/ in /Escherichia Coli/.
+When we look for /DnaA boxes/ in the minimal skew region,
-But we find approximate sequences that are similar to our /9-mers/ and only differ in 1 nucleotide.
+we can't find highly repeated /9-mers/ in /Escherichia coli/. However, we find
 approximate sequences that are similar to our /9-mers/ and only differ in 1 nucleotide.
 ***** Exercise: Calculate Hamming distance
-The Hamming distance is the number of mismatches between 2 strings, we'll solve this problem in [[./Code/HammingDistance][HammingDistance]]
+The Hamming distance is the number of mismatches between 2 strings of equal length.
 #+BEGIN_SRC python
 def HammingDistance(p, q):
    count = 0
    for i in range(0, len(p)):
        if p[i] != q[i]:
            count += 1
    return count
 #+END_SRC
 ***** Exercise: Find approximate patterns
-Now that we have our Hamming distance, we have to find the approximate
+Now that we have our Hamming distance, we use it to find
-sequences:
+the approximate sequences:
 #+BEGIN_SRC python
 def ApproximatePatternMatching(Text, Pattern, d):
@ -283,7 +301,6 @@ def HammingDistance(p, q):
    return count
 #+END_SRC
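The body of ApproximatePatternMatching is elided in this diff. A minimal sketch, assuming it returns the start index of every window within Hamming distance /d/ of the pattern (the names =hamming_distance= and =approximate_pattern_positions= are hypothetical):

```python
def hamming_distance(p, q):
    """Number of mismatching positions between two equal-length strings."""
    return sum(a != b for a, b in zip(p, q))


def approximate_pattern_positions(text, pattern, d):
    """Start indices of windows that differ from pattern in at most d bases."""
    k = len(pattern)
    return [
        i
        for i in range(len(text) - k + 1)
        if hamming_distance(text[i:i + k], pattern) <= d
    ]


print(approximate_pattern_positions("ACGTACTA", "ACG", 1))
```

An exact match has distance 0, so it already passes the =<= d= test and needs no separate equality check.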
 ***** Exercise: Count the approximate patterns
 The final part is counting the approximate sequences:
@ -291,8 +308,11 @@ The final part is counting the approximate sequences:
 #+BEGIN_SRC python
 def ApproximatePatternCount(Pattern, Text, d):
    count = 0
-    for i in range(len(Text)-len(Pattern)+1):
+    for i in range(len(Text) - len(Pattern) + 1):
-        if Text[i:i+len(Pattern)] == Pattern or HammingDistance(Text[i:i+len(Pattern)], Pattern) <= d:
+        if (
            Text[i : i + len(Pattern)] == Pattern
            or HammingDistance(Text[i : i + len(Pattern)], Pattern) <= d
        ):
            count += 1
    return count
@ -305,8 +325,8 @@ def HammingDistance(p, q):
    return count
 #+END_SRC
-
+After trying out our ApproximatePatternCount in the hypothesized ori region,
-After trying out our ApproximatePatternCount in the hypothesized ori region, we find a frequent /k-mer/ with its reverse complement in /Escherichia Coli/.
+we find a frequent /k-mer/ with its reverse complement in /Escherichia coli/.
 We've finally found a computational method to find ori that seems correct.
 ** Week 3
@ -319,11 +339,11 @@ Variation in gene expression permits the cell to keep track of time.
 ***** Exercise: Find the most common nucleotides in each position
-We are going to create a *t x k* Motif Matrix, where *t* is the /k-mer/ string. In each position, we'll insert the most frequent nucleotide, in upper case,
+We are going to create a *t x k* Motif Matrix, where *t* is the number of /k-mer/ strings.
 In each position, we'll insert the most frequent nucleotide, in upper case,
 and the nucleotide in lower case (if there's no popular one).
-Our goal is to select the *most* conserved Matrix, i.e. the Matrix with the most upper case letters.
+Our goal is to select the *most* conserved Matrix, i.e. the Matrix with the most
-We'll use a *4 x k* Count Matrix, one row for each base. We'll first generate
+upper case letters. We'll use a *4 x k* Count Matrix, one row for each base.
 We'll first generate the Matrix:
 #+BEGIN_SRC python
 def Count(Motifs):
@ -390,7 +410,7 @@ def Count(Motifs):
    return count
 #+END_SRC
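To make the link between the Count Matrix, the Consensus string and the score concrete, here is a compact sketch. The names =count_matrix=, =consensus= and =score= are hypothetical, and ties are broken in A, C, G, T order, which is an assumption rather than something the notebook states.

```python
def count_matrix(motifs):
    """4 x k matrix: counts[base][j] = times base appears in column j."""
    k = len(motifs[0])
    counts = {base: [0] * k for base in "ACGT"}
    for motif in motifs:
        for j, base in enumerate(motif):
            counts[base][j] += 1
    return counts


def consensus(motifs):
    """Most frequent base per column (ties go to the first of A, C, G, T)."""
    counts = count_matrix(motifs)
    k = len(motifs[0])
    return "".join(max("ACGT", key=lambda b: counts[b][j]) for j in range(k))


def score(motifs):
    """Total number of mismatches between the motifs and their consensus."""
    cons = consensus(motifs)
    return sum(a != b for motif in motifs for a, b in zip(motif, cons))


motifs = ["AACGTA", "CCCGTT", "CACCTT", "GGATTA", "TTCCGG"]
print(count_matrix(motifs)["A"], consensus(motifs), score(motifs))
```

A lower score means a more conserved set of motifs, which is the quantity the later exercises try to minimize.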
-After obtaining the Consensus string, all we need to do is obtains the total
+After obtaining the Consensus string, all we need to do is obtain the total
 score of our selected /k-mers/:
 #+BEGIN_SRC python
@ -438,8 +458,8 @@ def HammingDistance(p, q):
 ***** Exercise: Find a set of /k-mers/ that minimize the score
-Applying a brute force approach for this task is not viable, we'll use a Greedy Algorithm. For that, we'll first determine the probability
+Applying a brute force approach for this task is not viable, so we'll use a Greedy
-of a sequence, we'll use:
+Algorithm. We first have to determine the probability of a sequence:
 #+BEGIN_SRC python
 def Pr(Text, Profile):
@ -563,15 +583,16 @@ def Pr(Text, Profile):
 ***** Motifs in tuberculosis
-Tuberculosis is an infectious disease, cause by a bacteria called /Mycobacterium
+Tuberculosis is an infectious disease, caused by a bacterium called /Mycobacterium
 tuberculosis/. The bacterium can stay latent in the host for decades, in hypoxic
 environments.
-Our Greedy Algorithm can help us identify a motif that might be involved in the process.
+Our Greedy Algorithm can help us identify a motif that might be involved
 in the process.
 The transcription factor behind this behaviour is *DosR*; we'll identify its
 binding sites:
-#+BEGIN_SRC python  :results output
+#+BEGIN_SRC python
 def GreedyMotifSearch(Dna, k, t):
    BestMotifs = []
    for i in range(0, t):
@ -686,7 +707,6 @@ print(Score(Motifs))
 Our algorithm is pretty fast, but it's not optimal. That's a general
 characteristic of Greedy Algorithms: they trade optimality for speed.
 ** Vocabulary
 - k-mer: a subsequence of length /k/ in a biological sequence
 - Frequency map: sequence --> frequency of the sequence