diff --git a/Notebook.org b/Notebook.org
index 95eeb26..5c47339 100644
--- a/Notebook.org
+++ b/Notebook.org
@@ -1,19 +1,19 @@
 * Biology Meets Programming: Bioinformatics for Beginners
-
-** Week 1
-
+** Week 1
 *** DNA replication
 **** Origin of replication (ori)
-Locating an ori is key for gene therapy (e.g. viral vectors), to introduce a theraupetic gene.
+Locating an ori is key for gene therapy (e.g. viral vectors),
+to introduce a therapeutic gene.
 **** Computational approaches to find ori in Vibrio Cholerae
 ***** Exercise: find Pattern
-We'll look for the *DnaA box* sequence, using a sliding window, in that case we will use this function to find out how many times
-does a sequence appear in the genome:
+We'll look for the *DnaA box* sequence, using a sliding window, in that case
+we will use this function to find out how many times
+a sequence appears in the genome:
 #+BEGIN_SRC python
 def PatternCount(Text, Pattern):
@@ -25,7 +25,7 @@ def PatternCount(Text, Pattern):
 #+END_SRC
 For the second part, we're going to calculate the frequency map of the sequences
- of length /k/, for that purpose we'll use:
+ of length /k/:
 #+BEGIN_SRC python
 def FrequentWords(Text, k):
@@ -52,8 +52,8 @@ def FrequencyMap(Text, k):
 ***** Exercise: Find the reverse complement of a sequence
-We're going to generate the reverse complement of a sequence, which is the complement of a sequence, read in the same direction (5' -> 3').
-In this case, we're going to use:
+We're going to generate the reverse complement of a sequence,
+which is the complement of a sequence, read in the same direction (5' -> 3').
 #+BEGIN_SRC python
 def ReverseComplement(Pattern):
@@ -75,12 +75,13 @@ def Complement(Pattern):
     return compl
 #+END_SRC
-After using our function on the /Vibrio Cholerae's/ genome, we realize that some of the frequent /k-mers/ are reverse complements of other frequent ones.
+After using our function on the /Vibrio Cholerae's/ genome, we realize that some
+of the frequent /k-mers/ are reverse complements of other frequent ones.
 ***** Exercise: Find a subsequence within a sequence
-We're going to find the ocurrences of a subsquence inside a sequence, and save the index of the first letter in the sequence.
-This time, we'll use:
+We're going to find the occurrences of a subsequence inside a sequence,
+and save the index of the first letter in the sequence.
 #+BEGIN_SRC python
 def PatternMatching(Pattern, Genome):
@@ -91,15 +92,15 @@ def PatternMatching(Pattern, Genome):
     return positions
 #+END_SRC
-After using our function on the /Vibrio Cholerae's/ genome, we find out that the /9-mers/ with the highest frequency appear in cluster.
-This is strong statistical evidence that our subsequences are /DnaA boxes/.
+We find out that the /9-mers/ with the highest frequency appear in clusters.
+There is strong statistical evidence that our subsequences are /DnaA boxes/.
 **** Computational approaches to find ori in any bacteria
-Now that we're pretty confident about the /DnaA boxes/ sequences that we found, we are going to check if they are a common pattern in the rest of bacterias.
-We're going to find the ocurrences of the sequences in /Thermotoga petrophila/
-with:
+Now that we're pretty confident about the /DnaA boxes/ sequences that we found,
+we are going to check if they are a common pattern in the rest of bacteria.
+We're going to find the occurrences of the sequences in /Thermotoga petrophila/:
 #+BEGIN_SRC python
 def PatternCount(Text, Pattern):
@@ -110,31 +111,37 @@ def PatternCount(Text, Pattern):
     return count
 #+END_SRC
-After the execution, we observe that there are *no* ocurrences of the sequences found in /Vibrio Cholerae/.
-We can conclude that different bacterias have different /DnaA boxes/.
+We observe that there are *no* occurrences of the sequences found in
+/Vibrio Cholerae/.
We can conclude that different bacteria have
+different /DnaA boxes/.
-We have to try another computational approach then, find clusters of /k-mers/ repeated in a small interval.
+We have to try another computational approach,
+find clusters of /k-mers/ repeated in a small interval.
 ** Week 2
-
 *** DNA replication (II)
 **** Replication process
 The /DNA polymerases/ start replicating while the parent strands are unraveling.
-On the lagging strand, the DNA polymerase waits until the replication fork opens around 2000 nucleotides, and because of that it forms Okazaki fragments.
-We need 1 primer for the leading strand and 1 primer per Okazaki fragment for the lagging strand.
-While the Okazaki fragments are being synthetized, a /DNA ligase/ starts joining the fragments together.
+On the lagging strand, the DNA polymerase waits until the replication fork
+opens around 2000 nucleotides, and because of that it forms Okazaki fragments.
+We need 1 primer for the leading strand and 1 primer per Okazaki fragment
+for the lagging strand. While the Okazaki fragments are being synthesized,
+a /DNA ligase/ starts joining the fragments together.
 **** Computational approach to find ori using deamination
-As the lagging strand is always waiting for the helicase to go forward, the lagging strand is mostly in single-stranded configuration, which is more prone to mutations.
-One frequent form of mutation is *deamination*, a process that causes cytosine to convert into thymine. This means that cytosine is more frequent in half of the genome.
+As the lagging strand is always waiting for the helicase to go forward, the
+lagging strand is mostly in single-stranded configuration,
+which is more prone to mutations. One frequent form of mutation is
+*deamination*, a process that causes cytosine to convert into thymine.
+This means that cytosine is more frequent in half of the genome.
 ***** Exercise: count the occurrences of cytosine
 We're going to count the occurrences of the bases in a genome and include them in
-a symbol array, for that purpose we'll use:
+a symbol array.
 #+BEGIN_SRC python
 def SymbolArray(Genome, symbol):
@@ -159,7 +166,6 @@ After executing the program, we realize that the algorithm is too inefficient.
 ***** Exercise: find a better algorithm for the previous exercise
 This time, we are going to evaluate an element /i+1/, using the element /i/.
-We'll use the following algorithm:
 #+BEGIN_SRC python
 def FasterSymbolArray(Genome, symbol):
@@ -184,18 +190,20 @@ def PatternCount(Text, Pattern):
     return count
 #+END_SRC
-After executing the program we see that it's a viable algorithm, with a complexity of /O(n)/ instead of the previous /O(n²)/.
+It's a viable algorithm, with a complexity of /O(n)/ instead of the previous /O(n²)/.
 In /Escherichia Coli/ we plotted the result of our program:
 #+CAPTION: Symbol array for Cytosine in E. Coli Genome
 [[./Assets/e-coli.png]]
-From that graph, we conclude that ori is located around position 4000000, because that's where the Cytosine concentration is the lowest,
+We can conclude that ori is located around position 4000000,
+because that's where the Cytosine concentration is the lowest,
 which indicates that the region stays single-stranded for the longest time.
 **** The Skew Diagram
-Usually scientists measure the difference between /G - C/, which is *higher on the lagging strand* and *lower on the leading strand*.
+Usually scientists measure the difference between /G - C/, which is
+*higher on the lagging strand* and *lower on the leading strand*.
 ***** Exercise: Synthesize a Skew Array
@@ -252,17 +260,27 @@ def SkewArray(Genome):
 **** Finding /DnaA boxes/
-When we look for /DnaA boxes/ in the minimal skew region, we can't find highly repeated /9-mers/ in /Escherichia Coli/.
-But we find approximate sequences that are similar to our /9-mers/ and only differ in 1 nucleotide.
+When we look for /DnaA boxes/ in the minimal skew region,
+we can't find highly repeated /9-mers/ in /Escherichia Coli/. But we find
+approximate sequences that are similar to our /9-mers/ and only differ in 1 nucleotide.
 ***** Exercise: Calculate Hamming distance
-The Hamming distance is the number of mismatches between 2 strings, we'll solve this problem in [[./Code/HammingDistance][HammingDistance]]
+The Hamming distance is the number of mismatches between 2 strings.
+
+#+BEGIN_SRC python
+def HammingDistance(p, q):
+    count = 0
+    for i in range(0, len(p)):
+        if p[i] != q[i]:
+            count += 1
+    return count
+#+END_SRC
 ***** Exercise: Find approximate patterns
-Now that we have our Hamming distance, we have to find the approximate
-sequences:
+Now that we have our Hamming distance, we use it to find
+the approximate sequences:
 #+BEGIN_SRC python
 def ApproximatePatternMatching(Text, Pattern, d):
@@ -283,7 +301,6 @@ def HammingDistance(p, q):
     return count
 #+END_SRC
-
 ***** Exercise: Count the approximate patterns
 The final part is counting the approximate sequences:
@@ -291,8 +308,11 @@ The final part is counting the approximate sequences:
 #+BEGIN_SRC python
 def ApproximatePatternCount(Pattern, Text, d):
     count = 0
-    for i in range(len(Text)-len(Pattern)+1):
-        if Text[i:i+len(Pattern)] == Pattern or HammingDistance(Text[i:i+len(Pattern)], Pattern) <= d:
+    for i in range(len(Text) - len(Pattern) + 1):
+        if (
+            Text[i : i + len(Pattern)] == Pattern
+            or HammingDistance(Text[i : i + len(Pattern)], Pattern) <= d
+        ):
             count += 1
     return count
@@ -305,8 +325,8 @@ def HammingDistance(p, q):
     return count
 #+END_SRC
-
-After trying out our ApproximatePatternCount in the hypothesized ori region, we find a frequent /k-mer/ with its reverse complement in /Escherichia Coli/.
+After trying out our ApproximatePatternCount in the hypothesized ori region,
+we find a frequent /k-mer/ with its reverse complement in /Escherichia Coli/.
 We've finally found a computational method to find ori that seems correct.
 ** Week 3
@@ -319,11 +339,11 @@ Variation in gene expression permits the cell to keep track of time.
 ***** Exercise: Find the most common nucleotides in each position
-We are going to create a *t x k* Motif Matrix, where *t* is the /k-mer/ string. In each position, we'll insert the most frequent nucleotide, in upper case,
+We are going to create a *t x k* Motif Matrix, where *t* is the number of /k-mer/ strings.
+In each position, we'll insert the most frequent nucleotide, in upper case,
 and the nucleotide in lower case (if there's no popular one).
-Our goal is to select the *most* conserved Matrix, i.e. the Matrix with the most upper case letters.
-We'll use a *4 x k* Count Matrix, one row for each base. We'll first generate
-the Matrix:
+Our goal is to select the *most* conserved Matrix, i.e. the Matrix with the most
+upper case letters. We'll use a *4 x k* Count Matrix, one row for each base.
 #+BEGIN_SRC python
 def Count(Motifs):
@@ -390,7 +410,7 @@ def Count(Motifs):
     return count
 #+END_SRC
-After obtaining the Consensus string, all we need to do is obtains the total
+After obtaining the Consensus string, all we need to do is obtain the total
 score of our selected /k-mers/:
 #+BEGIN_SRC python
@@ -438,8 +458,8 @@ def HammingDistance(p, q):
 ***** Exercise: Find a set of /k-mers/ that minimize the score
-Applying a brute force approach for this task is not viable, we'll use a Greedy Algorithm. For that, we'll first determine the probability
-of a sequence, we'll use:
+Applying a brute force approach for this task is not viable, we'll use a Greedy
+Algorithm.
We first have to determine the probability of a sequence:
 #+BEGIN_SRC python
 def Pr(Text, Profile):
@@ -563,15 +583,16 @@ def Pr(Text, Profile):
 ***** Motifs in tuberculosis
-Tuberculosis is an infectious disease, cause by a bacteria called /Mycobacterium
+Tuberculosis is an infectious disease, caused by a bacterium called /Mycobacterium
 tuberculosis/. The bacteria can stay latent in the host for decades, in hypoxic
 environments.
-Our Greedy Algorithm can help us identify a motif that might be involved in the process.
+Our Greedy Algorithm can help us identify a motif that might be involved
+in the process.
 The transcription factor behind this behaviour is *DosR*, we'll identify the
 binding sites:
-#+BEGIN_SRC python :results output
+#+BEGIN_SRC python
 def GreedyMotifSearch(Dna, k, t):
     BestMotifs = []
     for i in range(0, t):
@@ -686,7 +707,6 @@ print(Score(Motifs))
 Our algorithm is pretty fast, but it's not optimal, and that's just a
 characteristic of Greedy Algorithms, they trade optimality for speed.
-
 ** Vocabulary
 - k-mer: subsequences of length /k/ in a biological sequence
 - Frequency map: sequence --> frequency of the sequence
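The two vocabulary entries above can be illustrated with a minimal sketch. The diff only shows the signature of the notebook's =FrequencyMap=, so this is not its implementation: the names =frequency_map= and =most_frequent= are hypothetical, and the use of =collections.Counter= is an assumption made for brevity.

```python
from collections import Counter


def frequency_map(text, k):
    """Map every k-mer in text to its number of occurrences (overlaps count)."""
    # Slide a window of width k over the text; Counter tallies each substring.
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


def most_frequent(text, k):
    """Return the k-mers with the highest count, sorted alphabetically."""
    freq = frequency_map(text, k)
    if not freq:  # text shorter than k: there are no k-mers at all
        return []
    top = max(freq.values())
    return sorted(kmer for kmer, n in freq.items() if n == top)
```

For instance, =frequency_map("ACGTACGT", 4)= counts the 4-mer =ACGT= twice and every other 4-mer once, so =most_frequent= would return only =ACGT=.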