1 Commits

Author SHA1 Message Date
6561aa2caf Convert org mode README to markdown 2021-05-05 12:19:40 +02:00
2 changed files with 54 additions and 52 deletions

View File

@@ -1,68 +1,75 @@
# locigenesis # locigenesis
locigenesis is a tool that generates a human T-cell receptor (TCR), runs locigenesis is a tool that generates a human T-cell receptor (TCR), runs it through a sequence reader simulation tool and extracts CDR3.
it through a sequence reader simulation tool and extracts CDR3.
The goal of this project is to generate both HVR sequences with and The goal of this project is to generate both HVR sequences with and without sequencing errors, in order to create datasets for a Machine Learning algorithm.
without sequencing errors, in order to create datasets for a Machine
Learning algorithm.
<a id="orgb4db211"></a>
## Technologies ## Technologies
- [immuneSIM](https://github.com/GreiffLab/immuneSIM/): in silico - [immuneSIM](https://github.com/GreiffLab/immuneSIM/): in silico generation of human and mouse BCR and TCR repertoires
generation of human and mouse BCR and TCR repertoires - [CuReSim](http://www.pegase-biosciences.com/curesim-a-customized-read-simulator/): read simulator that mimics Ion Torrent sequencing
- [CuReSim](http://www.pegase-biosciences.com/curesim-a-customized-read-simulator/):
read simulator that mimics Ion Torrent sequencing
<a id="orgace1e30"></a>
## Installation ## Installation
This project uses [Nix](https://nixos.org/) to ensure reproducible This project uses [Nix](https://nixos.org/) to ensure reproducible builds.
builds.
1. Install Nix (compatible with MacOS, Linux and 1. Install Nix (compatible with MacOS, Linux and [WSL](https://docs.microsoft.com/en-us/windows/wsl/about)):
[WSL](https://docs.microsoft.com/en-us/windows/wsl/about)):
```bash
curl -L https://nixos.org/nix/install | sh curl -L https://nixos.org/nix/install | sh
```
2. Clone the repository: 1. Clone the repository:
```bash
git clone https://git.coolneng.duckdns.org/coolneng/locigenesis git clone https://git.coolneng.duckdns.org/coolneng/locigenesis
```
3. Change the working directory to the project: 1. Change the working directory to the project:
```bash
cd locigenesis cd locigenesis
```
4. Enter the nix-shell: 1. Enter the nix-shell:
```bash
nix-shell nix-shell
```
After running these commands, you will find yourself in a shell that After running these commands, you will find yourself in a shell that contains all the needed dependencies.
contains all the needed dependencies.
<a id="org531bad5"></a>
## Usage ## Usage
An execution script that accepts 2 parameters is provided, the following An execution script that accepts 2 parameters is provided, the following command invokes it:
command invokes it:
```bash
./generation.sh <number of sequences> <number of reads> ./generation.sh <number of sequences> <number of reads>
```
- \<number of sequences\>: an integer that specifies the number of - <number of sequences>: an integer that specifies the number of different sequences to generate
different sequences to generate - <number of reads>: an integer that specifies the number of reads to perform on each sequence
- \<number of reads\>: an integer that specifies the number of reads
to perform on each sequence
The script will generate 2 files under the data directory: The script will generate 2 files under the data directory:
|HVR.fastq | curesim-HVR.fastq | <table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
|:----:|:-----:|
|Contains the original CDR3 sequence|Contains CDR3 after the read simulation, with sequencing errors |
<colgroup>
<col class="org-left" />
<col class="org-left" />
</colgroup>
<tbody>
<tr>
<td class="org-left">HVR.fastq</td>
<td class="org-left">Contains the original CDR3 sequence</td>
</tr>
<tr>
<td class="org-left">CuReSim-HVR.fastq</td>
<td class="org-left">Contains CDR3 after the read simulation, with sequencing errors</td>
</tr>
</tbody>
</table>

View File

@@ -34,11 +34,7 @@ parse_metadata <- function(metadata) {
#' @return A \code{character} containing the gene sequence #' @return A \code{character} containing the gene sequence
match_id_sequence <- function(names, vdj_segments, id) { match_id_sequence <- function(names, vdj_segments, id) {
matches <- grep(names, pattern = id) matches <- grep(names, pattern = id)
if(id == "TRBJ2-2"){
row <- matches[2]
} else {
row <- matches[1] row <- matches[1]
}
return(as.character(vdj_segments[row])) return(as.character(vdj_segments[row]))
} }
@@ -110,9 +106,8 @@ get_cys_coordinates <- function(alignment) {
insertion <- unlist(Biostrings::insertion(alignment)) insertion <- unlist(Biostrings::insertion(alignment))
deletion <- unlist(Biostrings::deletion(alignment)) deletion <- unlist(Biostrings::deletion(alignment))
delta_coordinates <- handle_indels(insertion, deletion, cys, alignment) delta_coordinates <- handle_indels(insertion, deletion, cys, alignment)
read_start <- unlist(start(Biostrings::Views(alignment))) cys_start <- cys$start + delta_coordinates$start
cys_start <- cys$start + delta_coordinates$start + read_start - 1 cys_end <- cys$end + delta_coordinates$end
cys_end <- cys$end + delta_coordinates$end + read_start
return(list("start" = cys_start, "end" = cys_end)) return(list("start" = cys_start, "end" = cys_end))
} }
@@ -131,7 +126,7 @@ get_hvr_sequences <- function(sequences, vdj_segments, cores = detectCores()) {
) )
cys_coordinates <- parallel::mclapply(v_alignment, FUN = get_cys_coordinates) cys_coordinates <- parallel::mclapply(v_alignment, FUN = get_cys_coordinates)
cys_df <- as.data.frame(do.call(rbind, cys_coordinates)) cys_df <- as.data.frame(do.call(rbind, cys_coordinates))
remaining <- Biostrings::subseq(sequences, start = unlist(cys_df$end) + 1) remaining <- Biostrings::subseq(sequences, start = unlist(cys_df$end))
j_alignment <- parallel::mcmapply(remaining, j_alignment <- parallel::mcmapply(remaining,
df$j_seq, df$j_seq,
FUN = align_sequence, FUN = align_sequence,
@@ -150,4 +145,4 @@ get_hvr_sequences <- function(sequences, vdj_segments, cores = detectCores()) {
data <- parse_data(file = "data/curesim_sequence.fastq") data <- parse_data(file = "data/curesim_sequence.fastq")
hvr <- get_hvr_sequences(sequences = data[[1]], vdj_segments = data[[2]]) hvr <- get_hvr_sequences(sequences = data[[1]], vdj_segments = data[[2]])
Biostrings::writeXStringSet(hvr, "data/curesim-HVR.fastq", format = "fastq") Biostrings::writeXStringSet(hvr, "data/CuReSim-HVR.fastq", format = "fastq")