RNA-Seq Analysis in WSL – Part 2 : Raw sequence reads to transcript abundance

After installation of the software needed for performing RNA-Seq analysis, we will now see how to do it with an example.

The RNA-Seq analysis pipeline can be divided into four main parts:

  1. Alignment of raw reads to the genome.
  2. Assembly of the alignments into transcripts of full-length.
  3. Quantification of each gene and transcript
  4. Differential expression analysis between different experimental conditions.

The method decribed below is taken from the following reference:

Pertea, Mihaela, Daehwan Kim, Geo M. Pertea, Jeffrey T. Leek, and Steven L. Salzberg. “Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown.” Nature protocols 11, no. 9 (2016): 1650-1667.

We will be following most of the things as per this protocol with some modifications for doing things more easily.

Data

The data provide in the protocol is a subset of a large dataset of raw RNA-Seq reads. These reads map to the chromosome-X of humans. The reads are from two conditions, e.g., wild type or mutant. For each condition, 6 replicates are included.

For more details and better understanding, it is recommended to read the original publication referenced above.

Download data

First, we need to download the data. The data will be downloaded via the linux terminal of WSL.

Start the WSL as administrator. Then make a working directory. This working directory will contain all the files that we download and all the output data that would be generated during the RNA-Seq process.

Bash
cd ~
mkdir work
cd work
mkdir rnaseq_ex1
cd rnaseq_ex1

The work directory is the parent directory in which we can make different sorts of projects. The rnaseq_ex1 directory inside the work directory will be the place where we store all files related to our example RNA-Seq project here.

After the commands above, are inside the working directory. We can download the data.

Bash
wget ftp://ftp.ccb.jhu.edu/pub/RNAseq_protocol/chrX_data.tar.gz

This is about 2 GB data, hence, would take some time.

To extract files from this downloaded data execute following command:

Bash
tar xvzf chrX_data.tar.gz chrX_data/

You should be able to see chrX_data created.

Bash
ls -lrth chrX_data
output
total 24K
drwxr-xr-x 2 <username> <username> 4.0K Jan 15 2016 genome
drwxr-xr-x 2 <username> <username> 4.0K Jan 15 2016 indexes
drwxr-xr-x 2 <username> <username> 4.0K Jan 15 2016 samples
-rw-r--r-- 1 <username> <username> 337 Jan 15 2016 geuvadis_phenodata.csv
drwxr-xr-x 2 <username> <username> 4.0K Jul 13 2016 genes
-rw-r--r-- 1 <username> <username> 456 Nov 17 2024 mergelist.txt

The <username> will be your username.

The geuvadis_phenodata.csv file has details about the samples.

The mergelist.txt file along with geuvadis_phenodata.csv will be used during the analysis process.

The subdirectories, genome, indexes, samples, and genes will have other files in them.

The genome directory contains the sequence file (chrX.fa), of chromosome X (GRCh38 build 81). The raw reads will be aligned to this sequence.

The indexes directory contains the HISAT2 indexes for chromosome X.

The genes directroy contains chrX.gtf, which is the human gene annotation for the GRCh38 from the RefSeq database.

The samples directory contains the paired-end RNA-seq reads for 12 samples, with 6 samples for each population, GBR (British from England) and YRA (Yoruba from Ibadan, Nigeria). Each population has 3 samples from male and 3 samples from female individuals. The sequences are in fastq format. Each sample has two files, one for each pair of read.

Let’s start with the first step of alignment of raw sequence reads to the genome.

Alignment

Alignment of the reads can be done using a set of commands one by one for each sample. Although the paper provides a shell script to run all the commands at once, it is overwhelming for people like me who do not know much of shell scripting. The shell script essentially generates the set of commands to be run for all samples and executes them.

For people like me who understand python programming, the same set of commands can be generated and written into a file using python. The output file will be a text file with .sh extension, which can be run as a shell script.

First, we will see what are the set of commands to be run for one sample. But, before that we will setup the output directory and its subdirectories to store the output files created during this process.

Setup output directory

Below are commands for one sample. The outputs will be written in output directory. So, create new directory inside the rnaseq_ex1 directory by typing:

Bash
mkdir output

on the console first.

It is a good practice to put outputs of each tool into their own directories. So, before beginning, we will make four directories inside the output directory. To do that first cd to the output directory. Make sure that you are already in the work/rnaseq_ex1 directory and then execute following commands on terminal.

Bash
cd output
mkdir o_hisat2
mkdir o_samtools
mkdir o_stringtie
mkdir for_ballgown
cd ..

The o_ represents that it is an output directory for ease of reading for the human reader and to separate them from any directories that contain the software code of these software.

Alignment for the first sample, ERR188044

Step 1: Map the reads for each sample to the reference genome

Bash
hisat2 -p 4 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188044_chrX_1.fastq.gz -2 chrX_data/samples/ERR188044_chrX_2.fastq.gz -S output/o_hisat2/ERR188044_chrX.sam

The above command used HISAT2 to align the paired end sequence reads of sample ERR188044 to the provided HISAT2 indexes.

  • The -x chrX_data/indexes/chrX_tran specifies the HISAT2 index to align against.
  • -p 4 asks the system to use 4 CPU threads. The original protocol uses -@ 8 but my computer was not able to allocate these many threads, so I used -@ 4.
  • --dta argument produces alignments suitable for downstream software like Stringtie and Cufflinks. In our case we would be using Stringtie. dta stands for downstream-transcriptome-assembly. This option reuqires longer anchor lengths for de-novo discovery of splice sites. This results in fewer short-anchor alignments and thus improved computational efficiency and memory usage.
  • The arguments -1 and -2 are followed by path to the two paired end sequences, formally called as ‘mate 1’ and ‘mate 2’.
  • -S output/o_hisat2/ERR188044_chrX.sam defines the output file path. The output file has .sam extension.

In this case, the .sam output file is about 730 MB.

Step 2: Sort and convert the SAM files to BAM

The .sam alignment file that we get in above step is unsorted with respect to the position on the chromosome. The below command takes an unsorted SAM file, sorts the reads by genomic position, compresses it, and outputs a sorted BAM file.

Bash
samtools sort -@ 4 -o output/o_hisat2/ERR188044_chrX.bam output/o_samtools/ERR188044_chrX.sam
  • samtools sort is the sorting command.
  • -@ 4 means that we are using 4 CPU threads.
  • output/o_samtools/ERR188044_chrX.sam is the input file path.
  • -o output/o_hisat2/ERR188044_chrX.bam defines the output file path.

The output of the command is as follows:

output
[bam_sort_core] merging from 0 files and 4 in-memory blocks...

The .bam output file is about 140 MB.

Step 3: Assemble transcripts using Stringie

Stringtie will use following inputs

  • samtools output, output/o_samtools/ERR188044_chrX.bam
  • reference annotation file chrX_data/genes/chrX.gtf for GRCh38 from the RefSeq database.

Following is the command for assembling the transcript.

Bash
stringtie -p 4 -G chrX_data/genes/chrX.gtf -o output/o_stringtie/ERR188044_chrX.gtf output/o_samtools/ERR188044_chrX.bam -l ERR188044
  • -p 4 means that we are using 4 CPU threads.
  • -G chrX_data/genes/chrX.gtf gives path to the gene annotation file.
  • -o output/o_stringtie/ERR188044_chrX.gtf define the output file path. A .gtf file will be created.
  • output/o_samtools/ERR188044_chrX.bam is the input file.
  • -l ERR188044 sets the prefix for transcript ids in the output gtf file. For example, the transcript ids from this command will have names such as “ERR188044.1” and “ERR188044.2”.

The .gtf file created is about 2 MB.

Above three steps revised

So, till now for one sample we used following commands one by one.

Bash
cd work/rnaseq_ex1/
Bash
hisat2 -p 4 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188044_chrX_1.fastq.gz -2 chrX_data/samples/ERR188044_chrX_2.fastq.gz -S output/ERR188044_chrX.sam
Bash
samtools sort -@ 4 -o output/ERR188044_chrX.bam output/ERR188044_chrX.sam
Bash
stringtie -p 4 -G chrX_data/genes/chrX.gtf -o output/ERR188044_chrX.gtf output/ERR188044_chrX.bam -l ERR188044

Running this process for all samples

For each samples, we have to repeat the process. We can either execute all the commands one by one, or write all the commands to a shell script (text file with a .sh extension) and execute it in one go.

The above commands are very similar when run on multiple samples. They will differ only in the place where sample name is written. So, if you have the list of sample names, we can write a small python program to create a shell script and then run that shell script.

Save the python file linked here to work/rnaseq_ex1 directory.

Once this python file is created, run

Bash
python3 rnaseq_shell_1.py

This should write all shell commands you need for performing the above three steps for all samples to the file rnaseq_shell_commands_1.sh.

To run this file, we first have to modify its permissions.

Bash
chmod +x rnaseq_shell_commands_1.sh

Then run

Bash
./rnaseq_shell_commands_1.sh

It took my laptop about 10 minutes to complete running this script for all 12 samples.

The .gtf files for all samples will be created.

Merge transcripts

Now we need to merge the transcripts from all samples into one gtf file. The chrX_data/mergelist.txt file contains the file paths of gtf files created by stringtie. You have saved the output of stringtie in output/o_stringtie directory. So, we have to edit the file to include the full path. Copy the below text and paste into the chrX_data/mergelist.txt file.

text
output/o_stringtie/ERR188044_chrX.gtf
output/o_stringtie/ERR188104_chrX.gtf
output/o_stringtie/ERR188234_chrX.gtf
output/o_stringtie/ERR188245_chrX.gtf
output/o_stringtie/ERR188257_chrX.gtf
output/o_stringtie/ERR188273_chrX.gtf
output/o_stringtie/ERR188337_chrX.gtf
output/o_stringtie/ERR188383_chrX.gtf
output/o_stringtie/ERR188401_chrX.gtf
output/o_stringtie/ERR188428_chrX.gtf
output/o_stringtie/ERR188454_chrX.gtf
output/o_stringtie/ERR204916_chrX.gtf

If we have large number of samples, this modification of files can be automated. However, here we can do it manually.

Now you can run the following command.

Bash
stringtie --merge -p 4 -G chrX_data/genes/chrX.gtf -o output/stringtie_merged/stringtie_merged.gtf chrX_data/mergelist.txt
  • -o output/stringtie_merged/stringtie_merged.gtf defines the path to the merged .gtf file.

This took just few seconds to complete.

Here you do not need to create the strintie_merged directory. It stringtie will create it for you. You can check the result file of this command by using.

Bash
ls output/stringtie_merged/

Estimate transcript abundances

To estimate transcript abundances and table counts for ballgown for one sample ERR188044.

Bash
stringtie -e -B -p 4 -G output/stringtie_merged/stringtie_merged.gtf -o output/for_ballgown/ERR188044_chrX.gtf output/o_samtools/ERR188044_chrX.bam
  • The -e argument estimates the expression only for the transcripts present in the provided GTF.
  • -G output/stringtie_merged/stringtie_merged.gtf uses the merged GTF from previous step.
  • -p 4 makes the system use 4 CPU threads.
  • - B creates folder stucture compatible with ballgown R-package.
  • -o output/for_ballgown/ERR188044_chrX.gtf defines output file path.
  • output/o_samtools/ERR188044_chrX.bam is the input file path for sample ERR188044.

To see the outputs:

Bash
ls -lrth output/for_ballgown/

Running this for all samples

However, again we have to run this command for all samples. So we will create another python script to create all these commands in a bash script. Following is the python script. Save this as rnaseq_shell_2.py in current directory, i.e., ~/work/rnaseq_ex1.

Python
"""
Script to generate bash script for estimating transcript abundances and table counts for ballgown.

"""

# list of sample names
sample_names = ["ERR188044",
                "ERR188104",
                "ERR188234",
                "ERR188245",
                "ERR188257",
                "ERR188273",
                "ERR188337",
                "ERR188383",
                "ERR188401",
                "ERR188428",
                "ERR188454",
                "ERR204916"]


with open('rnaseq_est_trans_abundances.sh', 'w') as f:
    for name in sample_names:
        write_str = f"stringtie -e -B -p 4 -G output/stringtie_merged/stringtie_merged.gtf -o output/		/{name}/{name}_chrX.gtf output/o_samtools/{name}_chrX.bam"
        
        print(write_str, file=f)

To run this python script execute following in console

Bash
python3 rnaseq_shell_2.py

This will create the shell script file, rnaseq_est_trans_abundances.sh. Like we did previously, we need to first change the permissions for it to run.

Bash
chmod +x rnaseq_est_trans_abundances.sh

Then run it. This will take a few minutes to complete.

Bash
./rnaseq_est_trans_abundances.sh

After this gets completed, the files created can be viewed using:

Bash
ls -lrth output/for_ballgown/

This is when we will have generated the estimates of transcript abundances from our RNA-seq sequence data. To perform the differential expression analysis, we will use ballgown, in R environment. Assuming the above steps were done in WSL in Windows system, the differential expression analysis can be performed in R (with Rstudio) which is installed in Windows. We will see this in the next post in this series.