List of front-ends

A front-end is a computer you use to log in to MetaCentrum and from which you submit jobs. You should never run commands on the front-end itself unless you are sure they will finish within a few minutes. Each front-end has a default home directory, but of course you can simply cd to another home directory once you log in.

List of front-ends:

Front-end Home directory/storage
skirit.ics.muni.cz /storage/brno2/home/
alfrid.meta.zcu.cz /storage/plzen1/home/
tarkil.grid.cesnet.cz /storage/praha1/home/
nympha.zcu.cz /storage/plzen1/home/
minos.zcu.cz /storage/plzen1/home/
perian.ncbr.muni.cz /storage/brno2/home/
onyx.ncbr.muni.cz /storage/brno2/home/
zuphux.cerit-sc.cz /storage/brno3-cerit/home/

Login to MetaCentrum

Now let's log in to a front-end.

ssh opplatek@perian.ncbr.muni.cz

What do we have here?

ls

We can also go to a different storage volume. There are more storage volumes than just the home directories, and you can access all of them. You should see the quota limits right after you log in to MetaCentrum.

cd /storage/brno9-ceitec/home/opplatek

Setting up the project

We will try to run an [example analysis](https://github.com/ekg/alignment-and-variant-calling-tutorial) at MetaCentrum. Now we would like to start working with our data. But first, we should create a directory to hold our project. It is important to keep the project well organized, otherwise you will soon get lost.

mkdir -p projects/dnaseq/data/raw

The -p flag creates all the directories in the path, even if some of them (the parent directories of the raw directory) do not exist yet.
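You can check that the whole tree was created, for example with a recursive listing:

ls -R projects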

Download the project data

Now we can download the data. If we know the files are very small, we can download them directly from the front-end. Please open the ENA link first and check what the data actually are.

Fastq files

cd projects/dnaseq/data/raw
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR177/003/SRR1770413/SRR1770413_1.fastq.gz &
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR177/003/SRR1770413/SRR1770413_2.fastq.gz &

The & character sends a task to the background so we can run several things in one terminal at the same time. To see all the running jobs, type jobs. You can also "kill" a background job with kill %1, where %1 is the number assigned to the task, as shown by the jobs command. But be careful: once you send a job to the background and close the terminal, the job will stop! To avoid this, run disown %1, which 'disconnects' the job from the current terminal so it keeps running in the background even after the terminal is closed.
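A minimal illustration of this workflow, using sleep as a stand-in for a long-running command:

sleep 300 &   # start two dummy long-running tasks in the background
sleep 600 &
jobs          # list the background tasks with their %N numbers
kill %1       # terminate the first one
disown %2     # detach the second so it survives closing the terminal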

If the data are big or you have more files than in this example, you should use a regular job submission (see below) rather than the front-end.

We will also need a reference genome because we will align the data to it. Let's make another directory to hold the reference.

Reference genome

Make a directory to store the reference sequences (genomes).

mkdir -p /storage/brno9-ceitec/home/opplatek/projects/dnaseq/data/references
cd /storage/brno9-ceitec/home/opplatek/projects/dnaseq/data/references

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz

Directories to save the scripts

Make a directory to store the scripts once we know how to process the data.

mkdir /storage/brno9-ceitec/home/opplatek/projects/dnaseq/src

Starting a job at MetaCentrum

Once we have the data and the reference, we can start with the processing. We will request an interactive job. This is suitable when we want to compute something quick or test our commands.

Job request/submission

qsub -l walltime=1:0:0 -l select=1:ncpus=6:mem=4gb:scratch_local=10gb -I

We ask for a 1-hour job with 6 CPUs, 4 GB of RAM, and 10 GB of scratch space. The -I flag tells the submission system that we want an interactive job. To build the command above, you can use the PBSPro 'assembler'.

Once the job starts (this will take some time), we can continue with the calculation.
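While you wait, you can check the state of your jobs from another terminal with the standard PBS command (replace the username with yours):

qstat -u opplatek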

It is best to specify the variables at the beginning. We will specify the input and output directories and the reference we will use for the analysis.

Variables

INPUT_DIR=/storage/brno9-ceitec/home/opplatek/projects/dnaseq/data/raw
OUTPUT_DIR=/storage/brno9-ceitec/home/opplatek/projects/dnaseq/results
REFERENCE=/storage/brno9-ceitec/home/opplatek/projects/dnaseq/data/references/GCF_000005845.2_ASM584v2_genomic.fna.gz

Now it is important to copy all the input files to the assigned working directory, whose path is automatically stored in the $SCRATCH variable.

Copy inputs

cp $INPUT_DIR/*.gz $SCRATCH/ &
cp $REFERENCE $SCRATCH/ &
wait

The wait command waits until everything running in the background finishes and only then lets you move on. This is useful when you send a lot of tasks to the background but don't want to check each one individually to see whether it has finished.

And we can move to the $SCRATCH.

cd $SCRATCH

Quality check

The first step of every experiment should be an initial quality check. FastQC is one of the nice tools for this purpose. If we want to use it at MetaCentrum, we first need to add a module with this tool. The list of all tools installed at MetaCentrum is here.

To add the module, execute the following command:

module add fastQC-0.11.5

MetaCentrum often has several versions of a single tool, so be sure you always use the 'full' module name including the version.
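To list the available versions of a tool, you can search the module list (module avail prints to stderr, hence the redirection):

module avail 2>&1 | grep -i fastqc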

Once the FastQC module is added, we can analyze our fastq files. But first we create an output directory for the results, to keep the $SCRATCH clean.

mkdir -p $SCRATCH/qc/fastqc/raw
fastqc -t $PBS_NUM_PPN -o $SCRATCH/qc/fastqc/raw *fastq.gz

This executes FastQC on all files (*) ending with fastq.gz. The * character is a so-called wildcard and matches any number of characters (both digits and letters). The $PBS_NUM_PPN is a special variable assigned by MetaCentrum to each job submission (similar to the $SCRATCH variable, which is also unique for each job) and holds the number of assigned CPUs (cores). This way we use all the cores that were assigned to the job. Using the full path to the output directory ensures the results are saved exactly where you want them.
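You can check both expansions yourself before running the tool:

echo *fastq.gz        # shows which files the wildcard matches
echo $PBS_NUM_PPN     # shows the number of CPUs assigned to this job (here 6)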

Once the analysis is finished, we can summarize the FastQC results (we have two of them) using the MultiQC tool. Again, look for MultiQC at MetaCentrum. MultiQC is a Python package and is therefore 'hidden' in a not-so-nicely named module called python36-modules-gcc. We can add the whole module and launch MultiQC.

module add python36-modules-gcc
multiqc -o $SCRATCH/qc/fastqc/raw $SCRATCH/qc/fastqc/raw/

In case you get an error saying 'This system supports the C.UTF-8 locale which is recommended...', you have to change the locale settings by:

export LC_ALL=C.UTF-8
export LANG=C.UTF-8

But how do we visualize the MultiQC results? It is an HTML report, but it is stored somewhere on the MetaCentrum storage. We can use the scp command to copy files from a remote server to our computer. To do this, we have to specify what we want to copy and where.

Copy results from the server (for an easy visualization)

scp -r $SCRATCH/qc/fastqc/raw opplatek@wolf01.ncbr.muni.cz:/home/opplatek/Desktop

The -r flag tells scp (just as it does for plain cp when copying within the server) to copy a whole directory. Without -r we would get an error. Then we tell it to copy to the wolf cluster/computer (here) and the directory we want to copy to. Once it's done, check the Desktop of your Wolf computer to see whether the results are there.

If you are not at a computer which has such an alias, it is easier to finish the job, copy the results to the storage, and then copy them from your computer, for example:

scp -r opplatek@perian.ncbr.muni.cz:/storage/brno9-ceitec/home/opplatek/projects/dnaseq/results/qc/fastqc/raw .

This command copies the directory /storage/brno9-ceitec/home/opplatek/projects/dnaseq/results/qc/fastqc/raw to the current directory on your computer (.).

The next step in the analysis could be raw read preprocessing, but we will skip it for now and go straight to the alignment.

Alignment

There are many alignment tools. For a DNA experiment, BWA MEM is often used.

Again, add the module first.

module add bwa-0.7.15

BWA MEM (like most other aligners) needs a reference index to be built first. The index 'prepares' the reference for easier access by the aligner. However, BWA MEM cannot work with gz files, so we have to uncompress the reference first.

If you remember, we stored the path to the reference in the $REFERENCE variable and copied the file to the $SCRATCH. But can we work directly with this variable? Let's see where it points.

echo $REFERENCE

It points to the original location of the reference on the storage, which is not the copy we want to use. We would like to use the one we copied to the $SCRATCH. We can use a simple trick to extract just the file name and store it back in the variable: the basename command.

REFERENCE=$(basename $REFERENCE)
echo $REFERENCE

The $(...) construct executes the command basename $REFERENCE and stores its output back into the variable of the same name.

But still, the reference is in a gz format. We have to uncompress it and 'adjust' the variable name.

gunzip $SCRATCH/$REFERENCE
REFERENCE=${REFERENCE%.gz}
echo $REFERENCE

The expression with 'curly' brackets in this form removes the '.gz' suffix from the file name stored in the $REFERENCE variable.
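To see how this suffix removal works on its own, try it on one of our file names:

FILE=SRR1770413_1.fastq.gz
echo ${FILE%.gz}                      # SRR1770413_1.fastq
echo ${FILE%_1.fastq.gz}              # SRR1770413
echo ${FILE%_1.fastq.gz}_2.fastq.gz   # SRR1770413_2.fastq.gz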

Now we can finally make the index.

bwa index $SCRATCH/$REFERENCE

But how do we align the files to the reference without typing the names by hand (so we can make it automatic next time)? We can use a for loop together with the curly-bracket replacement. A for loop goes through all the given files/values and stops only after it has processed all of them.

mkdir $SCRATCH/alignment

for READ1 in *_1.fastq.gz
do
	READ1=$READ1 # Redundant, doesn't have to be here at all
	READ2=${READ1%_1.fastq.gz}_2.fastq.gz
	echo $READ1
	echo $READ2
	bwa mem -t $PBS_NUM_PPN $SCRATCH/$REFERENCE $SCRATCH/$READ1 $SCRATCH/$READ2 > $SCRATCH/alignment/${READ1%_1.fastq.gz}.sam
done

It is good practice to indent the commands inside a loop with a tab; it makes the loop much clearer to read. So what did we do? We scanned the current directory for all files ending with _1.fastq.gz and stored each name in the $READ1 variable. Then we removed the _1.fastq.gz suffix and appended _2.fastq.gz, storing the result as $READ2. Finally, we aligned $READ1 and $READ2 to the reference $REFERENCE and stored the output in a SAM file.

SAM->BAM, sorting and indexing

The SAM format is a text format which can be compressed into the binary BAM format. This can be done with Samtools. As usual, find Samtools at MetaCentrum and add the module.

module add samtools-1.6

Now we can do the conversion and we can also see some basic alignment statistics.

mkdir $SCRATCH/qc/flagstat
cd $SCRATCH/alignment/

for file in *.sam
do
	samtools view -@ $PBS_NUM_PPN -b $file > ${file%.sam}.bam
	samtools flagstat ${file%.sam}.bam > $SCRATCH/qc/flagstat/${file%.sam}.flagstat.txt
	rm $file # We don't need the sam file anymore
done

As you can see, Samtools does not use the usual -t flag to specify the number of threads but -@. Always check the tool's manual for the correct name of each parameter!
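When in doubt, running a Samtools subcommand without arguments prints its usage, including the threading flag:

samtools view   # with no arguments, prints the usage for the view subcommand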

We also often need to sort the alignment before further processing and index the sorted bam file.

for file in *.bam
do
	samtools sort -@ $PBS_NUM_PPN -o ${file%.bam}.sorted.bam $file
	samtools index ${file%.bam}.sorted.bam
	rm $file # We don't need the unsorted bam file anymore
done

Copy the results back to the storage

We should not forget that we are doing the analysis in the $SCRATCH. We have to copy the results back to our storage, otherwise they will be deleted in a few hours/days.

Copy the results back to the storage

mkdir -p $OUTPUT_DIR
cp -r $SCRATCH/qc $OUTPUT_DIR/ &
cp -r $SCRATCH/alignment $OUTPUT_DIR/ &
wait

It is also good to clean up after ourselves once the analysis is done, and then exit the job.

rm -r $SCRATCH/*
exit

Finishing the analysis

Hurray, you have just aligned your first experiment!

Visualization

We can also visualize the alignment using IGV. It is better to run this tool directly on the Wolf computers.

Download the aligned and sorted BAM file and the reference genome to the computer here (you can try it on your own). Then open a new terminal and launch the visualization.

scp /storage/brno9-ceitec/home/opplatek/projects/dnaseq/data/references/*.gz opplatek@wolf01.ncbr.muni.cz:/scratch/opplatek/
scp -r /storage/brno9-ceitec/home/opplatek/projects/dnaseq/results opplatek@wolf01.ncbr.muni.cz:/scratch/opplatek/

We can logout from MetaCentrum or we can just open a new terminal.

Before we continue with the visualization we have to decompress (gunzip) the reference and index it, this time with samtools.

On wolf computers decompress and index the reference.

First, go to the directory and decompress.

cd /scratch/opplatek
gunzip GCF_000005845.2_ASM584v2_genomic.fna.gz

And index with samtools

module add samtools:1.2
samtools faidx GCF_000005845.2_ASM584v2_genomic.fna

Load IGV and launch it.

module add igv:2.3.60
igv.sh

For the visualization itself, please see the IGV manual. You will need the reference genome in FASTA format, the aligned BAM file, and its BAI index file (in the same directory as the BAM file).

You can also download IGV to your own computer and run it directly. But be careful: IGV is quite "sensitive" to available RAM. Loading a lot of BAM files can jam your computer.

Post-alignment QC

We should always check the quality of our alignment before any further processing. One of the tools to get summarized statistics is Qualimap.

We can download it here (it's a Java tool that does not need installation) and run the QC.

wget https://bitbucket.org/kokonech/qualimap/downloads/qualimap_v2.2.1.zip
unzip qualimap_v2.2.1.zip

Then go to the newly created folder (you can also move it to a directory where you keep your 'installed' tools) and run the analysis.

cd qualimap_v2.2.1/
./qualimap -h

We see that Qualimap has quite a lot of different modules for the QC. We will use bamqc.

/home/opplatek/qualimap_v2.2.1/qualimap bamqc -bam /scratch/opplatek/results/alignment/SRR1770413.sorted.bam -nt 2

Note: You could run this in a loop to get all the results at once (see the sketch below). The results are stored in the same directory as the input BAM file; they can also be put in a separate directory using Qualimap's -outdir flag.
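A minimal sketch of such a loop, assuming the sorted BAM files sit in /scratch/opplatek/results/alignment:

for bam in /scratch/opplatek/results/alignment/*.sorted.bam
do
	/home/opplatek/qualimap_v2.2.1/qualimap bamqc -bam $bam -outdir ${bam%.bam}_qualimap -nt 2
done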

Running a regular job

Once you have tested all the commands, you can put them into a script and submit the whole workflow as a 'regular' job. You can either create the script file on your computer and then upload it to MetaCentrum:

scp opplatek@wolf01.ncbr.muni.cz:/scratch/opplatek/dnaseq.sh opplatek@perian.ncbr.muni.cz:/storage/brno9-ceitec/home/opplatek/projects/dnaseq/src/

For this you have to be logged in at the Wolf computer (or your own computer), not at MetaCentrum.

Or you can create it right away at MetaCentrum.

cd /storage/brno9-ceitec/home/opplatek/projects/dnaseq/src
nano dnaseq.sh

and insert the commands you want the job to perform (all the way from specifying the variables to copying the results back and cleaning the $SCRATCH). Then press Ctrl+O to save the file under the name you gave to nano, or change the name to save it as a different file.
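A skeleton of such a dnaseq.sh, assembled from the commands used above (the analysis steps in the middle are abbreviated):

#!/bin/bash
INPUT_DIR=/storage/brno9-ceitec/home/opplatek/projects/dnaseq/data/raw
OUTPUT_DIR=/storage/brno9-ceitec/home/opplatek/projects/dnaseq/results
REFERENCE=/storage/brno9-ceitec/home/opplatek/projects/dnaseq/data/references/GCF_000005845.2_ASM584v2_genomic.fna.gz

cp $INPUT_DIR/*.gz $SCRATCH/ &
cp $REFERENCE $SCRATCH/ &
wait
cd $SCRATCH

# ... the FastQC, BWA MEM and Samtools steps from above ...

mkdir -p $OUTPUT_DIR
cp -r $SCRATCH/qc $OUTPUT_DIR/ &
cp -r $SCRATCH/alignment $OUTPUT_DIR/ &
wait
rm -r $SCRATCH/*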

Once you have all the commands you can submit the job by 'regular' job submission.

cd /storage/brno9-ceitec/home/opplatek/projects/dnaseq/src
qsub -l walltime=1:0:0 -l select=1:ncpus=6:mem=4gb:scratch_local=10gb dnaseq.sh

which should submit the job. You can close the terminal, come back within an hour, and your results should be copied to the output directory you specified in the script.

To see queued, running, finished, or failed jobs, you can visit your MetaCentrum account and open the list of jobs. You should see all the jobs from the past 24/48 hours.

Downloading a bunch of files

If you have more than just two sequencing files, I recommend downloading them using the same job submission as just explained. You won't have to wait until everything is downloaded, as the databases can sometimes be very slow.

I recommend making a download.sh script in /storage/brno9-ceitec/home/opplatek/projects/dnaseq/src with the following lines:

OUTPUT_DIR=/storage/brno9-ceitec/home/opplatek/projects/myprojectname
cd $SCRATCH/
wget ftp://...file1.gz &
wget ftp://...file2.gz &
wget ftp://...file3.gz &
wget ftp://...file4.gz &
wait
mkdir -p $OUTPUT_DIR
cp $SCRATCH/*.gz $OUTPUT_DIR
rm -r $SCRATCH/*

And then submit it with:

cd /storage/brno9-ceitec/home/opplatek/projects/dnaseq/src
qsub -l walltime=24:0:0 -l select=1:ncpus=2:mem=60gb:scratch_local=100gb download.sh

More advanced job submissions and job control

Of course, this is just a simple example. For more sophisticated job submission, script writing, and analysis, please see the MetaCentrum wiki linked at the top.

Downloading the results to a Windows computer

For information about accessing MetaCentrum from a Windows computer, please see Bi5444_Analysis_of_sequencing_data_lesson_04_unix_intro.pdf in the study materials of the course. You will need PuTTY to work in a terminal and WinSCP to download the data. They are quite simple to use, so don't worry.