{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Basics of MetaCentrum job submissions\n", "## MetaCentrum information links\n", "### Read before using\n", "\n", "https://wiki.metacentrum.cz/wiki/Kategorie:For_beginners \n", "https://wiki.metacentrum.cz/wiki/How_to_compute \n", "https://wiki.metacentrum.cz/wiki/How_to_compute/Quick_start \n", "https://wiki.metacentrum.cz/wiki/How_to_compute/Interactive_jobs \n", "https://wiki.metacentrum.cz/wiki/How_to_compute/Job_management \n", "https://wiki.metacentrum.cz/wiki/Working_with_data \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## List of front-ends\n", "[Front-end](https://wiki.metacentrum.cz/wiki/Frontend) is a computer which you use to login to the MetaCentrum and from which you submit the jobs. You should never use it to **run** the commands unless you are sure they will take few minutes to finish. Each of the front-ends has a default home directory. But of course, you can simply `cd` to another home directory once you login.\n", "\n", "List of front-ends:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "| Front-end | Home directory/storage |\n", "|-----------------------|-----------------------------|\n", "|skirit.ics.muni.cz | /storage/brno2/home/ |\n", "|alfrid.meta.zcu.cz | /storage/plzen1/home/ |\n", "|tarkil.grid.cesnet.cz | /storage/praha1/home/ | \n", "|nympha.zcu.cz | /storage/plzen1/home/ |\n", "|minos.zcu.cz | /storage/plzen1/home/ |\n", "|perian.ncbr.muni.cz | /storage/brno2/home/ |\n", "|onyx.ncbr.muni.cz | /storage/brno2/home/ |\n", "|zuphux.cerit-sc.cz | /storage/brno3-cerit/home/ | " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Login to MetaCentrum\n", "\n", "Now let's go and login to a front-end.\n", "\n", "`ssh opplatek@perian.ncbr.muni.cz`\n", "\n", "What do we have here?\n", "\n", "`ls`\n", "\n", "We can also go to a different [storage](https://wiki.metacentrum.cz/wiki/File_systems_in_MetaCentrum). 
There are more storages than just the home directories, and you can access all of them. You should see your quota limits right after you log in to MetaCentrum.\n", "\n", "`cd /storage/brno9-ceitec/home/opplatek`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Setting up the project\n", "\n", "We will try to do an [example analysis](https://github.com/ekg/alignment-and-variant-calling-tutorial) at MetaCentrum. Now, we would like to start working with our data. But first, we should create a directory which will hold our project. It is necessary to keep a nice [organization](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005510) of the project, otherwise you will soon get lost.\n", "\n", "`mkdir -p projects/dnaseq/data/raw`\n", "\n", "This creates the whole directory path, even if some of the parent directories of `raw` do not exist yet.\n", "\n", "### Download the project data\n", "Now we can download the [data](https://www.ebi.ac.uk/ena/data/view/SRR1770413). If we know the files are very small, we can download them directly from the front-end. Please open the ENA link first and check what the data actually are. \n", "\n", "#### Fastq files\n", "`cd projects/dnaseq/data/raw\n", "wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR177/003/SRR1770413/SRR1770413_1.fastq.gz &\n", "wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR177/003/SRR1770413/SRR1770413_2.fastq.gz &`\n", "\n", "The `&` character sends a task to the background, so we can do several things in one terminal at the same time. To see all the running background tasks you can type `jobs`. You can also \"kill\" a background job with `kill %1`, where `%1` is the number assigned to the task, as shown by the `jobs` command. But be careful: once you send a job to the background and close the terminal, the job will stop! 
To avoid this, you can run `disown %1`, which 'disconnects' the job from the current terminal; the job then keeps running in the background even after you close the terminal.\n", "\n", "If the data are big or you have more files than in this example, you should use a regular job submission (see below) rather than the front-end.\n", "\n", "We will also need a reference genome, because we will align the data to a reference. Let's make another directory which will hold the reference. \n", "\n", "#### Reference genome\n", "Make a directory to store the reference sequences (genomes).\n", "\n", "`mkdir -p /storage/brno9-ceitec/home/opplatek/projects/dnaseq/data/references\n", "cd /storage/brno9-ceitec/home/opplatek/projects/dnaseq/data/references`\n", "\n", "`wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz`\n", "\n", "#### Directories to save the scripts\n", "Make a directory to store the scripts once we know how to process the data. \n", "`mkdir /storage/brno9-ceitec/home/opplatek/projects/dnaseq/src`\n", "\n", "### Starting a job at MetaCentrum\n", "Once we have the data and the reference, we can start with the processing. We will request an [interactive job](https://wiki.metacentrum.cz/wiki/How_to_compute/Interactive_jobs). This is suitable when we want to calculate something quickly or test our commands.\n", "\n", "#### Job request/submission\n", "`qsub -l walltime=1:0:0 -l select=1:ncpus=6:mem=4gb:scratch_local=10gb -I`\n", "\n", "We ask for a 1-hour job with 6 CPUs, 4 GB of RAM and 10 GB of scratch space. The `-I` flag tells the submission system we want an interactive job. To build the command used above you can visit the [PBSPro 'assembler'](https://metavo.metacentrum.cz/pbsmon2/qsub_pbspro).\n", "\n", "Once the job has started (this will take some time) we can continue with the calculation.\n", "\n", "It is best if you specify the variables at the beginning. 
We will specify the input and output directories and the reference we will use for the analysis.\n", "\n", "#### Variables\n", "`INPUT_DIR=/storage/brno9-ceitec/home/opplatek/projects/dnaseq/data/raw\n", "OUTPUT_DIR=/storage/brno9-ceitec/home/opplatek/projects/dnaseq/results\n", "REFERENCE=/storage/brno9-ceitec/home/opplatek/projects/dnaseq/data/references/GCF_000005845.2_ASM584v2_genomic.fna.gz`\n", "\n", "Now, it is important to copy all the input files to the assigned working directory, whose path is automatically stored in the variable `$SCRATCH`. \n", "\n", "#### Copy inputs\n", "`cp $INPUT_DIR/*.gz $SCRATCH/ &\n", "cp $REFERENCE $SCRATCH/\n", "wait`\n", "\n", "The `wait` command waits until everything running in the background has finished, and only then lets you move on. This is handy when you send a lot of tasks to the background and don't want to check each and every one of them to see whether it has finished. \n", "\n", "Now we can move to the `$SCRATCH`:\n", "\n", "`cd $SCRATCH`\n", "\n", "#### Quality check\n", "The first step of the analysis should always be an initial quality check. FastQC is one of the nice tools for this purpose. If we want to use it at MetaCentrum, we first need to add the module with this tool. The list of all tools installed at MetaCentrum is [here](https://wiki.metacentrum.cz/wiki/MetaCentrum_Application_List).\n", "\n", "To add the module, execute the following command:\n", "\n", "`module add fastQC-0.11.5`\n", "\n", "MetaCentrum often has several versions of a single tool, so be sure you always use the 'full' name including the version. \n", "\n", "Once FastQC is added we can analyze our fastq files. But first, we create an output directory for the results to keep the `$SCRATCH` clean.\n", "\n", "`mkdir -p $SCRATCH/qc/fastqc/raw\n", "fastqc -t $PBS_NUM_PPN -o $SCRATCH/qc/fastqc/raw *fastq.gz`\n", "\n", "This executes FastQC on all the files (`*`) whose names end with `fastq.gz`. 
The `*` character is a so-called wildcard and matches any number of characters (both digits and letters). `$PBS_NUM_PPN` is a special variable assigned by MetaCentrum to each submitted job (similar to the `$SCRATCH` variable, which is also unique for each job) and holds the number of assigned CPUs (cores). This way we use all the cores that were assigned to the job. Using the full path to the output directory ensures you save the results exactly where you want them. \n", "\n", "Once the analysis is finished, we can summarize the FastQC results (we have two) using the MultiQC tool. Again, look for [MultiQC](https://wiki.metacentrum.cz/wiki/Python_-_modules) at MetaCentrum. MultiQC is a Python package and is therefore 'hidden' in a not-so-nice module name, `python36-modules-gcc`. We can add the whole module and launch MultiQC. \n", "\n", "`module add python36-modules-gcc\n", "multiqc -o $SCRATCH/qc/fastqc/raw $SCRATCH/qc/fastqc/raw/`\n", "\n", "If you get an error saying 'This system supports the C.UTF-8 locale which is recommended...', you have to change the locale settings with\n", "\n", "`export LC_ALL=C.UTF-8\n", "export LANG=C.UTF-8`\n", "\n", "But how do we visualize the MultiQC results? It is an HTML report, but it is stored somewhere on the MetaCentrum storage. We can use the `scp` command to copy files from a remote server to our computer. To do this we have to specify what we want to copy and where. \n", "\n", "#### Copy results from the server (for an easy visualization)\n", "`scp -r $SCRATCH/qc/fastqc/raw opplatek@wolf01.ncbr.muni.cz:/home/opplatek/Desktop`\n", "\n", "The `-r` flag tells `scp` (the same as for plain `cp` when copying within the server) to copy a whole directory; without `-r` we would get an error. Then we specify the target `wolf` cluster/computer and the directory where we want to copy the results to. 
Once it's done, you can check the `Desktop` of your Wolf computer and see whether the results are there. \n", "\n", "If you are not at a computer which has an **alias**, it is easier to finish the job, copy the results to the `storage` and then fetch them from **your computer**, for example:\n", "\n", "`scp -r opplatek@perian.ncbr.muni.cz:/storage/brno9-ceitec/home/opplatek/projects/dnaseq/results/qc/fastqc/raw .`\n", "\n", "This command copies the directory at `/storage/brno9-ceitec/home/opplatek/projects/dnaseq/results/qc/fastqc/raw` to the current directory on your computer (`.`).\n", "\n", "The next step in the analysis could be raw **read preprocessing**, but we will skip it for now and go straight to the **alignment**.\n", "\n", "#### Alignment\n", "There are many [tools](https://www.ebi.ac.uk/~nf/hts_mappers/) for the alignment. For a DNA experiment, `BWA MEM` is often used. \n", "\n", "Again, add the [module](https://wiki.metacentrum.cz/wiki/BWA) first. \n", "\n", "`module add bwa-0.7.15`\n", "\n", "`BWA MEM` (like most other aligners) needs a **reference index** first. The index 'prepares' the reference for easier access by the aligner. However, `BWA MEM` cannot work with `gz` files, so we have to uncompress the reference first.\n", "\n", "If you remember, we stored the path to the reference in the variable `$REFERENCE` and copied the file to the `$SCRATCH`. But can we work directly with this variable? Let's see where it points. \n", "\n", "`echo $REFERENCE`\n", "\n", "It points to the original location of the reference on the storage, and we don't want to use that one; we would like to use the copy in the `$SCRATCH`. We can use a simple trick to get just the name of the reference and store it in the variable. 
For this, we can use the `basename` command.\n", "\n", "`REFERENCE=$(basename $REFERENCE)\n", "echo $REFERENCE`\n", "\n", "The `$( )` construct executes the command `basename $REFERENCE` and stores the result back into the same variable.\n", "\n", "But the reference is still in the `gz` format. We have to uncompress it and 'adjust' the variable accordingly. \n", "\n", "`gunzip $SCRATCH/$REFERENCE\n", "REFERENCE=${REFERENCE%.gz}\n", "echo $REFERENCE`\n", "\n", "The curly-bracket expansion in this form removes the '.gz' suffix from the name stored in the `$REFERENCE` variable.\n", "\n", "Now we can finally make the index.\n", "\n", "`bwa index $SCRATCH/$REFERENCE`\n", "\n", "But how do we align the files to the reference without typing out their names (so it will be automatic next time)? We can use a **for** loop together with our curly-bracket replacement. A `for` loop goes through all the given files/values and stops only when it has gone through all of them. \n", "\n", "`mkdir $SCRATCH/alignment`\n", "\n", "\n", "`for READ1 in *_1.fastq.gz\n", "do\n", "READ2=${READ1%_1.fastq.gz}_2.fastq.gz\n", "echo $READ1\n", "echo $READ2\n", "bwa mem -t $PBS_NUM_PPN $SCRATCH/$REFERENCE $SCRATCH/$READ1 $SCRATCH/$READ2 > $SCRATCH/alignment/${READ1%_1.fastq.gz}.sam\n", "done`\n", "\n", "It is good practice to indent the commands inside a loop with a `tab`; it is much easier to read. However, this notebook style of tutorial has some trouble with tabs, so we could not use them here. \n", "So what we did: we scanned the current directory for all files ending with `_1.fastq.gz`, storing each name in the `$READ1` variable. Then we removed the `_1.fastq.gz` from the name and appended `_2.fastq.gz`, which we stored as `$READ2`. 
Then we aligned `$READ1` and `$READ2` to the reference `$REFERENCE` and stored the output in a `sam` file.\n", "\n", "#### SAM->BAM, sorting and indexing\n", "The `sam` format is a text format which can be compressed into the binary `bam` format. This can be done using Samtools. As usual, find [Samtools](https://wiki.metacentrum.cz/wiki/SAMtools) at MetaCentrum and add the module. \n", "\n", "`module add samtools-1.6`\n", "\n", "Now we can do the conversion, and we can also get some basic alignment statistics. \n", "\n", "`mkdir $SCRATCH/qc/flagstat\n", "cd $SCRATCH/alignment/`\n", "\n", "`for file in *.sam\n", "do\n", "samtools view -@ $PBS_NUM_PPN -b $file > ${file%.sam}.bam\n", "samtools flagstat ${file%.sam}.bam > $SCRATCH/qc/flagstat/${file%.sam}.flagstat.txt\n", "rm $file # We don't need the sam file anymore\n", "done` \n", " \n", "As you can see, Samtools uses the `-@` option rather than `-t` to specify the number of threads for the calculation. Always check the tool's manual for the correct name of each parameter!\n", " \n", "We also often need to sort the alignment before further processing and index the sorted bam file. \n", "\n", "`for file in *.bam\n", "do\n", "samtools sort -@ $PBS_NUM_PPN -o ${file%.bam}.sorted.bam $file\n", "samtools index ${file%.bam}.sorted.bam\n", "rm $file # We don't need the unsorted bam file anymore\n", "done`\n", "\n", "#### Copy the results back to the storage\n", "We should not forget that we are doing the analysis in the `$SCRATCH`. We have to copy the results back to our storage, otherwise they will be deleted within a few hours/days. \n", "\n", "Copy the results back to the storage:\n", "\n", "`mkdir -p $OUTPUT_DIR\n", "cp -r $SCRATCH/qc $OUTPUT_DIR/ &\n", "cp -r $SCRATCH/alignment $OUTPUT_DIR/ &\n", "wait`\n", "\n", "And it is good to clean up after ourselves once the analysis is done, and then exit the job. 
\n", "\n", "`rm -r $SCRATCH/*\n", "exit`\n", "\n", "#### Finishing the analysis\n", "Hurray, you have just aligned your first experiment! \n", "\n", "## Visualization\n", "We can also visualize the alignment using [IGV](http://software.broadinstitute.org/software/igv/). This tool is better to use directly at Wolf computers. \n", "\n", "Download the aligned and sorted `bam` file, and the reference genome to the computer here (you can try it on your own). Then, open a new terminal and launch the visualization.\n", "\n", "`scp /storage/brno9-ceitec/home/opplatek/projects/dnaseq/data/references/*.gz opplatek@wolf01.ncbr.muni.cz:/scratch/opplatek/\n", "scp -r /storage/brno9-ceitec/home/opplatek/projects/dnaseq/results opplatek@wolf01.ncbr.muni.cz:/scratch/opplatek/`\n", "\n", "We can logout from MetaCentrum or we can just open a new terminal. \n", "\n", "Before we continue with the visualization we have to decompress (`gunzip`) the reference and index it, this time with `samtools`.\n", "\n", "**On wolf computers decompress and index the reference.** \n", "\n", "First, go to the directory and decompress.\n", "\n", "`cd /scratch/opplatek\n", "gunzip GCF_000005845.2_ASM584v2_genomic.fna.gz`\n", "\n", "And index with `samtools`\n", "\n", "`module add samtools:1.2\n", "samtools faidx GCF_000005845.2_ASM584v2_genomic.fna`\n", "\n", "Load IGV and launch it. \n", "\n", "`module add igv:2.3.60\n", "igv.sh`\n", "\n", "For visualization, please see IGV manual but you will need the reference genome in `fasta` format and the aligned `bam` file and `bai` index file (in the same directory as the `bam` file).\n", "\n", "You can also download IGV to your own computer and run it directly. But be careful, IGV is quite \"sensitive\" on available RAM memory. A lot of loaded `BAM` files will jam your computer. \n", "\n", "## Post-alignmnet QC\n", "We should always check the quality of our alignment before any further processing. 
One of the tools to get summary statistics is [Qualimap](http://qualimap.bioinfo.cipf.es/).\n", "\n", "We can download it (it's a Java tool that does not need installation) and run the QC. \n", "\n", "`wget https://bitbucket.org/kokonech/qualimap/downloads/qualimap_v2.2.1.zip\n", "unzip qualimap_v2.2.1.zip`\n", "\n", "Then go to the newly created folder (you can also move it to a directory where you keep your 'installed' tools) and run the analysis.\n", "\n", "`cd qualimap_v2.2.1/\n", "./qualimap -h`\n", "\n", "We see that Qualimap has quite a lot of different modules for the QC. We will use `bamqc`.\n", "\n", "`/home/opplatek/qualimap_v2.2.1/qualimap bamqc -bam /scratch/opplatek/results/alignment/SRR1770413.sorted.bam -nt 2`\n", "\n", "Note: You could run this in a loop to get all the results at once. The results are stored in the same directory as the input `BAM` file; they can also be put in a separate directory using the `-outdir` flag of Qualimap.\n", "\n", "### Running a regular job\n", "Once you have tested all the commands, you can put them into a script and submit the whole workflow as a 'regular' job. You can either create the script file on your computer and then upload it to MetaCentrum\n", "\n", "`scp opplatek@wolf01.ncbr.muni.cz:/scratch/opplatek/dnaseq.sh opplatek@perian.ncbr.muni.cz:/storage/brno9-ceitec/home/opplatek/projects/dnaseq/src/`\n", "\n", "(for this you have to be logged in at the Wolf computer or your own computer, not at MetaCentrum), \n", "\n", "or you can create it directly at MetaCentrum: \n", "\n", "`cd /storage/brno9-ceitec/home/opplatek/projects/dnaseq/src\n", "nano dnaseq.sh`\n", "\n", "and insert the commands you want the job to perform (all the way from specifying the variables to copying the results back and cleaning the `$SCRATCH`). 
Then, by pressing `ctrl+o`, you can save the file under the name you specified with the `nano` command, or change the output file name before saving.\n", "\n", "Once you have all the commands in place, you can submit the job as a 'regular' job submission. \n", "\n", "`cd /storage/brno9-ceitec/home/opplatek/projects/dnaseq/src\n", "qsub -l walltime=1:0:0 -l select=1:ncpus=6:mem=4gb:scratch_local=10gb dnaseq.sh`\n", "\n", "This submits the job. You can close the terminal, come back within an hour and check that your results were copied to the output directory you specified in the script. \n", "\n", "To see your queued, running, finished or failed jobs, you can visit your [MetaCentrum account](https://metavo.metacentrum.cz/pbsmon2/person) and open the *list of jobs*. You should see all the jobs that finished in the past 24/48 hours.\n", "\n", "#### Downloading a bunch of files\n", "If you have more than just two sequencing files, I recommend downloading them using the same kind of job submission as just explained. You then don't have to wait until everything is downloaded, as the databases can sometimes be very slow. \n", "\n", "For example, make a `download.sh` script in `/storage/brno9-ceitec/home/opplatek/projects/dnaseq/src` with the following lines\n", "\n", "`OUTPUT_DIR=/storage/brno9-ceitec/home/opplatek/projects/myprojectname\n", "cd $SCRATCH/\n", "wget ftp://...file1.gz &\n", "wget ftp://...file2.gz &\n", "wget ftp://...file3.gz &\n", "wget ftp://...file4.gz &\n", "wait\n", "mkdir -p $OUTPUT_DIR\n", "cp $SCRATCH/*.gz $OUTPUT_DIR\n", "rm -r $SCRATCH/*`\n", "\n", "and then submit it with \n", "\n", "`cd /storage/brno9-ceitec/home/opplatek/projects/dnaseq/src\n", "qsub -l walltime=24:0:0 -l select=1:ncpus=2:mem=60gb:scratch_local=100gb download.sh`\n", "\n", "#### More advanced job submissions and job control\n", "Of course, this is just a simple example. For more sophisticated job submission, script writing and analysis, please see the MetaCentrum wiki pages listed at the top. 
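Before submitting a script like `dnaseq.sh`, it can help to sanity-check the shell constructs it relies on — `basename` and the curly-bracket `${variable%suffix}` removal — in any terminal; no MetaCentrum job is needed for this. A minimal sketch, reusing the file names from the analysis above:

```shell
#!/bin/sh
# Sanity-check the name manipulations used in dnaseq.sh (runs anywhere, no cluster needed)

# Strip the directories, then the .gz suffix, from the reference path
REFERENCE=/storage/brno9-ceitec/home/opplatek/projects/dnaseq/data/references/GCF_000005845.2_ASM584v2_genomic.fna.gz
REFERENCE=$(basename $REFERENCE)
REFERENCE=${REFERENCE%.gz}
echo $REFERENCE   # GCF_000005845.2_ASM584v2_genomic.fna

# Derive the second read's file name from the first, as in the alignment loop
READ1=SRR1770413_1.fastq.gz
READ2=${READ1%_1.fastq.gz}_2.fastq.gz
echo $READ2       # SRR1770413_2.fastq.gz
```

If the two `echo` commands print the expected names, the same constructs will behave identically inside the submitted script.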
\n", "\n", "### Downloading the results to a Windows computer\n", "For information about accessing the MetaCentrum from a windows computer please see **Bi5444_Analysis_of_sequencing_data_lesson_04_unix_intro.pdf** in the study materials of the course. You will need [putty](https://www.putty.org/) to work in a terminal and [WinSCP](https://winscp.net/eng/download.php) to download the data. They are quite simple to use so don't worry. \n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.13" } }, "nbformat": 4, "nbformat_minor": 2 }