Unix/Linux Crash Course for Computational Biologists
RECETOX Bioinformatics
September 24, 2025

What is Linux?
• Linux = operating system kernel created in 1991 by Linus Torvalds.
• Inspired by Unix (1970s, AT&T Bell Labs), which was designed for multi-user, multitasking systems.
• Distributed as open-source software under the GNU General Public License (GPL).
• Most systems we call “Linux” are really GNU/Linux distributions (Ubuntu, CentOS, Debian, Fedora, etc.).
• Powers servers, smartphones (Android), supercomputers, embedded devices, and HPC clusters.
• Core philosophy: small, modular tools combined into powerful workflows.

[Image: Tux, the Linux mascot]

Why Linux for computational biology?
• Ubiquity: Most HPC clusters, cloud VMs, and Docker/Singularity containers run Linux by default.
• Tool availability: Bioinformatics software (bwa, samtools, STAR, Nextflow, nf-core) is typically developed and tested on Linux first.
• Performance: Efficient use of CPU, memory, GPUs, and large storage systems.
• Remote access: Multi-user servers reached via SSH are almost always Linux-based, which enables collaborative computing.
• Automation: Shell scripting and command-line tools combine to form reproducible pipelines.
• Package ecosystems: Conda/mamba (with Bioconda), apt, yum, etc., make installing software easier.
• Reproducibility and containers: Linux is the base for Docker/Singularity images, ensuring portable environments.
• Community and support: Tutorials, mailing lists, forums, and shared practices.

Shell and Terminal
• Terminal: where you type. Shell: command interpreter (bash).
• Prompt: user@server:~$

Try:
    bash --version
    echo "Hello, shell!"

Anatomy of a Linux command
• General structure: command [options] [arguments]
• Command: the program you are calling (e.g., ls, grep, samtools).
• Options / flags: modify behavior.
  • Short: -h, -l; can often be combined (ls -lah).
  • Long: --help, --color=auto.
• Arguments: the input(s) for the command (files, patterns, numbers).
• Example: grep -Rin --include "*.tsv" "pattern" data/
  • grep = command
  • -Rin and --include "*.tsv" = options
  • "pattern" = search term (argument)
  • data/ = where to search (argument)

Filesystem mental model
• Single tree from /. Home is ~.
• Hidden files start with a dot (e.g., ~/.bashrc).
• Case-sensitive names.

Explore:
    ls /
    ls -al ~

Paths: absolute vs relative
• Absolute: /data/project1/results/
• Relative: results/
• Shortcuts: . (here), .. (parent), ~ (home)

    pwd
    cd /mnt/data/unix_linux_crash_course/data
    cd ../
    cd ~/scratch

Core navigation & listing
    pwd
    ls -lah
    cd path/to/dir
    mkdir newdir
    cp source dest
    cp -r dirA dirB
    mv oldname newname
    rm file ; rm -r dir/

Inspect files
    cat file.txt
    less file.txt          # q quits, /pattern searches
    head -n 20 file.txt
    tail -n 20 file.txt
    tail -f logs/app.log

Bio examples:
    zcat data/reads.fastq.gz | head -n 8

Find things
    which bwa
    find . -name "*.fastq.gz"
    grep "^@" data/reads.fastq    # note: quality lines can also start with @
    grep -Rin --include "*.tsv" "pattern" .

Short (-h) vs Long (--help) flags
• Use --help and man to learn options.

    command --help
    man command
    ls -lh --color=auto
    cut -f1,3 --delimiter=$'\t' data/table.tsv

Redirection and pipes
• > overwrite, >> append, | pipe.
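The three operators can be tried safely on throwaway files first. The sketch below is illustrative only: the scratch directory from mktemp and the file name notes.txt are assumptions, not part of the course data.

```shell
#!/usr/bin/env bash
# Work in a scratch directory so no real data is touched.
work=$(mktemp -d)

echo "first"  >  "$work/notes.txt"   # > creates the file (or overwrites it)
echo "second" >  "$work/notes.txt"   # overwrites again: "first" is gone
echo "third"  >> "$work/notes.txt"   # >> appends a new line at the end

# | sends the output of sort into head, which keeps only the first line.
first_sorted=$(sort "$work/notes.txt" | head -n 1)
echo "First line after sorting: $first_sorted"
```

After running this, notes.txt holds two lines ("second" and "third"), which shows why > should be used with care on files you want to keep.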
    printf 'sample_id\treads\n' > counts.tsv    # bash's echo prints \t literally without -e
    # Decompress before counting: wc -l on a .gz file counts newlines in compressed data
    for f in data/*.fastq.gz; do
      printf '%s\t%s\n' "$f" "$(( $(zcat "$f" | wc -l) / 4 ))"
    done >> counts.tsv
    zcat data/reads.fastq.gz \
      | paste - - - - \
      | cut -f2 \
      | awk '{sum+=length($0)} END {print "bases\t" sum}'
    date | tee run.log

Stdin redirection & globs
    sort < data/unsorted.list > sorted.list
    ls data/*.fastq.gz
    rm data/sample??.tmp

Variables and quoting
    x=42
    echo "$x"
    samples="S1 S2 S3"
    for s in $samples; do echo "$s"; done    # $samples is left unquoted on purpose so it splits into words
    export PATH="$HOME/bin:$PATH"
    pattern="^@"
    grep "$pattern" data/reads.fastq
    date_str=$(date +%Y-%m-%d)
    echo "Today is $date_str"

Permissions (quick peek)
    ls -l file
    chmod u+x script.sh
    chmod g+r data.tsv
    chmod o-rwx secret.txt

Processes & resources
    nproc
    free -h
    df -h
    ps -u "$USER"
    top           # q to quit
    kill 12345    # replace 12345 with the process ID (PID)

Remote & transfer
    ssh user@server.example.org
    scp localfile user@server:~/
    rsync -avP data/ user@server:/data/project/

Mini-exercises
1. Count reads (4 lines/record):
    zcat data/reads.fastq.gz | paste - - - - | wc -l
2. Largest 10 FASTQs:
    find . -name "*.fastq.gz" -printf '%s\t%p\n' | sort -n | tail -n 10
3. Extract sample IDs from *_R1.fastq.gz:
    ls data/*_R1.fastq.gz | sed -e 's|.*/||' -e 's/_R1\.fastq\.gz$//' | sort -u
4. Find "ERROR" in logs and save:
    grep -Rin "ERROR" data/logs/ | tee errors.txt

Cheat sheet
Navigation: pwd, ls -lah, cd, mkdir, cp, mv, rm -r
Inspect: cat, less, head, tail -f, wc -l
Find & filter: find, grep -Rin, cut, sort, uniq, awk, sed
Help: --help, man
Globs: *, ?, [ACGT]
Redirect/pipe: >, >>, |, tee, <
Vars: x=1, echo $x, export PATH=..., quotes, $()
Processes: nproc, free -h, df -h, ps, top, kill
Remote: ssh, scp, rsync -avP

Conditions in Bash
    if [ -f "data/reads.fastq.gz" ]; then
      echo "FASTQ present"
    else
      echo "File missing!"
    fi
• -f : file exists (and is a regular file)
• -d : directory exists
• -eq, -lt, -gt : numeric tests

Loops in Bash: for
    # Loop through sample IDs
    for s in S1 S2 S3; do
      echo "Processing $s"
    done

    # Loop through files
    for f in data/*.fastq.gz; do
      echo "File: $f"
    done

Loops in Bash: while
    n=3
    while [ "$n" -gt 0 ]; do
      echo "$n"
      n=$((n-1))
    done

Example: batch processing
    mkdir -p results    # fastqc expects the output directory to exist
    for f in data/*.fastq.gz; do
      if [ -s "$f" ]; then    # check file is not empty
        echo "QC on $f"
        fastqc "$f" -o results/
      else
        echo "Skipping $f (empty)"
      fi
    done

Conditions & Loops: Quick Reference
Conditions (tests)
    -f file          file exists
    -d dir           directory exists
    -s file          file exists and is not empty
    -eq, -lt, -gt    numeric comparisons
    ==, !=           string comparisons
Loops
    for var in list; do ...; done        iterate over list or files
    while [ condition ]; do ...; done    repeat until condition fails
Tip: Combine loops + conditions for batch processing (e.g. QC all FASTQs).

What is a shell script?
• A plain text file with a list of shell commands.
• Lets you automate workflows instead of retyping commands.
• Run as a program after making it executable.

Writing a Bash script
    #!/bin/bash
    # myscript.sh - count reads in all FASTQ files
    for f in data/*.fastq.gz; do
      echo "Counting reads in $f"
      n=$(zcat "$f" | wc -l)    # quote "$f" so paths with spaces work
      echo "$f has $((n/4)) reads"
    done

Running a Bash script
    # Make it executable (once)
    chmod +x myscript.sh
    # Run with relative or absolute path
    ./myscript.sh

• Always include the shebang line (#!/bin/bash) at the top.
• Can also run with bash myscript.sh.
• Keep scripts in $HOME/bin or in the project folder.

Best practices for scripts
• Start with #!/bin/bash (or #!/usr/bin/env bash).
• Comment your code with #.
• Use variables for filenames and parameters.
• Use loops + conditions to process many files.
• Store scripts under version control (git).
• Share with colleagues for reproducible workflows.
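The practices above can be combined in one small script. The following is a minimal sketch, not a definitive implementation. The script name count_reads.sh, the function count_reads, and the default directory data are illustrative assumptions. The set -euo pipefail line is a common Bash safety idiom added here and does not come from the slides. gzip -dc stands in for zcat; the two are equivalent for .gz files.

```shell
#!/usr/bin/env bash
# count_reads.sh (hypothetical name): report the number of reads per FASTQ file.
set -euo pipefail   # stop on errors, unset variables, and failed pipeline stages

# count_reads DIR: print "<file><TAB><reads>" for each non-empty *.fastq.gz in DIR.
count_reads() {
    local dir="$1"
    local f n
    for f in "$dir"/*.fastq.gz; do
        # -s is true only if the file exists and is not empty; this also
        # skips the unexpanded pattern when no file matches.
        if [ -s "$f" ]; then
            n=$(gzip -dc "$f" | wc -l)            # total lines after decompression
            printf '%s\t%d\n' "$f" "$((n / 4))"   # 4 lines per FASTQ record
        else
            echo "Skipping $f (missing or empty)" >&2
        fi
    done
}

# Parameters live in variables; the first argument overrides the default.
input_dir="${1:-data}"
count_reads "$input_dir"
```

Saved as count_reads.sh and made executable with chmod +x, it could be run as ./count_reads.sh data > counts.tsv. Skipped files are reported on stderr, so they do not end up in the table.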