
Practical Bioinformatics Pipelines

This page demonstrates real-world bioinformatics workflows that combine everything you've learned: navigation, text processing, variables, conditionals, loops, functions, and error handling.

The best way to learn shell scripting is by doing. Start with these examples, modify them for your data, and build from there. Every bioinformatician has a collection of trusted scripts.

Quality Control Pipeline

Batch process FASTQ files with quality checks and reporting:

fastq_qc_pipeline.sh

  1  #!/bin/bash
  2  # Quality control pipeline for FASTQ files
  3  # Usage: ./fastq_qc_pipeline.sh <output_dir>
  4
  5  set -e -u -o pipefail
  6
  7  # Configuration
  8  MIN_READS=100000
  9  MIN_QUALITY=30
 10  OUTPUT_DIR=${1:-qc_results}
 11
 12  # Functions
 13  log() {
 14      echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"
 15  }
 16
 17  die() {
 18      echo "ERROR: $*" >&2
 19      exit 1
 20  }
 21
 22  validate_fastq() {
 23      local fastq=$1
 24      [ -f "$fastq" ] || return 1
 25      [ -s "$fastq" ] || return 1
 26
 27      local lines=$(wc -l < "$fastq")
 28      [ $((lines % 4)) -eq 0 ] || return 1
 29      return 0
 30  }
 31
 32  get_read_count() {
 33      local fastq=$1
 34      local lines=$(wc -l < "$fastq")
 35      echo $((lines / 4))
 36  }
 37
 38  estimate_quality() {
 39      local fastq=$1
 40      # Average quality over the first 10000 reads (40000 lines)
 41      awk 'NR % 4 == 0 {
 42          for(i=1; i<=length($0); i++) {
 43              q = substr($0, i, 1)
 44              # Convert ASCII to Phred score (Phred+33)
 45              score = (index("!\"#$%&'"'"'()*+,-./0123456789:;<=>?@ABCDEFGHIJ", q) - 1)
 46              sum += score
 47              count++
 48          }
 49      }
 50      END {
 51          if(count > 0) print int(sum/count)
 52          else print 0
 53      }' <(head -n 40000 "$fastq")
 54  }
 55
 56  # Main pipeline
 57  log "Starting QC pipeline"
 58  log "Output directory: $OUTPUT_DIR"
 59
 60  # Create output directory
 61  mkdir -p "$OUTPUT_DIR"
 62
 63  # Initialize report
 64  report="$OUTPUT_DIR/qc_report.txt"
 65  {
 66      echo "FASTQ Quality Control Report"
 67      echo "=============================="
 68      echo "Generated: $(date)"
 69      echo ""
 70  } > "$report"
 71
 72  # Process each FASTQ
 73  total=0
 74  passed=0
 75  failed=0
 76
 77  for fastq in *.fastq *.fq; do
 78      [ -f "$fastq" ] || continue   # an unmatched glob stays literal; skip it
 79
 80      total=$((total + 1))
 81      log "Processing $fastq..."
 82
 83      # Validate format
 84      if ! validate_fastq "$fastq"; then
 85          log "  FAILED: Invalid FASTQ format"
 86          failed=$((failed + 1))
 87          echo "$fastq: FAILED - Invalid format" >> "$report"
 88          continue
 89      fi
 90
 91      # Count reads
 92      reads=$(get_read_count "$fastq")
 93      log "  Reads: $reads"
 94
 95      # Estimate quality
 96      avg_qual=$(estimate_quality "$fastq")
 97      log "  Average quality: Q$avg_qual"
 98
 99      # Check criteria
100      if [ "$reads" -lt "$MIN_READS" ]; then
101          log "  FAILED: Too few reads (< $MIN_READS)"
102          failed=$((failed + 1))
103          echo "$fastq: FAILED - $reads reads (< $MIN_READS required)" >> "$report"
104      elif [ "$avg_qual" -lt "$MIN_QUALITY" ]; then
105          log "  FAILED: Low quality (Q$avg_qual < Q$MIN_QUALITY)"
106          failed=$((failed + 1))
107          echo "$fastq: FAILED - Q$avg_qual (< Q$MIN_QUALITY required)" >> "$report"
108      else
109          log "  PASSED"
110          passed=$((passed + 1))
111          echo "$fastq: PASSED - $reads reads, Q$avg_qual" >> "$report"
112      fi
113  done
114
115  # Summary
116  {
117      echo ""
118      echo "Summary"
119      echo "-------"
120      echo "Total files: $total"
121      echo "Passed: $passed"
122      echo "Failed: $failed"
123  } >> "$report"
124
125  log "Summary: $passed/$total passed"
126  log "Report written to $report"
127  log "Pipeline complete"
Format Details
  Line 5 - Safety: Exit on any error
  Line 8 - Config: QC thresholds
  Line 13 - Logging: Timestamped messages
  Line 22 - Validation: Check FASTQ format
  Line 32 - Count reads: FASTQ has 4 lines/read
  Line 38 - Quality: Estimate average Phred score (see the snippet after this list)
  Line 63 - Report: Initialize output file
  Line 77 - Process: Loop through all FASTQ files
  Line 99 - Criteria: Check read count and quality
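
The Phred+33 scheme on line 45 maps each quality character to its ASCII code minus 33, which is exactly what the index() lookup computes. You can check a single character at the shell (a one-off snippet, not part of the pipeline):

q='I'
printf 'Q%d\n' $(( $(printf '%d' "'$q") - 33 ))   # prints Q40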
Input
./fastq_qc_pipeline.sh qc_results
Output
[2024-11-20 10:30:00] Starting QC pipeline
[2024-11-20 10:30:00] Output directory: qc_results
[2024-11-20 10:30:01] Processing Sample_01.fastq...
[2024-11-20 10:30:02]   Reads: 1234567
[2024-11-20 10:30:03]   Average quality: Q35
[2024-11-20 10:30:03]   PASSED
[2024-11-20 10:30:03] Summary: 5/5 passed
[2024-11-20 10:30:03] Report written to qc_results/qc_report.txt
[2024-11-20 10:30:03] Pipeline complete

Process all FASTQ files in the current directory and generate a QC report.

Sequence Extraction Pipeline

Extract sequences by ID list and generate statistics:

extract_sequences.sh

  1  #!/bin/bash
  2  # Extract sequences from FASTA by ID list
  3  # Usage: ./extract_sequences.sh <fasta> <id_list> <output>
  4
  5  set -e -u -o pipefail
  6
  7  # Validate arguments
  8  if [ $# -ne 3 ]; then
  9      echo "Usage: $0 <fasta_file> <id_list> <output_fasta>"
 10      echo ""
 11      echo "Extracts sequences matching IDs in id_list from fasta_file"
 12      echo "  fasta_file:   Input FASTA file"
 13      echo "  id_list:      Text file with one sequence ID per line"
 14      echo "  output_fasta: Output FASTA file"
 15      exit 1
 16  fi
 17
 18  fasta=$1
 19  id_list=$2
 20  output=$3
 21
 22  # Functions
 23  die() {
 24      echo "ERROR: $*" >&2
 25      exit 1
 26  }
 27
 28  log() {
 29      echo "[$(date '+%H:%M:%S')] $*"
 30  }
 31
 32  # Validate inputs
 33  [ -f "$fasta" ] || die "FASTA file not found: $fasta"
 34  [ -s "$fasta" ] || die "FASTA file is empty: $fasta"
 35  [ -f "$id_list" ] || die "ID list not found: $id_list"
 36  [ -s "$id_list" ] || die "ID list is empty: $id_list"
 37
 38  # Check FASTA format
 39  first_char=$(head -c 1 "$fasta")
 40  [ "$first_char" = ">" ] || die "Not FASTA format (must start with >)"
 41
 42  log "Starting extraction"
 43  log "Input: $fasta"
 44  log "IDs: $id_list"
 45  log "Output: $output"
 46
 47  # Count inputs
 48  total_seqs=$(grep -c "^>" "$fasta")
 49  total_ids=$(wc -l < "$id_list")
 50  log "Input FASTA has $total_seqs sequences"
 51  log "Searching for $total_ids IDs"
 52
 53  # Extract sequences
 54  > "$output"   # Create empty output file
 55  extracted=0
 56
 57  while read -r seq_id; do
 58      # Skip empty lines and comments
 59      [ -n "$seq_id" ] || continue
 60      [[ "$seq_id" =~ ^# ]] && continue
 61
 62      # Exact match at header start; -A 1 assumes single-line sequences
 63      if grep -A 1 "^>${seq_id}\b" "$fasta" >> "$output"; then
 64          extracted=$((extracted + 1))
 65          log "  Found: $seq_id"
 66      else
 67          log "  Missing: $seq_id"
 68      fi
 69  done < "$id_list"
 70
 71  # Generate stats (|| true because grep -c exits 1 when the count is 0)
 72  output_seqs=$(grep -c "^>" "$output" || true)
 73
 74  # Summary
 75  echo ""
 76  log "Extraction complete"
 77  log "Requested: $total_ids IDs"
 78  log "Extracted: $extracted sequences"
 79  log "Output file: $output ($output_seqs sequences)"
 80
 81  # Calculate average length
 82  if [ "$output_seqs" -gt 0 ]; then
 83      total_length=$(grep -v "^>" "$output" | tr -d '\n' | wc -c)
 84      avg_length=$((total_length / output_seqs))
 85      log "Average length: $avg_length bp"
 86  fi
 87
 88  # Exit with error if nothing extracted
 89  if [ "$extracted" -eq 0 ]; then
 90      die "No sequences extracted!"
 91  fi
 92
 93  log "Success"
Format Details
  Line 8 - Usage: Help message with argument descriptions
  Line 33 - Validation: Check all inputs exist
  Line 39 - Format check: Verify FASTA format
  Line 48 - Count: Report input sizes
  Line 57 - Extract loop: Process each ID
  Line 59 - Skip comments: Allow # comments in ID list
  Line 63 - Extract: Find header and next line (sequence); see the note after this list
  Line 83 - Stats: Calculate average length
  Line 89 - Verify: Fail if nothing found
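
The grep -A 1 trick on line 63 only works when each sequence occupies a single line. For wrapped (multi-line) FASTA, an awk version is safer; a sketch with the ID hard-coded and an illustrative filename:

awk -v id="AT1G01010" '
    /^>/ { keep = ($0 ~ ("^>" id "( |$)")) }
    keep
' input.fasta

The first rule turns printing on at a matching header and off at the next header; the bare keep pattern prints every line while the flag is set.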

Extract Genes of Interest

cat target_genes.txt
Output
AT1G01010
AT1G01020
AT2G01030
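
With those IDs saved in target_genes.txt, the extraction is a single call (FASTA and output filenames are illustrative):

./extract_sequences.sh annotations.fasta target_genes.txt targets.fasta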

Paired-End Read Processing

Handle R1/R2 paired files with validation:

process_paired_end.sh

  1  #!/bin/bash
  2  # Process paired-end reads
  3  # Usage: ./process_paired_end.sh [options]
  4
  5  set -e -u -o pipefail
  6
  7  # Configuration
  8  MIN_READS=50000
  9  OUTPUT_DIR="processed"
 10
 11  # Functions
 12  log() {
 13      echo "[$(date '+%H:%M:%S')] $*"
 14  }
 15
 16  die() {
 17      echo "ERROR: $*" >&2
 18      exit 1
 19  }
 20
 21  validate_pair() {
 22      local r1=$1
 23      local r2=$2
 24
 25      # Both files must exist
 26      [ -f "$r1" ] || return 1
 27      [ -f "$r2" ] || return 1
 28
 29      # Must have same line count
 30      local r1_lines=$(wc -l < "$r1")
 31      local r2_lines=$(wc -l < "$r2")
 32
 33      if [ "$r1_lines" -ne "$r2_lines" ]; then
 34          log "  WARNING: Read count mismatch"
 35          log "    $r1: $r1_lines lines"
 36          log "    $r2: $r2_lines lines"
 37          return 1
 38      fi
 39
 40      return 0
 41  }
 42
 43  process_pair() {
 44      local sample=$1
 45      local r1=$2
 46      local r2=$3
 47
 48      log "Processing $sample"
 49
 50      # Validate
 51      if ! validate_pair "$r1" "$r2"; then
 52          log "  SKIPPED: Validation failed"
 53          return 1
 54      fi
 55
 56      # Count reads
 57      local reads=$(($(wc -l < "$r1") / 4))
 58      log "  Reads: $reads"
 59
 60      # Check minimum
 61      if [ "$reads" -lt "$MIN_READS" ]; then
 62          log "  SKIPPED: Too few reads (< $MIN_READS)"
 63          return 1
 64      fi
 65
 66      # Create sample output directory
 67      local sample_dir="$OUTPUT_DIR/$sample"
 68      mkdir -p "$sample_dir"
 69
 70      # Example filter only. Note: filtering R1 and R2 separately can
 71      # desynchronize pairs; real pipelines filter both mates together.
 72      log "  Quality filtering..."
 73      # Keep reads longer than 20 bp (the quality line is as long as the read)
 74      awk 'NR % 4 == 0 && length($0) > 20 {
 75          # Print the stored header, seq, and + lines, then this quality line
 76          print header
 77          print seq
 78          print plus
 79          print $0
 80      }
 81      NR % 4 == 1 {header = $0}
 82      NR % 4 == 2 {seq = $0}
 83      NR % 4 == 3 {plus = $0}
 84      ' "$r1" > "$sample_dir/${sample}_R1_filtered.fastq"
 85
 86      awk 'NR % 4 == 0 && length($0) > 20 {
 87          print header
 88          print seq
 89          print plus
 90          print $0
 91      }
 92      NR % 4 == 1 {header = $0}
 93      NR % 4 == 2 {seq = $0}
 94      NR % 4 == 3 {plus = $0}
 95      ' "$r2" > "$sample_dir/${sample}_R2_filtered.fastq"
 96
 97      # Count filtered reads
 98      local filtered=$(($(wc -l < "$sample_dir/${sample}_R1_filtered.fastq") / 4))
 99      local percent=$((100 * filtered / reads))
100
101      log "  Filtered: $filtered reads ($percent%)"
102      log "  Output: $sample_dir/"
103
104      return 0
105  }
106
107  # Main pipeline
108  log "Starting paired-end processing pipeline"
109  mkdir -p "$OUTPUT_DIR"
110
111  # Process all R1 files
112  total=0
113  processed=0
114  skipped=0
115
116  for r1 in *_R1.fastq *_R1.fq; do
117      [ -f "$r1" ] || continue   # an unmatched glob stays literal; skip it
118
119      total=$((total + 1))
120
121      # Derive R2 filename
122      r2=${r1/_R1./_R2.}
123
124      # Derive sample name
125      sample=${r1/_R1.fastq/}
126      sample=${sample/_R1.fq/}
127
128      # Process the pair
129      if process_pair "$sample" "$r1" "$r2"; then
130          processed=$((processed + 1))
131      else
132          skipped=$((skipped + 1))
133      fi
134
135      echo ""
136  done
137
138  # Summary
139  log "Pipeline complete"
140  log "Total pairs: $total"
141  log "Processed: $processed"
142  log "Skipped: $skipped"
143  log "Output directory: $OUTPUT_DIR"
Format Details
  Line 21 - Validate pair: Check both exist and match
  Line 30 - Count check: R1 and R2 must have same read count
  Line 43 - Process pair: Main processing function
  Line 74 - Quality filter: Example processing step
  Line 116 - Main loop: Find and process all pairs
  Line 122 - Derive names: Calculate R2 and sample from R1 (see the expansion examples below)
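
The ${variable/pattern/replacement} substitutions on lines 122-126 are plain bash parameter expansion, easy to test interactively (illustrative filename):

r1=Sample_01_R1.fastq
echo "${r1/_R1./_R2.}"       # Sample_01_R2.fastq
echo "${r1/_R1.fastq/}"      # Sample_01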

Batch Annotation Pipeline

Extract features from GFF and generate summary:

annotate_batch.sh

  1  #!/bin/bash
  2  # Batch annotation from GFF files
  3  # Usage: ./annotate_batch.sh <gff_dir> <output_dir>
  4
  5  set -e -u -o pipefail
  6
  7  # Arguments
  8  gff_dir=${1:-.}
  9  output_dir=${2:-annotations}
 10
 11  # Functions
 12  log() {
 13      echo "[$(date '+%H:%M:%S')] $*"
 14  }
 15
 16  extract_genes() {
 17      local gff=$1
 18      local output=$2
 19
 20      awk '$3 == "gene" {
 21          # Extract gene ID (three-argument match() is GNU awk only)
 22          match($9, /ID=([^;]+)/, arr)
 23          gene_id = arr[1]
 24
 25          # Calculate length ("length" is an awk built-in, so use "len")
 26          len = $5 - $4 + 1
 27
 28          # Print tab-separated output
 29          print gene_id, $1, $4, $5, len, $7
 30      }' OFS="\t" "$gff" > "$output"
 31  }
 32
 33  count_features() {
 34      local gff=$1
 35
 36      echo "Feature Type Counts:"
 37      awk '!/^#/ {count[$3]++}   # skip comment/directive lines
 38      END {
 39          for (type in count)
 40              print type, count[type]
 41      }' OFS="\t" "$gff" | sort -k2 -nr
 42  }
 43
 44  # Main pipeline
 45  log "Starting annotation pipeline"
 46  log "GFF directory: $gff_dir"
 47  log "Output directory: $output_dir"
 48
 49  mkdir -p "$output_dir"
 50
 51  # Summary file
 52  summary="$output_dir/summary.txt"
 53  {
 54      echo "Annotation Summary"
 55      echo "=================="
 56      echo "Generated: $(date)"
 57      echo ""
 58  } > "$summary"
 59
 60  # Process each GFF
 61  total_files=0
 62  total_genes=0
 63
 64  for gff in "$gff_dir"/*.gff "$gff_dir"/*.gff3; do
 65      [ -f "$gff" ] || continue   # an unmatched glob stays literal; skip it
 66
 67      total_files=$((total_files + 1))
 68      basename=$(basename "$gff" .gff)
 69      basename=${basename%.gff3}
 70
 71      log "Processing $basename..."
 72
 73      # Extract genes
 74      gene_list="$output_dir/${basename}_genes.txt"
 75      extract_genes "$gff" "$gene_list"
 76
 77      # Count
 78      gene_count=$(wc -l < "$gene_list")
 79      total_genes=$((total_genes + gene_count))
 80      log "  Extracted $gene_count genes"
 81
 82      # Add to summary
 83      {
 84          echo "$basename"
 85          echo "---"
 86          echo "Genes: $gene_count"
 87          count_features "$gff"
 88          echo ""
 89      } >> "$summary"
 90  done
 91
 92  # Final summary
 93  {
 94      echo "Total Files: $total_files"
 95      echo "Total Genes: $total_genes"
 96  } >> "$summary"
 97
 98  log "Pipeline complete"
 99  log "Processed $total_files files, $total_genes genes total"
100  log "Summary: $summary"
Format Details
  Line 16 - Extract genes: Parse GFF for gene features (see the portability note after this list)
  Line 33 - Count features: Summarize feature types
  Line 64 - Process GFFs: Loop through all GFF files
  Line 74 - Extract: Get genes from this GFF
  Line 83 - Summarize: Add stats to report
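
extract_genes depends on GNU awk's three-argument match(). Where only POSIX awk is available (for example BSD/macOS awk), the ID can be carved out with two sub() calls instead; a sketch, assuming a single ID= key per attribute column and an illustrative filename:

awk -F'\t' '$3 == "gene" {
    id = $9
    sub(/.*ID=/, "", id)   # drop everything up to and including ID=
    sub(/;.*/, "", id)     # drop everything after the first semicolon
    print id, $1, $4, $5, ($5 - $4 + 1), $7
}' OFS="\t" input.gff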

Best Practices Demonstrated

These pipelines show professional bioinformatics scripting:

Production Pipeline Checklist

✓ Safety flags - set -e -u -o pipefail
✓ Input validation - Check files exist and have correct format
✓ Logging - Timestamped progress messages
✓ Error handling - Graceful failures with helpful messages
✓ Progress tracking - Show what's happening
✓ Result validation - Verify outputs make sense
✓ Summary reporting - Generate human-readable reports
✓ Documentation - Comments and usage instructions
✓ Reusable functions - DRY (Don't Repeat Yourself)
✓ Configurable - Parameters at top, not hard-coded throughout

Common Workflow Patterns

Pattern 1: Validate → Process → Report

#!/bin/bash
set -e -u -o pipefail

# Validate all inputs first
validate_inputs

# Process each item
for item in "${items[@]}"; do
    process_item "$item"
done

# Generate summary report
create_report

Pattern 2: Parallel-Safe Processing

#!/bin/bash
set -e -u -o pipefail

# Process files independently (safe for parallel)
for file in *.fastq; do
    output="${file%.fastq}_processed.fastq"
    # Each iteration is independent
    process_file "$file" > "$output"
done
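
Because each iteration is independent, the same loop can launch its jobs in the background; a minimal sketch (process_file stands in for your real command):

#!/bin/bash
set -e -u -o pipefail

for file in *.fastq; do
    output="${file%.fastq}_processed.fastq"
    process_file "$file" > "$output" &   # launch in background
done
wait   # block until every background job finishes
# Caveat: a plain wait does not report background-job failures;
# validate the outputs afterwards (or use a tool like GNU parallel).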

Pattern 3: Checkpoint and Resume

#!/bin/bash
set -e -u -o pipefail

while read -r sample; do
    output="results/${sample}_done.txt"

    # Skip if already processed
    if [ -f "$output" ]; then
        log "Skipping $sample (already done)"
        continue
    fi

    # Process
    process_sample "$sample"

    # Mark complete
    touch "$output"
done < samples.txt

Deployment Checklist

Before running a pipeline on real data:

Pre-Flight Checklist
  1. Test on small subset - Verify logic with 2-3 files
  2. Check disk space - Ensure enough space for outputs
  3. Verify dependencies - All required tools available
  4. Set resource limits - Don't crash the server
  5. Plan for failures - What if it dies halfway through?
  6. Document runtime - How long should it take?
  7. Validate outputs - Spot check results make sense
  8. Keep logs - Redirect stdout/stderr to log files (see the example after this list)
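
Two common ways to satisfy item 8 (log filename is illustrative):

./fastq_qc_pipeline.sh > pipeline.log 2>&1        # everything into the log
./fastq_qc_pipeline.sh 2>&1 | tee pipeline.log    # log file plus live terminal output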

Next Steps

You've now seen how all the pieces fit together:

  • Terminal navigation and file management
  • Text processing with grep, sed, awk
  • Variables and control flow
  • Scripts and functions
  • Error handling and debugging
  • Production-ready pipelines

Continue Learning

  • Practice with your data - Adapt these scripts to your projects
  • Build a script library - Save and organize useful scripts
  • Share with lab - Help others automate their workflows
  • Keep learning - Explore advanced topics like parallel processing

Additional Topics to Explore

  • Version control - Track script changes with git
  • Remote computing - SSH and cluster computing
  • Containers - Docker for reproducible environments
  • Workflow managers - Snakemake, Nextflow for complex pipelines
  • Performance optimization - Profiling and speeding up scripts

Congratulations! You've completed the Shell Scripting section. You have all the tools needed to automate bioinformatics workflows. The best way to solidify this knowledge is to start writing scripts for your own research projects.