
Practical Bioinformatics Pipelines

This page demonstrates real-world bioinformatics workflows that combine everything you've learned: navigation, text processing, variables, conditionals, loops, functions, and error handling.

The best way to learn shell scripting is by doing. Start with these examples, modify them for your data, and build from there. Every bioinformatician has a collection of trusted scripts.

Quality Control Pipeline

Batch process FASTQ files with quality checks and reporting:

fastq_qc_pipeline.sh

  1  #!/bin/bash
  2  # Quality control pipeline for FASTQ files
  3  # Usage: ./fastq_qc_pipeline.sh <output_dir>
  4
  5  set -e -u -o pipefail
  6
  7  # Configuration
  8  MIN_READS=100000
  9  MIN_QUALITY=30
 10  OUTPUT_DIR=${1:-qc_results}
 11
 12  # Functions
 13  log() {
 14      echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"
 15  }
 16
 17  die() {
 18      echo "ERROR: $*" >&2
 19      exit 1
 20  }
 21
 22  validate_fastq() {
 23      local fastq=$1
 24      [ -f "$fastq" ] || return 1
 25      [ -s "$fastq" ] || return 1
 26
 27      local lines=$(wc -l < "$fastq")
 28      [ $((lines % 4)) -eq 0 ] || return 1
 29      return 0
 30  }
 31
 32  get_read_count() {
 33      local fastq=$1
 34      local lines=$(wc -l < "$fastq")
 35      echo $((lines / 4))
 36  }
 37
 38  estimate_quality() {
 39      local fastq=$1
 40      # Average quality over the first 10000 reads (40000 lines)
 41      awk 'NR % 4 == 0 {
 42          for(i=1; i<=length($0); i++) {
 43              q = substr($0, i, 1)
 44              # Convert ASCII to Phred score (Phred+33)
 45              score = (index("!\"#$%&'"'"'()*+,-./0123456789:;<=>?@ABCDEFGHIJ", q) - 1)
 46              sum += score
 47              count++
 48          }
 49      }
 50      END {
 51          if(count > 0) print int(sum/count)
 52          else print 0
 53      }' <(head -n 40000 "$fastq")
 54  }
 55
 56  # Main pipeline
 57  log "Starting QC pipeline"
 58  log "Output directory: $OUTPUT_DIR"
 59
 60  # Create output directory
 61  mkdir -p "$OUTPUT_DIR"
 62
 63  # Initialize report
 64  report="$OUTPUT_DIR/qc_report.txt"
 65  {
 66      echo "FASTQ Quality Control Report"
 67      echo "=============================="
 68      echo "Generated: $(date)"
 69      echo ""
 70  } > "$report"
 71
 72  # Process each FASTQ
 73  total=0
 74  passed=0
 75  failed=0
 76
 77  for fastq in *.fastq *.fq; do
 78      [ -f "$fastq" ] || continue   # an unmatched glob stays literal; skip it
 79
 80      total=$((total + 1))
 81      log "Processing $fastq..."
 82
 83      # Validate format
 84      if ! validate_fastq "$fastq"; then
 85          log "  FAILED: Invalid FASTQ format"
 86          failed=$((failed + 1))
 87          echo "$fastq: FAILED - Invalid format" >> "$report"
 88          continue
 89      fi
 90
 91      # Count reads
 92      reads=$(get_read_count "$fastq")
 93      log "  Reads: $reads"
 94
 95      # Estimate quality
 96      avg_qual=$(estimate_quality "$fastq")
 97      log "  Average quality: Q$avg_qual"
 98
 99      # Check criteria
100      if [ "$reads" -lt "$MIN_READS" ]; then
101          log "  FAILED: Too few reads (< $MIN_READS)"
102          failed=$((failed + 1))
103          echo "$fastq: FAILED - $reads reads (< $MIN_READS required)" >> "$report"
104      elif [ "$avg_qual" -lt "$MIN_QUALITY" ]; then
105          log "  FAILED: Low quality (Q$avg_qual < Q$MIN_QUALITY)"
106          failed=$((failed + 1))
107          echo "$fastq: FAILED - Q$avg_qual (< Q$MIN_QUALITY required)" >> "$report"
108      else
109          log "  PASSED"
110          passed=$((passed + 1))
111          echo "$fastq: PASSED - $reads reads, Q$avg_qual" >> "$report"
112      fi
113  done
114
115  # Summary
116  {
117      echo ""
118      echo "Summary"
119      echo "-------"
120      echo "Total files: $total"
121      echo "Passed: $passed"
122      echo "Failed: $failed"
123  } >> "$report"
124
125  log "Summary: $passed/$total passed"
126  log "Report written to $report"
127  log "Pipeline complete"
Format Details
  Line 5 - Safety: Exit on any error
  Line 8 - Config: QC thresholds
  Line 13 - Logging: Timestamped messages
  Line 22 - Validation: Check FASTQ format
  Line 32 - Count reads: FASTQ has 4 lines/read
  Line 38 - Quality: Estimate average Phred score (see the snippet after this list)
  Line 63 - Report: Initialize output file
  Line 77 - Process: Loop through all FASTQ files
  Line 99 - Criteria: Check read count and quality
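
The Phred+33 scheme on line 45 maps each quality character to its ASCII code minus 33, which is exactly what the index() lookup computes. You can check a single character at the shell (a one-off snippet, not part of the pipeline):

q='I'
printf 'Q%d\n' $(( $(printf '%d' "'$q") - 33 ))   # prints Q40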
Input
./fastq_qc_pipeline.sh qc_results
Output
[2024-11-20 10:30:00] Starting QC pipeline
[2024-11-20 10:30:00] Output directory: qc_results
[2024-11-20 10:30:01] Processing Sample_01.fastq...
[2024-11-20 10:30:02]   Reads: 1234567
[2024-11-20 10:30:03]   Average quality: Q35
[2024-11-20 10:30:03]   PASSED
[2024-11-20 10:30:03] Summary: 5/5 passed
[2024-11-20 10:30:03] Report written to qc_results/qc_report.txt
[2024-11-20 10:30:03] Pipeline complete

Process all FASTQ files in the current directory and generate a QC report.

Sequence Extraction Pipeline

Extract sequences by ID list and generate statistics:

extract_sequences.sh

  1  #!/bin/bash
  2  # Extract sequences from FASTA by ID list
  3  # Usage: ./extract_sequences.sh <fasta> <id_list> <output>
  4
  5  set -e -u -o pipefail
  6
  7  # Validate arguments
  8  if [ $# -ne 3 ]; then
  9      echo "Usage: $0 <fasta_file> <id_list> <output_fasta>"
 10      echo ""
 11      echo "Extracts sequences matching IDs in id_list from fasta_file"
 12      echo "  fasta_file:   Input FASTA file"
 13      echo "  id_list:      Text file with one sequence ID per line"
 14      echo "  output_fasta: Output FASTA file"
 15      exit 1
 16  fi
 17
 18  fasta=$1
 19  id_list=$2
 20  output=$3
 21
 22  # Functions
 23  die() {
 24      echo "ERROR: $*" >&2
 25      exit 1
 26  }
 27
 28  log() {
 29      echo "[$(date '+%H:%M:%S')] $*"
 30  }
 31
 32  # Validate inputs
 33  [ -f "$fasta" ] || die "FASTA file not found: $fasta"
 34  [ -s "$fasta" ] || die "FASTA file is empty: $fasta"
 35  [ -f "$id_list" ] || die "ID list not found: $id_list"
 36  [ -s "$id_list" ] || die "ID list is empty: $id_list"
 37
 38  # Check FASTA format
 39  first_char=$(head -c 1 "$fasta")
 40  [ "$first_char" = ">" ] || die "Not FASTA format (must start with >)"
 41
 42  log "Starting extraction"
 43  log "Input: $fasta"
 44  log "IDs: $id_list"
 45  log "Output: $output"
 46
 47  # Count inputs
 48  total_seqs=$(grep -c "^>" "$fasta")
 49  total_ids=$(wc -l < "$id_list")
 50  log "Input FASTA has $total_seqs sequences"
 51  log "Searching for $total_ids IDs"
 52
 53  # Extract sequences
 54  > "$output"   # Create empty output file
 55  extracted=0
 56
 57  while read -r seq_id; do
 58      # Skip empty lines and comments
 59      [ -n "$seq_id" ] || continue
 60      [[ "$seq_id" =~ ^# ]] && continue
 61
 62      # Exact match at header start; -A 1 assumes single-line sequences
 63      if grep -A 1 "^>${seq_id}\b" "$fasta" >> "$output"; then
 64          extracted=$((extracted + 1))
 65          log "  Found: $seq_id"
 66      else
 67          log "  Missing: $seq_id"
 68      fi
 69  done < "$id_list"
 70
 71  # Generate stats (|| true because grep -c exits 1 when the count is 0)
 72  output_seqs=$(grep -c "^>" "$output" || true)
 73
 74  # Summary
 75  echo ""
 76  log "Extraction complete"
 77  log "Requested: $total_ids IDs"
 78  log "Extracted: $extracted sequences"
 79  log "Output file: $output ($output_seqs sequences)"
 80
 81  # Calculate average length
 82  if [ "$output_seqs" -gt 0 ]; then
 83      total_length=$(grep -v "^>" "$output" | tr -d '\n' | wc -c)
 84      avg_length=$((total_length / output_seqs))
 85      log "Average length: $avg_length bp"
 86  fi
 87
 88  # Exit with error if nothing extracted
 89  if [ "$extracted" -eq 0 ]; then
 90      die "No sequences extracted!"
 91  fi
 92
 93  log "Success"
Format Details
  Line 8 - Usage: Help message with argument descriptions
  Line 33 - Validation: Check all inputs exist
  Line 39 - Format check: Verify FASTA format
  Line 48 - Count: Report input sizes
  Line 57 - Extract loop: Process each ID
  Line 59 - Skip comments: Allow # comments in ID list
  Line 63 - Extract: Find header and next line (sequence); see the note after this list
  Line 83 - Stats: Calculate average length
  Line 89 - Verify: Fail if nothing found
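
The grep -A 1 trick on line 63 only works when each sequence occupies a single line. For wrapped (multi-line) FASTA, an awk version is safer; a sketch with the ID hard-coded and an illustrative filename:

awk -v id="AT1G01010" '
    /^>/ { keep = ($0 ~ ("^>" id "( |$)")) }
    keep
' input.fasta

The first rule turns printing on at a matching header and off at the next header; the bare keep pattern prints every line while the flag is set.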

Extract Genes of Interest

cat target_genes.txt
Output
AT1G01010
AT1G01020
AT2G01030
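
With those IDs saved in target_genes.txt, the extraction is a single call (FASTA and output filenames are illustrative):

./extract_sequences.sh annotations.fasta target_genes.txt targets.fasta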

Paired-End Read Processing

Handle R1/R2 paired files with validation:

process_paired_end.sh

  1  #!/bin/bash
  2  # Process paired-end reads
  3  # Usage: ./process_paired_end.sh [options]
  4
  5  set -e -u -o pipefail
  6
  7  # Configuration
  8  MIN_READS=50000
  9  OUTPUT_DIR="processed"
 10
 11  # Functions
 12  log() {
 13      echo "[$(date '+%H:%M:%S')] $*"
 14  }
 15
 16  die() {
 17      echo "ERROR: $*" >&2
 18      exit 1
 19  }
 20
 21  validate_pair() {
 22      local r1=$1
 23      local r2=$2
 24
 25      # Both files must exist
 26      [ -f "$r1" ] || return 1
 27      [ -f "$r2" ] || return 1
 28
 29      # Must have same line count
 30      local r1_lines=$(wc -l < "$r1")
 31      local r2_lines=$(wc -l < "$r2")
 32
 33      if [ "$r1_lines" -ne "$r2_lines" ]; then
 34          log "  WARNING: Read count mismatch"
 35          log "    $r1: $r1_lines lines"
 36          log "    $r2: $r2_lines lines"
 37          return 1
 38      fi
 39
 40      return 0
 41  }
 42
 43  process_pair() {
 44      local sample=$1
 45      local r1=$2
 46      local r2=$3
 47
 48      log "Processing $sample"
 49
 50      # Validate
 51      if ! validate_pair "$r1" "$r2"; then
 52          log "  SKIPPED: Validation failed"
 53          return 1
 54      fi
 55
 56      # Count reads
 57      local reads=$(($(wc -l < "$r1") / 4))
 58      log "  Reads: $reads"
 59
 60      # Check minimum
 61      if [ "$reads" -lt "$MIN_READS" ]; then
 62          log "  SKIPPED: Too few reads (< $MIN_READS)"
 63          return 1
 64      fi
 65
 66      # Create sample output directory
 67      local sample_dir="$OUTPUT_DIR/$sample"
 68      mkdir -p "$sample_dir"
 69
 70      # Example filter only. Note: filtering R1 and R2 separately can
 71      # desynchronize pairs; real pipelines filter both mates together.
 72      log "  Quality filtering..."
 73      # Keep reads longer than 20 bp (the quality line is as long as the read)
 74      awk 'NR % 4 == 0 && length($0) > 20 {
 75          # Print the stored header, seq, and + lines, then this quality line
 76          print header
 77          print seq
 78          print plus
 79          print $0
 80      }
 81      NR % 4 == 1 {header = $0}
 82      NR % 4 == 2 {seq = $0}
 83      NR % 4 == 3 {plus = $0}
 84      ' "$r1" > "$sample_dir/${sample}_R1_filtered.fastq"
 85
 86      awk 'NR % 4 == 0 && length($0) > 20 {
 87          print header
 88          print seq
 89          print plus
 90          print $0
 91      }
 92      NR % 4 == 1 {header = $0}
 93      NR % 4 == 2 {seq = $0}
 94      NR % 4 == 3 {plus = $0}
 95      ' "$r2" > "$sample_dir/${sample}_R2_filtered.fastq"
 96
 97      # Count filtered reads
 98      local filtered=$(($(wc -l < "$sample_dir/${sample}_R1_filtered.fastq") / 4))
 99      local percent=$((100 * filtered / reads))
100
101      log "  Filtered: $filtered reads ($percent%)"
102      log "  Output: $sample_dir/"
103
104      return 0
105  }
106
107  # Main pipeline
108  log "Starting paired-end processing pipeline"
109  mkdir -p "$OUTPUT_DIR"
110
111  # Process all R1 files
112  total=0
113  processed=0
114  skipped=0
115
116  for r1 in *_R1.fastq *_R1.fq; do
117      [ -f "$r1" ] || continue   # an unmatched glob stays literal; skip it
118
119      total=$((total + 1))
120
121      # Derive R2 filename
122      r2=${r1/_R1./_R2.}
123
124      # Derive sample name
125      sample=${r1/_R1.fastq/}
126      sample=${sample/_R1.fq/}
127
128      # Process the pair
129      if process_pair "$sample" "$r1" "$r2"; then
130          processed=$((processed + 1))
131      else
132          skipped=$((skipped + 1))
133      fi
134
135      echo ""
136  done
137
138  # Summary
139  log "Pipeline complete"
140  log "Total pairs: $total"
141  log "Processed: $processed"
142  log "Skipped: $skipped"
143  log "Output directory: $OUTPUT_DIR"
Format Details
  Line 21 - Validate pair: Check both exist and match
  Line 30 - Count check: R1 and R2 must have same read count
  Line 43 - Process pair: Main processing function
  Line 74 - Quality filter: Example processing step
  Line 116 - Main loop: Find and process all pairs
  Line 122 - Derive names: Calculate R2 and sample from R1 (see the expansion examples below)
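
The ${variable/pattern/replacement} substitutions on lines 122-126 are plain bash parameter expansion, easy to test interactively (illustrative filename):

r1=Sample_01_R1.fastq
echo "${r1/_R1./_R2.}"       # Sample_01_R2.fastq
echo "${r1/_R1.fastq/}"      # Sample_01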

Batch Annotation Pipeline

Extract features from GFF and generate summary:

annotate_batch.sh

  1  #!/bin/bash
  2  # Batch annotation from GFF files
  3  # Usage: ./annotate_batch.sh <gff_dir> <output_dir>
  4
  5  set -e -u -o pipefail
  6
  7  # Arguments
  8  gff_dir=${1:-.}
  9  output_dir=${2:-annotations}
 10
 11  # Functions
 12  log() {
 13      echo "[$(date '+%H:%M:%S')] $*"
 14  }
 15
 16  extract_genes() {
 17      local gff=$1
 18      local output=$2
 19
 20      awk '$3 == "gene" {
 21          # Extract gene ID (three-argument match() is GNU awk only)
 22          match($9, /ID=([^;]+)/, arr)
 23          gene_id = arr[1]
 24
 25          # Calculate length ("length" is an awk built-in, so use "len")
 26          len = $5 - $4 + 1
 27
 28          # Print tab-separated output
 29          print gene_id, $1, $4, $5, len, $7
 30      }' OFS="\t" "$gff" > "$output"
 31  }
 32
 33  count_features() {
 34      local gff=$1
 35
 36      echo "Feature Type Counts:"
 37      awk '!/^#/ {count[$3]++}   # skip comment/directive lines
 38      END {
 39          for (type in count)
 40              print type, count[type]
 41      }' OFS="\t" "$gff" | sort -k2 -nr
 42  }
 43
 44  # Main pipeline
 45  log "Starting annotation pipeline"
 46  log "GFF directory: $gff_dir"
 47  log "Output directory: $output_dir"
 48
 49  mkdir -p "$output_dir"
 50
 51  # Summary file
 52  summary="$output_dir/summary.txt"
 53  {
 54      echo "Annotation Summary"
 55      echo "=================="
 56      echo "Generated: $(date)"
 57      echo ""
 58  } > "$summary"
 59
 60  # Process each GFF
 61  total_files=0
 62  total_genes=0
 63
 64  for gff in "$gff_dir"/*.gff "$gff_dir"/*.gff3; do
 65      [ -f "$gff" ] || continue   # an unmatched glob stays literal; skip it
 66
 67      total_files=$((total_files + 1))
 68      basename=$(basename "$gff" .gff)
 69      basename=${basename%.gff3}
 70
 71      log "Processing $basename..."
 72
 73      # Extract genes
 74      gene_list="$output_dir/${basename}_genes.txt"
 75      extract_genes "$gff" "$gene_list"
 76
 77      # Count
 78      gene_count=$(wc -l < "$gene_list")
 79      total_genes=$((total_genes + gene_count))
 80      log "  Extracted $gene_count genes"
 81
 82      # Add to summary
 83      {
 84          echo "$basename"
 85          echo "---"
 86          echo "Genes: $gene_count"
 87          count_features "$gff"
 88          echo ""
 89      } >> "$summary"
 90  done
 91
 92  # Final summary
 93  {
 94      echo "Total Files: $total_files"
 95      echo "Total Genes: $total_genes"
 96  } >> "$summary"
 97
 98  log "Pipeline complete"
 99  log "Processed $total_files files, $total_genes genes total"
100  log "Summary: $summary"
Format Details
  Line 16 - Extract genes: Parse GFF for gene features (see the portability note after this list)
  Line 33 - Count features: Summarize feature types
  Line 64 - Process GFFs: Loop through all GFF files
  Line 74 - Extract: Get genes from this GFF
  Line 83 - Summarize: Add stats to report
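
extract_genes depends on GNU awk's three-argument match(). Where only POSIX awk is available (for example BSD/macOS awk), the ID can be carved out with two sub() calls instead; a sketch, assuming a single ID= key per attribute column and an illustrative filename:

awk -F'\t' '$3 == "gene" {
    id = $9
    sub(/.*ID=/, "", id)   # drop everything up to and including ID=
    sub(/;.*/, "", id)     # drop everything after the first semicolon
    print id, $1, $4, $5, ($5 - $4 + 1), $7
}' OFS="\t" input.gff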

Best Practices Demonstrated

These pipelines show professional bioinformatics scripting:

Production Pipeline Checklist

✓ Safety flags - set -e -u -o pipefail
✓ Input validation - Check files exist and have correct format
✓ Logging - Timestamped progress messages
✓ Error handling - Graceful failures with helpful messages
✓ Progress tracking - Show what's happening
✓ Result validation - Verify outputs make sense
✓ Summary reporting - Generate human-readable reports
✓ Documentation - Comments and usage instructions
✓ Reusable functions - DRY (Don't Repeat Yourself)
✓ Configurable - Parameters at top, not hard-coded throughout

Common Workflow Patterns

Pattern 1: Validate → Process → Report

#!/bin/bash
set -e -u -o pipefail

# Validate all inputs first
validate_inputs

# Process each item
for item in "${items[@]}"; do
    process_item "$item"
done

# Generate summary report
create_report

Pattern 2: Parallel-Safe Processing

#!/bin/bash
set -e -u -o pipefail

# Process files independently (safe for parallel)
for file in *.fastq; do
    output="${file%.fastq}_processed.fastq"
    # Each iteration is independent
    process_file "$file" > "$output"
done
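
Because each iteration is independent, the same loop can launch its jobs in the background; a minimal sketch (process_file stands in for your real command):

#!/bin/bash
set -e -u -o pipefail

for file in *.fastq; do
    output="${file%.fastq}_processed.fastq"
    process_file "$file" > "$output" &   # launch in background
done
wait   # block until every background job finishes
# Caveat: a plain wait does not report background-job failures;
# validate the outputs afterwards (or use a tool like GNU parallel).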

Pattern 3: Checkpoint and Resume

#!/bin/bash
set -e -u -o pipefail

while read -r sample; do
    output="results/${sample}_done.txt"

    # Skip if already processed
    if [ -f "$output" ]; then
        log "Skipping $sample (already done)"
        continue
    fi

    # Process
    process_sample "$sample"

    # Mark complete
    touch "$output"
done < samples.txt

Deployment Checklist

Before running a pipeline on real data:

Pre-Flight Checklist
  1. Test on small subset - Verify logic with 2-3 files
  2. Check disk space - Ensure enough space for outputs
  3. Verify dependencies - All required tools available
  4. Set resource limits - Don't crash the server
  5. Plan for failures - What if it dies halfway through?
  6. Document runtime - How long should it take?
  7. Validate outputs - Spot check results make sense
  8. Keep logs - Redirect stdout/stderr to log files (see the example after this list)
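
Two common ways to satisfy item 8 (log filename is illustrative):

./fastq_qc_pipeline.sh > pipeline.log 2>&1        # everything into the log
./fastq_qc_pipeline.sh 2>&1 | tee pipeline.log    # log file plus live terminal output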

Next Steps

You've now seen how all the pieces fit together:

  • Terminal navigation and file management
  • Text processing with grep, sed, awk
  • Variables and control flow
  • Scripts and functions
  • Error handling and debugging
  • Production-ready pipelines

Continue Learning

  • Practice with your data - Adapt these scripts to your projects
  • Build a script library - Save and organize useful scripts
  • Share with lab - Help others automate their workflows
  • Keep learning - Explore advanced topics like parallel processing

Additional Topics to Explore

  • Version control - Track script changes with git
  • Remote computing - SSH and cluster computing
  • Containers - Docker for reproducible environments
  • Workflow managers - Snakemake, Nextflow for complex pipelines
  • Performance optimization - Profiling and speeding up scripts

Congratulations! You've completed the Shell Scripting section. You have all the tools needed to automate bioinformatics workflows. The best way to solidify this knowledge is to start writing scripts for your own research projects.