Practical Bioinformatics Pipelines
This page demonstrates real-world bioinformatics workflows that combine everything you've learned: navigation, text processing, variables, conditionals, loops, functions, and error handling.
The best way to learn shell scripting is by doing. Start with these examples, modify them for your data, and build from there. Every bioinformatician has a collection of trusted scripts.
Quality Control Pipeline
Batch process FASTQ files with quality checks and reporting:
fastq_qc_pipeline.sh
./fastq_qc_pipeline.sh qc_results
[2024-11-20 10:30:00] Starting QC pipeline
[2024-11-20 10:30:00] Output directory: qc_results
[2024-11-20 10:30:01] Processing Sample_01.fastq...
[2024-11-20 10:30:02] Reads: 1234567
[2024-11-20 10:30:03] Average quality: Q35
[2024-11-20 10:30:03] PASSED
[2024-11-20 10:30:03] Summary: 5/5 passed
[2024-11-20 10:30:03] Report written to qc_results/qc_report.txt
[2024-11-20 10:30:03] Pipeline complete
Process all FASTQs and generate QC report.
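The full fastq_qc_pipeline.sh script is not reproduced here, so the following is only a minimal sketch of the per-file checks that could produce log lines like those above. It assumes uncompressed four-line FASTQ records and Phred+33 quality encoding; the function names (log, count_reads, mean_quality, qc_file) and the default threshold of Q20 are illustrative choices, not taken from the original script.

```shell
#!/bin/bash
set -e -u -o pipefail

# Print a timestamped message in the same format as the sample output.
log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"
}

# Count reads: an uncompressed FASTQ record is assumed to be exactly 4 lines.
count_reads() {
    local lines
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

# Mean Phred score over all bases (truncated to an integer).
# Assumption: qualities are Phred+33 encoded.
mean_quality() {
    awk 'BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
         NR % 4 == 0 {
             n = length($0)
             for (i = 1; i <= n; i++) { sum += ord[substr($0, i, 1)] - 33; total++ }
         }
         END { if (total > 0) printf "%d\n", sum / total; else print 0 }' "$1"
}

# Run QC on one file; returns non-zero if the file is empty or below threshold.
qc_file() {
    local fastq=$1 min_quality=${2:-20} reads quality
    log "Processing $(basename "$fastq")..."
    reads=$(count_reads "$fastq")
    quality=$(mean_quality "$fastq")
    log "Reads: $reads"
    log "Average quality: Q$quality"
    if [ "$reads" -gt 0 ] && [ "$quality" -ge "$min_quality" ]; then
        log "PASSED"
    else
        log "FAILED"
        return 1
    fi
}
```

A driver loop would call `qc_file "$f"` inside an `if` for each input file, count the passes, and write the summary to the report file.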
Sequence Extraction Pipeline
Extract sequences by ID list and generate statistics:
extract_sequences.sh
Extract Genes of Interest
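Since the body of extract_sequences.sh is not shown, here is a hedged sketch of one way to pull records out of a FASTA file by ID and summarize the result. It assumes one ID per line in the ID list and that a sequence's ID is the first whitespace-delimited word after ">" in the header. The function names extract_by_id and fasta_stats are hypothetical.

```shell
#!/bin/bash
set -e -u -o pipefail

# Print every FASTA record whose ID appears in the ID list.
# Usage: extract_by_id ids.txt all_sequences.fasta > subset.fasta
extract_by_id() {
    local ids=$1 fasta=$2
    # An empty ID list would make awk treat the FASTA file as the list,
    # so reject it up front.
    [ -s "$ids" ] || { echo "ERROR: ID list is missing or empty: $ids" >&2; return 1; }
    [ -f "$fasta" ] || { echo "ERROR: FASTA file not found: $fasta" >&2; return 1; }
    awk 'NR == FNR { if ($1 != "") wanted[$1] = 1; next }   # first file: load IDs
         /^>/      { keep = (substr($1, 2) in wanted) }      # header: decide
         keep' "$ids" "$fasta"                               # print kept lines
}

# Report the number of sequences and the total number of bases.
fasta_stats() {
    awk '/^>/ { n++; next }
         { bases += length($0) }
         END { printf "sequences=%d bases=%d\n", n, bases }' "$1"
}
```

Loading the IDs into an awk array keeps the lookup to a single pass over the FASTA file, which matters once the ID list grows beyond a handful of genes.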
Paired-End Read Processing
Handle R1/R2 paired files with validation:
process_paired_end.sh
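The validation step might look something like the sketch below. It assumes mates are named `<sample>_R1.fastq` and `<sample>_R2.fastq` and are uncompressed; both the naming scheme and the function names are assumptions for illustration, not the contents of process_paired_end.sh. It checks only that both files exist and hold the same number of reads, not that read IDs match in order.

```shell
#!/bin/bash
set -e -u -o pipefail

# Count reads in an uncompressed FASTQ file (4 lines per record).
count_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Given an R1 path, validate the pair and print the R2 path on success.
validate_pair() {
    local r1=$1 r2 n1 n2
    case "$r1" in
        *_R1.fastq) r2="${r1%_R1.fastq}_R2.fastq" ;;
        *) echo "ERROR: $r1 does not end in _R1.fastq" >&2; return 1 ;;
    esac
    [ -f "$r1" ] || { echo "ERROR: missing $r1" >&2; return 1; }
    [ -f "$r2" ] || { echo "ERROR: missing mate $r2" >&2; return 1; }
    n1=$(count_reads "$r1")
    n2=$(count_reads "$r2")
    if [ "$n1" -ne "$n2" ]; then
        echo "ERROR: read counts differ ($n1 vs $n2) for $r1" >&2
        return 1
    fi
    echo "$r2"
}
```

In a loop over `*_R1.fastq`, a call such as `r2=$(validate_pair "$r1")` inside an `if` lets the script skip or abort on a broken pair before any expensive processing starts.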
Batch Annotation Pipeline
Extract features from GFF and generate summary:
annotate_batch.sh
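As a rough illustration of the feature extraction and summary steps, the sketch below relies on the standard nine-column, tab-separated GFF layout, where column 3 holds the feature type. The function names summarize_gff and extract_features are hypothetical and are not taken from annotate_batch.sh.

```shell
#!/bin/bash
set -e -u -o pipefail

# Count features by type (column 3), skipping comment and directive lines.
# Output: one "type<TAB>count" line per feature type, sorted by type.
summarize_gff() {
    awk -F'\t' '!/^#/ && NF >= 9 { count[$3]++ }
                END { for (t in count) printf "%s\t%d\n", t, count[t] }' "$1" | sort
}

# Print only the features of one type, e.g. extract_features genes.gff gene
extract_features() {
    local gff=$1 type=$2
    awk -F'\t' -v type="$type" '!/^#/ && $3 == type' "$gff"
}
```

For a batch run, both functions can be called inside a `for gff in *.gff` loop, with each summary written next to its input file.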
Best Practices Demonstrated
These pipelines show professional bioinformatics scripting:
✓ Safety flags - set -e -u -o pipefail
✓ Input validation - Check files exist and have correct format
✓ Logging - Timestamped progress messages
✓ Error handling - Graceful failures with helpful messages
✓ Progress tracking - Show what's happening
✓ Result validation - Verify outputs make sense
✓ Summary reporting - Generate human-readable reports
✓ Documentation - Comments and usage instructions
✓ Reusable functions - DRY (Don't Repeat Yourself)
✓ Configurable - Parameters at top, not hard-coded throughout
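Several of these practices fit into a short script skeleton. The version below is a sketch under stated assumptions: the configuration variables (MIN_QUALITY, OUTPUT_DIR), the helper names (log, die, usage, require_tool, require_file), and the argument layout are illustrative and are not taken from the pipelines above.

```shell
#!/bin/bash
# Safety flags: stop on errors, on unset variables, and on failed pipe stages.
set -e -u -o pipefail

# Configuration at the top; each value can be overridden from the environment.
MIN_QUALITY=${MIN_QUALITY:-20}
OUTPUT_DIR=${OUTPUT_DIR:-results}

# Logging: timestamped progress messages.
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"; }

# Error handling: report the problem on stderr, then stop.
die() { log "ERROR: $*" >&2; exit 1; }

# Documentation: a usage message for anyone running the script without arguments.
usage() { echo "Usage: $0 <input.fastq> [output_dir]" >&2; }

# Input validation helpers, reusable across scripts.
require_tool() { command -v "$1" > /dev/null 2>&1 || die "Required tool not found: $1"; }
require_file() { [ -s "$1" ] || die "Input file missing or empty: $1"; }

main() {
    [ "$#" -ge 1 ] || { usage; exit 1; }
    local input=$1
    OUTPUT_DIR=${2:-$OUTPUT_DIR}

    require_tool awk
    require_file "$input"
    mkdir -p "$OUTPUT_DIR"

    log "Input: $input (minimum quality: Q$MIN_QUALITY)"
    log "Output directory: $OUTPUT_DIR"
    # ... processing, result validation, and summary reporting go here ...
}

# In a real script, the last line would be: main "$@"
```

Wrapping the logic in `main` keeps every helper defined before it is used and makes it easy to source the file and exercise individual functions.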
Common Workflow Patterns
Pattern 1: Validate → Process → Report
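    #!/bin/bash
    set -e -u -o pipefail

    # validate_inputs, process_item, and create_report are placeholders
    # for functions defined elsewhere in the script.

    # Validate all inputs first
    validate_inputs

    # Process each item (here, every command-line argument)
    for item in "$@"; do
        process_item "$item"
    done

    # Generate summary report
    create_report
Pattern 2: Parallel-Safe Processing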
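    #!/bin/bash
    set -e -u -o pipefail
    shopt -s nullglob   # an unmatched *.fastq expands to nothing, not to itself

    # Process files independently (safe for parallel)
    for file in *.fastq; do
        # Skip outputs of a previous run, which also match *.fastq
        [[ "$file" == *_processed.fastq ]] && continue
        output="${file%.fastq}_processed.fastq"
        # Each iteration is independent
        process_file "$file" > "$output"
    done
Pattern 3: Checkpoint and Resume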
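    #!/bin/bash
    set -e -u -o pipefail

    # log and process_sample are placeholders defined elsewhere in the script.
    mkdir -p results

    # Read one sample name per line from samples.txt
    while IFS= read -r sample; do
        output="results/${sample}_done.txt"

        # Skip if already processed
        if [ -f "$output" ]; then
            log "Skipping $sample (already done)"
            continue
        fi

        # Process
        process_sample "$sample"

        # Mark complete
        touch "$output"
    done < samples.txt
Deployment Checklist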
Before running a pipeline on real data:
- Test on small subset - Verify logic with 2-3 files
- Check disk space - Ensure enough space for outputs
- Verify dependencies - All required tools available
- Set resource limits - Don't crash the server
- Plan for failures - What if it dies halfway through?
- Document runtime - How long should it take?
- Validate outputs - Spot check results make sense
- Keep logs - Redirect stdout/stderr to log files
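A few of the checklist items can be automated as a pre-flight step. The sketch below covers the dependency and disk-space checks. The function names (check_tools, free_kb, check_disk) are hypothetical, and free_kb assumes the POSIX `df -Pk` output format with the available space in the fourth column of the second line.

```shell
#!/bin/bash
set -e -u -o pipefail

# Verify dependencies: report every missing tool, then fail if any was missing.
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" > /dev/null 2>&1; then
            echo "Missing tool: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Free space, in kilobytes, on the filesystem that holds the given directory.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Check disk space: fail if fewer than the required kilobytes are available.
check_disk() {
    local dir=$1 need_kb=$2 avail
    avail=$(free_kb "$dir")
    if [ "$avail" -lt "$need_kb" ]; then
        echo "Only ${avail} KB free in $dir, need ${need_kb} KB" >&2
        return 1
    fi
}
```

A pipeline could begin with `check_tools awk sort gzip` and `check_disk results 10485760` (about 10 GB), and be launched as `./pipeline.sh > run.log 2>&1` so that stdout and stderr are both kept in the log.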
Next Steps
You've now seen how all the pieces fit together:
- Terminal navigation and file management
- Text processing with grep, sed, awk
- Variables and control flow
- Scripts and functions
- Error handling and debugging
- Production-ready pipelines
Continue Learning
- Practice with your data - Adapt these scripts to your projects
- Build a script library - Save and organize useful scripts
- Share with lab - Help others automate their workflows
- Keep learning - Explore advanced topics like parallel processing
Additional Topics to Explore
- Version control - Track script changes with git
- Remote computing - SSH and cluster computing
- Containers - Docker for reproducible environments
- Workflow managers - Snakemake, Nextflow for complex pipelines
- Performance optimization - Profiling and speeding up scripts
Further Reading
- GNU Bash Manual
- Advanced Bash-Scripting Guide
- Bioinformatics Data Skills by Vince Buffalo
- Command Line Bioinformatics
Congratulations! You've completed the Shell Scripting section. You have all the tools needed to automate bioinformatics workflows. The best way to solidify this knowledge is to start writing scripts for your own research projects.