Writing Shell Scripts

A shell script is a file containing commands that run sequentially. Instead of typing the same commands repeatedly, write them once in a script and execute it whenever needed.

Shell scripts are the bridge between individual commands and complex bioinformatics pipelines. A well-written script is self-documenting, reusable, and handles errors gracefully.

Your First Script

Create a file called hello.sh:

hello.sh

#!/bin/bash
# My first shell script

echo "Hello from shell script!"
echo "Current directory: $(pwd)"
echo "Today is: $(date)"

Format Details

Shebang: Tells system to use bash

Comment: Documentation for humans

Commands: Regular commands, one per line

Make it Executable

Input

chmod +x hello.sh

Make the script executable. +x adds execute permission.

Run the Script

Input

./hello.sh

Output

Hello from shell script!
Current directory: /home/user/analysis
Today is: Wed Nov 20 10:30:00 EST 2024

Execute the script. ./ means 'current directory'.

Why ./hello.sh and not just hello.sh?

For security, the current directory (.) is usually not in your PATH, so you must write ./ explicitly to run a script that lives there. This prevents a malicious file in the current directory, named after a common command such as ls, from running in that command's place.

The Shebang Line

The first line tells the system which interpreter to use.

Input

head -n 1 analyze.sh

Output

#!/bin/bash

Shebang line. Always first line, no spaces before #!

Common shebangs:

#!/bin/bash - Use bash (most common for bioinformatics)
#!/bin/sh - Use basic sh (more portable, fewer features)
#!/usr/bin/env bash - Find bash in PATH (works across systems)

Use #!/usr/bin/env bash for maximum portability. This finds bash wherever it's installed, rather than assuming it's in /bin/bash.
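To check which interpreter a script requests, and where env would find bash on the current system, something like the following could be used. This is a minimal sketch; show_interpreter is a hypothetical helper name, not a standard command.

```shell
#!/usr/bin/env bash
# show_interpreter is a hypothetical helper, not a standard command.
# It prints the interpreter named on the first line of a script.
show_interpreter() {
  local first_line
  # Read only the first line; "|| true" tolerates a file with no trailing newline
  IFS= read -r first_line < "$1" || true
  case "$first_line" in
    '#!'*) echo "${first_line:2}" ;;   # drop the leading "#!"
    *)     echo "no shebang" ;;
  esac
}

# Where "env bash" would find bash on this system
command -v bash
```

For a script that starts with #!/usr/bin/env bash, show_interpreter prints /usr/bin/env bash.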

Script Arguments

Access command-line arguments with special variables:

greet.sh

#!/bin/bash
# Demonstrate script arguments

echo "Script name: $0"
echo "First argument: $1"
echo "Second argument: $2"
echo "All arguments: $@"
echo "Number of arguments: $#"

Format Details

$0: Script name

$1, $2: Individual arguments

$@: All arguments as separate words

$#: Count of arguments

Input

./greet.sh Alice Bob

Output

Script name: ./greet.sh
First argument: Alice
Second argument: Bob
All arguments: Alice Bob
Number of arguments: 2

Arguments are separated by spaces. $1 is Alice, $2 is Bob.
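Because arguments are split on spaces, a value that itself contains a space must be quoted, and a script that passes its arguments on should use "$@" in quotes. The sketch below illustrates the difference; count_args, forward_quoted and forward_unquoted are hypothetical names chosen for this example.

```shell
#!/usr/bin/env bash
# count_args is a hypothetical helper that reports how many arguments it received.
count_args() {
  echo "$#"
}

# Two hypothetical wrappers that pass their arguments on in different ways.
forward_quoted()   { count_args "$@"; }   # each argument stays intact
forward_unquoted() { count_args $@; }     # arguments are split again on spaces

forward_quoted   "Sample 01" control      # prints 2
forward_unquoted "Sample 01" control      # prints 3
```

The unquoted form turns "Sample 01" into two separate words, which is why quoting "$@" is the safer habit.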

Checking Arguments

Always validate that required arguments are provided:

process_sample.sh

#!/bin/bash
# Process a sample with error checking

if [ $# -ne 1 ]; then
  echo "Usage: $0 <sample_id>"
  echo "Example: $0 Sample_01"
  exit 1
fi

sample_id=$1
echo "Processing sample: $sample_id"
# ... rest of analysis

Format Details

Check count: Verify exactly 1 argument

Usage message: Help user understand syntax

Exit error: Exit with non-zero code for error

Use argument: Store in descriptive variable

Input

./process_sample.sh

Output

Usage: ./process_sample.sh <sample_id>
Example: ./process_sample.sh Sample_01

Script rejects missing arguments with helpful message.
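One possible refinement is to send the usage message to standard error, so it does not mix with the script's normal output when that output is redirected to a file. The following is a sketch only; require_one_arg is a hypothetical helper, and it additionally rejects an empty argument, which the script above does not do.

```shell
#!/usr/bin/env bash
# require_one_arg is a hypothetical helper: it succeeds only when given
# exactly one non-empty argument. Otherwise it prints a usage message to
# standard error (>&2) and returns 1.
require_one_arg() {
  if [ "$#" -ne 1 ] || [ -z "$1" ]; then
    echo "Usage: process_sample.sh <sample_id>" >&2
    return 1
  fi
}

# In a script this would be called as: require_one_arg "$@" || exit 1
if require_one_arg "Sample_01"; then
  echo "Arguments OK"
fi
```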

Exit Codes

Scripts return exit codes: 0 for success, non-zero for errors.

Input

./successful_script.sh
echo $?

Output

0

The special variable $? contains the exit code of the last command. 0 means success.

check_file.sh

#!/bin/bash
# Check if required file exists

if [ ! -f "$1" ]; then
  echo "ERROR: File not found: $1"
  exit 1
fi

echo "File exists: $1"
exit 0

Format Details

Test existence: ! negates the test

Exit with error: 1 indicates failure

Exit success: 0 indicates success

Input

./check_file.sh missing.txt
echo "Exit code: $?"

Output

ERROR: File not found: missing.txt
Exit code: 1

Non-zero exit code signals error to calling program.

Using Exit Codes in Workflows

Input

if ./check_file.sh sequences.fasta; then
  echo "File OK, proceeding with analysis"
else
  echo "Cannot proceed - fix errors first"
fi

Output

File exists: sequences.fasta
File OK, proceeding with analysis

Use script exit codes in conditionals. Enables chaining scripts.
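The same idea works with the shorter || and && operators: the command after || runs only on failure, and the command after && runs only on success. A minimal sketch, in which file_ok and analyze are hypothetical functions standing in for separate scripts and the file path is made up:

```shell
#!/usr/bin/env bash
# file_ok is a hypothetical stand-in for check_file.sh:
# it returns 0 when the file exists and 1 otherwise.
file_ok() {
  [ -f "$1" ]
}

# analyze is a hypothetical step that stops early when its input is missing.
analyze() {
  file_ok "$1" || { echo "Cannot proceed: $1 not found" >&2; return 1; }
  echo "File OK, proceeding with analysis"
}

# The path below is made up, so analyze fails and the || branch runs.
analyze /nonexistent/sequences.fasta || echo "Exit code: $?"
```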

Practical Bioinformatics Scripts

Count Reads in FASTQ

count_reads.sh

#!/bin/bash
# Count reads in a FASTQ file
# Usage: ./count_reads.sh <fastq_file>

if [ $# -ne 1 ]; then
  echo "Usage: $0 <fastq_file>"
  exit 1
fi

fastq_file=$1

# Check file exists
if [ ! -f "$fastq_file" ]; then
  echo "ERROR: File not found: $fastq_file"
  exit 1
fi

# Count lines and divide by 4
lines=$(wc -l < "$fastq_file")
reads=$((lines / 4))

echo "$fastq_file: $reads reads"

Format Details

Description: Explain what script does

Usage doc: Show how to run it

Validate args: Check argument count

Validate input: Check file exists

Calculate: FASTQ has 4 lines per read

Input

chmod +x count_reads.sh
./count_reads.sh sample.fastq

Output

sample.fastq: 1234567 reads

Reusable script for any FASTQ file.
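Dividing by 4 silently rounds down, so a truncated file still produces a plausible-looking count. One way to guard against that is to warn when the line count is not a multiple of 4. The sketch below wraps the logic in a function (count_reads is a hypothetical name) and builds a tiny made-up FASTQ file to try it on.

```shell
#!/usr/bin/env bash
# count_reads is a hypothetical function version of count_reads.sh.
# It prints the read count and warns on standard error when the line
# count is not a multiple of 4 (a sign of a truncated or malformed file).
count_reads() {
  local lines
  lines=$(wc -l < "$1")
  lines=$((lines + 0))   # normalise: some wc versions pad the number with spaces
  if [ $((lines % 4)) -ne 0 ]; then
    echo "WARNING: $1 has $lines lines, not a multiple of 4" >&2
  fi
  echo $((lines / 4))
}

# Demo on a made-up two-read file
demo=$(mktemp)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > "$demo"
count_reads "$demo"    # prints 2
rm -f "$demo"
```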

Extract Gene Sequences

extract_genes.sh

#!/bin/bash
# Extract specific genes from FASTA by ID list
# Usage: ./extract_genes.sh <fasta> <gene_list>

if [ $# -ne 2 ]; then
  echo "Usage: $0 <fasta_file> <gene_list>"
  exit 1
fi

fasta=$1
gene_list=$2

# Validate inputs
if [ ! -f "$fasta" ]; then
  echo "ERROR: FASTA file not found: $fasta"
  exit 1
fi

if [ ! -f "$gene_list" ]; then
  echo "ERROR: Gene list not found: $gene_list"
  exit 1
fi

# Extract sequences
output="extracted_genes.fasta"
> "$output"  # Create empty file

while read -r gene_id; do
  # Skip blank lines; an empty ID would match every header
  [ -n "$gene_id" ] || continue

  # Extract header and the sequence line after it
  # (assumes each sequence is on a single line)
  awk -v id="$gene_id" '
    $0 ~ "^>"id {print; getline; print}
  ' "$fasta" >> "$output"
done < "$gene_list"

echo "Extracted $(grep -c "^>" "$output") sequences to $output"

Format Details

Input validation: Check both files exist

Create output: > creates empty file

Read list: Process each gene ID

Extract seq: Use awk to find and extract

Report results: Count extracted sequences

Input

cat genes_of_interest.txt

Output

AT1G01010
AT1G01020
AT1G01030

Gene list file - one ID per line.

Input

./extract_genes.sh genome.fasta genes_of_interest.txt

Output

Extracted 3 sequences to extracted_genes.fasta

Extract specific genes from genome FASTA.
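The script above reads the FASTA file once per gene ID and prints only one sequence line per header. For large genomes or multi-line sequences, a single awk pass may be preferable. The following is a sketch of that alternative, not a drop-in replacement: extract_by_id is a hypothetical name, and it requires the first word of the header to equal a listed ID exactly rather than matching a prefix.

```shell
#!/usr/bin/env bash
# extract_by_id is a hypothetical helper. Usage: extract_by_id <fasta> <gene_list>
# awk reads the ID list first, then prints every FASTA record whose ID is listed.
extract_by_id() {
  awk '
    # While reading the first file (the ID list): remember each non-empty ID
    FILENAME == ARGV[1] { if ($1 != "") wanted[$1] = 1; next }
    # On a header line: the ID is the first word after ">"
    /^>/ { keep = (substr($1, 2) in wanted) }
    # Print every line (header or sequence) of a wanted record
    keep
  ' "$2" "$1"
}

# Demo on made-up data
work=$(mktemp -d)
printf 'AT1G01010\n' > "$work/ids.txt"
printf '>AT1G01010 geneA\nACGT\nGGCC\n>AT1G01020 geneB\nTTTT\n' > "$work/genome.fasta"
extract_by_id "$work/genome.fasta" "$work/ids.txt"
rm -r "$work"
```

In the demo, only the AT1G01010 record is printed, including both of its sequence lines.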

Script Structure Best Practices

Well-Structured Script Template

#!/bin/bash
# Script: analyze_sample.sh
# Description: Quality check and basic stats for FASTQ
# Author: Your Name
# Date: 2024-11-20
# Usage: ./analyze_sample.sh <sample.fastq>

# Exit on any error
set -e
set -u           # Exit on undefined variable
set -o pipefail  # Catch errors in pipes

# Validate arguments
if [ $# -ne 1 ]; then
  echo "Usage: $0 <fastq_file>"
  exit 1
fi

# Store arguments in named variables
fastq_file=$1

# Check inputs exist
if [ ! -f "$fastq_file" ]; then
  echo "ERROR: File not found: $fastq_file"
  exit 1
fi

# Main analysis
echo "Analyzing $fastq_file..."

# FASTQ stores each read on 4 lines
total_lines=$(wc -l < "$fastq_file")
reads=$((total_lines / 4))

echo "Total reads: $reads"
echo "Analysis complete"

exit 0

Format Details

Header: Document the script

Safety flags: Exit on errors

Validation: Check all inputs

Named vars: Use descriptive names

Main logic: Core functionality

Safety Flags

set -e - Exit immediately if a command fails
set -u - Exit if an undefined variable is used
set -o pipefail - Make a pipeline fail if any command in it fails, not just the last one

These prevent silent errors and make scripts more robust. Add them after the shebang line.
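The effect of pipefail and set -u is easiest to see by comparing exit codes. In this sketch each option is switched on inside a subshell, written as ( ... ), so it does not affect the rest of the script; pipeline_status and unset_var_status are hypothetical helper names.

```shell
#!/usr/bin/env bash
# pipeline_status is a hypothetical helper. It runs "false | true" with
# pipefail on or off (argument: "on" or "off") and prints the exit code.
pipeline_status() {
  local rc=0
  if [ "$1" = "on" ]; then
    ( set -o pipefail; false | true ) || rc=$?
  else
    ( set +o pipefail; false | true ) || rc=$?
  fi
  echo "$rc"
}

# unset_var_status is a hypothetical helper. It prints the exit code of a
# subshell that reads a variable that was never set while set -u is active.
unset_var_status() {
  local rc=0
  ( set -u; echo "$this_variable_is_not_set" ) 2>/dev/null || rc=$?
  echo "$rc"
}

echo "pipefail off: $(pipeline_status off)"   # prints "pipefail off: 0"
echo "pipefail on: $(pipeline_status on)"     # prints "pipefail on: 1"
echo "unset variable with set -u: exit code $(unset_var_status)"
```

Without pipefail, the failure of false is hidden because only the last command in the pipeline (true) determines the exit code.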

Looping Over Multiple Files

batch_process.sh

#!/bin/bash
# Process all FASTQ files in directory

set -e

for fastq in *.fastq; do
  # Skip if no files match
  [ -f "$fastq" ] || continue

  echo "Processing $fastq..."

  # Count reads
  lines=$(wc -l < "$fastq")
  reads=$((lines / 4))

  # Write to summary (tab-separated)
  printf '%s\t%s\n' "$fastq" "$reads" >> read_counts.txt
done

echo "Summary written to read_counts.txt"

Format Details

Loop files: Process all FASTQ in directory

Skip non-match: Handle case of no *.fastq files

Calculate: Count reads per file

Append results: >> appends to file

Input

./batch_process.sh

Output

Processing Sample_01.fastq...
Processing Sample_02.fastq...
Processing Sample_03.fastq...
Summary written to read_counts.txt

Process all FASTQ files automatically.

Input

cat read_counts.txt

Output

Sample_01.fastq	1234567
Sample_02.fastq	2345678
Sample_03.fastq	987654

Results compiled in tab-separated file.
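When no file matches, bash leaves the pattern *.fastq unexpanded, which is why the loop needs the [ -f ... ] check. A bash-specific alternative is the nullglob option, which makes an unmatched pattern expand to nothing so the loop body never runs. A minimal sketch, with list_fastq as a hypothetical helper name:

```shell
#!/usr/bin/env bash
# list_fastq is a hypothetical helper that prints each *.fastq file name
# in the directory given as its argument.
list_fastq() {
  (
    # Subshell: the cd and shopt below do not affect the caller
    cd "$1" || exit 1
    shopt -s nullglob          # unmatched patterns expand to nothing
    for fastq in *.fastq; do
      echo "$fastq"
    done
  )
}

# Demo on an empty temporary directory: prints nothing
empty_dir=$(mktemp -d)
list_fastq "$empty_dir"
rmdir "$empty_dir"
```

Note also that batch_process.sh appends with >>, so running it twice adds duplicate rows unless read_counts.txt is emptied or removed first.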

Processing Paired-End Reads

process_pairs.sh

#!/bin/bash
# Process paired-end FASTQ files

set -e

for r1 in *_R1.fastq; do
  # Skip if no R1 files
  [ -f "$r1" ] || continue

  # Derive R2 and sample names
  r2=${r1/_R1.fastq/_R2.fastq}
  sample=${r1/_R1.fastq/}

  # Check R2 exists
  if [ ! -f "$r2" ]; then
    echo "WARNING: Missing R2 for $sample, skipping"
    continue
  fi

  echo "Processing $sample..."
  echo " R1: $r1"
  echo " R2: $r2"

  # Your analysis commands here
  # e.g., alignment, quality control, etc.

done

echo "Paired-end processing complete"

Format Details

Find R2: Replace _R1 with _R2 in filename

Sample ID: Remove _R1.fastq suffix

Verify pair: Check both files exist

Analysis: Add your pipeline here

Input

./process_pairs.sh

Output

Processing Sample_01...
 R1: Sample_01_R1.fastq
 R2: Sample_01_R2.fastq
Processing Sample_02...
 R1: Sample_02_R1.fastq
 R2: Sample_02_R2.fastq
Paired-end processing complete

Automatically match R1/R2 pairs.
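The ${var/old/new} form replaces the first match anywhere in the value. When the text to change is always at the end of the file name, the suffix-removal form ${var%suffix} is a slightly stricter alternative. A short sketch, with sample_of and mate_of as hypothetical helper names:

```shell
#!/usr/bin/env bash
# sample_of and mate_of are hypothetical helpers built on parameter expansion.
# ${1%_R1.fastq} removes "_R1.fastq" only when it appears at the END of the value.
sample_of() { echo "${1%_R1.fastq}"; }
mate_of()   { echo "${1%_R1.fastq}_R2.fastq"; }

sample_of Sample_01_R1.fastq    # prints Sample_01
mate_of   Sample_01_R1.fastq    # prints Sample_01_R2.fastq
```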

Script Arguments vs Hard-Coded Values

Flexible vs Rigid Scripts

# Bad: Hard-coded
#!/bin/bash
fastqc Sample_01.fastq
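For contrast, an argument-driven version might look like the sketch below. It is an illustration only: run_qc is a hypothetical function, and QC_TOOL is an assumed variable that names the program to run, defaulting to fastqc as in the hard-coded example.

```shell
#!/usr/bin/env bash
# run_qc is a hypothetical wrapper. QC_TOOL is an assumed variable naming
# the program to run; it defaults to fastqc.
run_qc() {
  local tool="${QC_TOOL:-fastqc}"
  if [ "$#" -lt 1 ]; then
    echo "Usage: run_qc <fastq_file>..." >&2
    return 1
  fi
  local fastq
  for fastq in "$@"; do
    "$tool" "$fastq"     # one call per file given on the command line
  done
}

# Dry run: substitute echo for the real tool to see which files would be processed
QC_TOOL=echo run_qc Sample_01.fastq Sample_02.fastq
```

The same code now works for any number of samples, and the file names are chosen when the script is run rather than when it is written.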

Debugging Scripts

Input

bash -x script.sh

Output

+ echo 'Starting analysis'
Starting analysis
+ sample=Sample_01
+ echo 'Processing Sample_01'

-x shows each command before executing. Invaluable for debugging.

Input

bash -n script.sh

-n checks syntax without running. Catches typos before execution.

Add debug output to your scripts:

Script with Debug Mode

#!/bin/bash
# Enable debug output with: DEBUG=1 ./script.sh

# ${DEBUG:-0} falls back to 0 when DEBUG is unset, so this also works with set -u
if [ "${DEBUG:-0}" = "1" ]; then
  set -x  # Enable command tracing
fi

sample=$1
echo "Processing $sample"

Format Details

Debug check: Only enable if DEBUG set

Trace mode: Show commands when DEBUG=1

Input

DEBUG=1 ./script.sh Sample_01

Output

+ sample=Sample_01
+ echo 'Processing Sample_01'
Processing Sample_01

Set DEBUG=1 to see detailed execution.
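Command tracing shows everything, which can be noisy. A lighter option is a small logging function that prints selected messages only when DEBUG=1. This is a sketch under that assumption; debug is a hypothetical function name, not a bash builtin.

```shell
#!/usr/bin/env bash
# debug is a hypothetical logging helper: it prints its message to
# standard error only when the DEBUG environment variable is set to 1.
# ${DEBUG:-0} falls back to 0 when DEBUG is unset, so it is safe with set -u.
debug() {
  if [ "${DEBUG:-0}" = "1" ]; then
    echo "DEBUG: $*" >&2
  fi
}

sample="Sample_01"
debug "sample is set to $sample"   # shown only when DEBUG=1
echo "Processing $sample"
```

Writing the messages to standard error keeps them out of any output that is piped or redirected to a results file.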

Quick Reference

Script Basics

#!/bin/bash                  # Shebang line
chmod +x script.sh           # Make executable
./script.sh                  # Run script
bash script.sh               # Run with bash explicitly

Arguments

$0                          # Script name
$1, $2, $3                  # Individual arguments
$@                          # All arguments
$#                          # Number of arguments
"$@"                        # All arguments (proper quoting)

Exit and Error Handling

exit 0                      # Exit success
exit 1                      # Exit with error
$?                          # Last command's exit code
set -e                      # Exit on error
set -u                      # Exit on undefined variable
set -o pipefail             # Catch pipe errors

Script Structure

#!/bin/bash
# Description and usage
set -e -u -o pipefail
 
# Validate arguments
if [ $# -ne 1 ]; then
  echo "Usage: $0 <arg>"
  exit 1
fi
 
# Check inputs
if [ ! -f "$1" ]; then
  echo "ERROR: File not found"
  exit 1
fi
 
# Main logic here
 
exit 0

Best Practices Summary

Script Writing Checklist

Always include shebang - #!/bin/bash on line 1
Add usage documentation - Comments explaining what script does
Validate all arguments - Check count and types
Check file existence - Before reading/writing
Use descriptive variable names - sample_id not s
Quote variables - "$var" prevents word splitting
Exit with proper codes - 0 for success, 1+ for errors
Add safety flags - set -e, set -u, set -o pipefail
Make scripts executable - chmod +x script.sh
Test with edge cases - Missing files, empty inputs, etc.

Next Steps

You can now write scripts that:

Accept arguments for flexibility
Validate inputs before processing
Handle errors gracefully
Process multiple files automatically

The next page covers functions (reusable code blocks within scripts) and advanced debugging techniques to make your scripts even more robust and maintainable.