
Variables and Control Flow

Shell scripting turns individual commands into automated workflows. Variables store data, conditionals make decisions, and loops process multiple items. All three are essential skills for analyzing genomics datasets.

Think of shell scripting as writing recipes for your data analysis. Instead of manually processing 100 FASTQ files, write a script once and let it run automatically.

Variables

Variables store values you want to reuse. In genomics, this might be file paths, sample names, or parameters.

Creating Variables

Input
sample_id=Sample_01

Create a variable. No spaces around the = sign!

Common mistake: sample_id = Sample_01 will fail. Shell interprets this as running a command called sample_id with arguments = and Sample_01. Never use spaces around = when assigning variables.
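A minimal sketch of the assignment rules above. The variable names and values are illustrative only.

```shell
# Correct: no spaces around =
sample_id=Sample_01

# A value that contains spaces must be quoted; otherwise the shell
# treats the words after the first one as a command to run
description="Liver tissue, replicate 1"

echo "$sample_id"
echo "$description"
```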

Using Variables

Input
sample_id=Sample_01
echo $sample_id
Output
Sample_01

Access variable value with $. The $ expands the variable to its value.

Input
sample_id=Sample_01
echo "Processing ${sample_id}_R1.fastq"
Output
Processing Sample_01_R1.fastq

Use ${variable} syntax when adjacent to other text. Prevents ambiguity.
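The ambiguity is easiest to see side by side. In this sketch the shell reads the unbraced form as a variable called sample_id_R1, which does not exist, so it expands to nothing.

```shell
sample_id=Sample_01

# Without braces the shell looks up a variable named "sample_id_R1",
# which is unset, so only ".fastq" is left
without_braces="$sample_id_R1.fastq"

# With braces the variable name ends at the closing brace
with_braces="${sample_id}_R1.fastq"

echo "Without braces: $without_braces"
echo "With braces:    $with_braces"
```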

Quotes Matter

Input
message="Hello World"
echo $message
Output
Hello World

Double quotes in the assignment keep the spaces as part of a single value. Without them, the shell would assign Hello and then try to run World as a command.

Input
files="*.fastq"
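echo "$files"
Output
*.fastq

Double quotes prevent glob expansion, so the * remains literal. Quote the variable when you use it as well: an unquoted $files is expanded to the matching file names if any exist in the current directory.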
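Whether a stored pattern expands depends on how the variable is used, not only on how it was assigned. The sketch below compares quoted and unquoted use in a scratch directory; the file names are made up for the illustration.

```shell
# Work in a scratch directory so the example does not touch real data
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch Sample_01.fastq Sample_02.fastq

files="*.fastq"

# Quoted: the pattern is printed literally
quoted=$(echo "$files")

# Unquoted: the shell expands the pattern to the matching file names
unquoted=$(echo $files)

echo "Quoted:   $quoted"
echo "Unquoted: $unquoted"

# Return to the previous directory and clean up
cd - > /dev/null
rm -r "$workdir"
```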

Input
name=Sample_01
echo "Processing $name"
Output
Processing Sample_01

Variables expand inside double quotes. Allows building strings.

Input
name=Sample_01
echo 'Processing $name'
Output
Processing $name

Single quotes prevent all expansion. The $name stays literal.

Command Substitution

Capture command output into variables:

Input
count=$(wc -l < genes.txt)
echo "Found $count genes"
Output
Found 15234 genes

$() runs the command and captures its output. Store results for later use.

Input
today=$(date +%Y-%m-%d)
echo "Analysis started on $today"
Output
Analysis started on 2024-11-20

Capture the current date. Useful for timestamping results.

Input
total=$(awk '{sum += $2} END {print sum}' counts.txt)
echo "Total reads: $total"
Output
Total reads: 52345678

Store calculation results. Process data and save the answer.
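The commands above assume that genes.txt and counts.txt already exist. The following self-contained sketch builds a small, made-up counts table first so that it can be run anywhere; the gene names and numbers are invented.

```shell
# Build a tiny two-column table: gene name, read count
counts_file=$(mktemp)
printf 'geneA\t120\ngeneB\t80\ngeneC\t300\n' > "$counts_file"

# Redirecting with < keeps the file name out of the wc output;
# tr strips the padding that some wc implementations add
n_genes=$(wc -l < "$counts_file" | tr -d ' ')

# Sum the second column with awk and capture the result
total=$(awk '{sum += $2} END {print sum}' "$counts_file")

echo "Genes: $n_genes"
echo "Total reads: $total"

rm "$counts_file"
```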

Practical Variable Examples

Process Sample with Variables

sample=Sample_01
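One way a sample name stored in a variable might be reused to build related file paths. This is only a sketch: the directory names and file suffixes are assumptions made for the illustration.

```shell
sample=Sample_01
input_dir=raw_data      # hypothetical input directory
output_dir=results      # hypothetical output directory

# Build every path from the same variables so a single edit changes them all
r1="${input_dir}/${sample}_R1.fastq"
r2="${input_dir}/${sample}_R2.fastq"
report="${output_dir}/${sample}_report.txt"

echo "Reads:  $r1 $r2"
echo "Report: $report"
```

Changing the first line to another sample name updates all three paths.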

If Statements

Make decisions based on conditions. Essential for error checking and conditional processing.

Basic If Syntax

Input
if [ -f sequences.fasta ]; then
  echo "File exists"
fi
Output
File exists

Check if file exists before processing. Prevents errors from missing files.

If Statement Syntax

if [ condition ]; then - Start conditional block. Spaces around brackets required!

then - Marks the beginning of commands to run when true.

fi - End of if statement (if backwards).

Important: Spaces are mandatory because [ is itself a command: [ -f file ] works, [-f file] fails.
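The spacing rule makes more sense once [ is seen as another spelling of the test command. A minimal sketch that runs the same check both ways, using the root directory because it always exists:

```shell
# test and [ perform the same check; [ additionally expects a closing ]
if test -d /; then
  with_test="directory"
fi

if [ -d / ]; then
  with_bracket="directory"
fi

echo "test -d /  -> $with_test"
echo "[ -d / ]   -> $with_bracket"
```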

File Tests

Input
if [ -f data.txt ]; then
  echo "Regular file exists"
fi
Output
Regular file exists

-f tests for regular file existence. Most common file check.

Input
if [ -d results ]; then
  echo "Directory exists"
fi
Output
Directory exists

-d tests for directory existence. Check before creating output directories.

Input
if [ -s sequences.fastq ]; then
  echo "File exists and is not empty"
fi
Output
File exists and is not empty

-s tests that the file exists and is not empty. Catch empty output files early.
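The three tests can be combined to classify a path. In this sketch, describe is a hypothetical helper written for the example, and the scratch files are made up. The directory test comes first because -s is also true for directories.

```shell
# describe is a hypothetical helper: it reports what kind of path it was given
describe() {
  if [ -d "$1" ]; then
    echo "directory"
  elif [ -s "$1" ]; then
    echo "non-empty file"
  elif [ -f "$1" ]; then
    echo "empty file"
  else
    echo "missing"
  fi
}

# Scratch data for the demonstration
workdir=$(mktemp -d)
mkdir "$workdir/results"
touch "$workdir/empty.fastq"
printf '@read1\nACGT\n+\nIIII\n' > "$workdir/reads.fastq"

describe "$workdir/results"
describe "$workdir/reads.fastq"
describe "$workdir/empty.fastq"
describe "$workdir/absent.fastq"

rm -r "$workdir"
```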

If-Else

Input
if [ -f results.txt ]; then
  echo "Results found"
else
  echo "No results - running analysis"
fi
Output
No results - running analysis

else handles the false case. Take different actions based on condition.

If-Elif-Else

Input
quality=35
if [ $quality -ge 40 ]; then
  echo "Excellent quality"
elif [ $quality -ge 30 ]; then
  echo "Good quality"
else
  echo "Poor quality"
fi
Output
Good quality

Multiple conditions with elif. Categorize data into bins.
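Conditions are tested from top to bottom and the first true branch wins, so the order of the thresholds matters. A sketch that wraps the same thresholds in a function (classify_quality is a hypothetical name) and tries several scores:

```shell
# classify_quality is a hypothetical helper using the thresholds shown above
classify_quality() {
  if [ "$1" -ge 40 ]; then
    echo "Excellent quality"
  elif [ "$1" -ge 30 ]; then
    echo "Good quality"
  else
    echo "Poor quality"
  fi
}

for q in 42 35 18; do
  echo "Q$q: $(classify_quality "$q")"
done
```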

Numeric Comparisons

Input
count=100
if [ $count -gt 50 ]; then
  echo "More than 50"
fi
Output
More than 50

-gt means greater than. Other operators: -lt (less than), -eq (equal), -ne (not equal), -ge (greater or equal), -le (less or equal).

String Comparisons

Input
filetype=fastq
if [ "$filetype" = "fastq" ]; then
  echo "Processing FASTQ file"
fi
Output
Processing FASTQ file

= tests string equality. Use quotes around variables to handle empty values safely.

Input
if [ -z "$empty_var" ]; then
  echo "Variable is empty"
fi
Output
Variable is empty

-z tests whether a string is empty (an unset variable also counts as empty). Use it to check that required variables are set.
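Quoting matters most when a variable turns out to be empty. With quotes, the test receives an empty string and is simply false; without them, the variable disappears and [ reports a syntax error. The sketch below also uses -n, the counterpart of -z, which is true for a non-empty string.

```shell
filetype=""   # simulate a variable that was never filled in

# Quoted: [ sees an empty string, "=", and "fastq", so the test is just false
if [ "$filetype" = "fastq" ]; then
  result="fastq"
else
  result="not fastq"
fi
echo "Quoted test: $result"

# -z catches the empty value explicitly
if [ -z "$filetype" ]; then
  echo "filetype is empty"
fi

# -n is the opposite: true when the string is non-empty
filetype=fastq
if [ -n "$filetype" ]; then
  echo "filetype is set to $filetype"
fi
```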

For Loops

Process multiple items automatically. The foundation of batch processing in bioinformatics.

Loop Over List

Input
for sample in Sample_01 Sample_02 Sample_03; do
  echo "Processing $sample"
done
Output
Processing Sample_01
Processing Sample_02
Processing Sample_03

Loop through explicit list. The variable sample takes each value in turn.

Loop Over Files

Input
for file in *.fastq; do
  echo "Found: $file"
done
Output
Found: Sample_01_R1.fastq
Found: Sample_01_R2.fastq
Found: Sample_02_R1.fastq

Loop through files matching pattern. Process all FASTQs automatically.

If no files match the pattern, the loop runs once with the literal string *.fastq. Either check that each file exists inside the loop, or (in bash) run shopt -s nullglob so that an unmatched pattern expands to nothing and the loop is skipped.
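The difference can be observed in an empty scratch directory. This sketch is bash-specific because of shopt; it counts loop iterations under the default behaviour, with nullglob, and with an existence check inside the loop.

```shell
# Empty scratch directory: no .fastq files exist here
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Default behaviour: the unmatched pattern is passed through literally
default_count=0
for file in *.fastq; do
  default_count=$((default_count + 1))
done

# With nullglob an unmatched pattern expands to nothing (bash only)
shopt -s nullglob
nullglob_count=0
for file in *.fastq; do
  nullglob_count=$((nullglob_count + 1))
done
shopt -u nullglob

# Portable alternative: skip anything that does not exist
checked_count=0
for file in *.fastq; do
  [ -e "$file" ] || continue
  checked_count=$((checked_count + 1))
done

echo "default: $default_count, nullglob: $nullglob_count, checked: $checked_count"

cd - > /dev/null
rmdir "$workdir"
```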

Process Multiple Samples

Quality Check All Samples

for fastq in *.fastq; do
  lines=$(wc -l < "$fastq")
  reads=$((lines / 4))
  echo "$fastq: $reads reads"
done
Output
Sample_01.fastq: 1000000 reads
Sample_02.fastq: 1234567 reads
Sample_03.fastq: 987654 reads

Arithmetic in Loops

Input
for i in {1..5}; do
  echo "Processing batch $i"
done
Output
Processing batch 1
Processing batch 2
Processing batch 3
Processing batch 4
Processing batch 5

{1..5} generates sequence 1 through 5. Useful for numbered samples.

Input
for i in {01..03}; do
  echo "Sample_$i"
done
Output
Sample_01
Sample_02
Sample_03

Use leading zeros for consistent naming. {01..03} preserves the padding in bash 4.0 and later; older versions, such as the bash 3.2 shipped with macOS, drop the zeros.
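Where padded brace expansion is not available, printf can add the zeros instead. A small sketch; the Sample_ prefix simply mirrors the example above.

```shell
for i in 1 2 3; do
  # %02d pads the number with zeros to a width of two digits
  sample=$(printf 'Sample_%02d' "$i")
  echo "$sample"
done
```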

While Loops

Repeat while a condition is true. Useful for reading files line by line.

Basic While Loop

Input
counter=1
while [ $counter -le 3 ]; do
  echo "Iteration $counter"
  counter=$((counter + 1))
done
Output
Iteration 1
Iteration 2
Iteration 3

Loop while condition is true. Increment counter each time.

Read File Line by Line

Input
while read -r sample; do
  echo "Processing: $sample"
done < samples.txt
Output
Processing: Sample_01
Processing: Sample_02
Processing: Sample_03

Read each line into variable sample. Process sample list from file.

Always use read -r. The -r flag stops read from treating backslashes as escape characters, preserving the exact line content. Without it, a path like C:\data\files would be read as C:datafiles.
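The effect of -r can be checked directly by reading the same line twice. A minimal sketch using a temporary file:

```shell
# Write one line containing backslashes to a scratch file
line_file=$(mktemp)
printf '%s\n' 'C:\data\files' > "$line_file"

# With -r the backslashes are kept exactly as written
read -r with_r < "$line_file"

# Without -r each backslash is consumed as an escape character
read without_r < "$line_file"

printf 'with -r:    %s\n' "$with_r"
printf 'without -r: %s\n' "$without_r"

rm "$line_file"
```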

Process Samples from List

Analyze Samples from File

cat samples.txt
Output
Sample_01
Sample_02
Sample_03
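A sketch of how a sample list like this might drive an analysis loop. It assumes each sample has a file named after it with a .fastq suffix, which is an assumption for the illustration; the scratch directory and its contents are created only so the example can run on its own.

```shell
# Scratch data: three listed samples, one of which has no FASTQ file
workdir=$(mktemp -d)
printf 'Sample_01\nSample_02\nSample_03\n' > "$workdir/samples.txt"
touch "$workdir/Sample_01.fastq" "$workdir/Sample_03.fastq"

found=0
missing=0

# IFS= keeps leading and trailing whitespace; -r keeps backslashes
while IFS= read -r sample; do
  fastq="$workdir/${sample}.fastq"
  if [ -f "$fastq" ]; then
    echo "$sample: found"
    found=$((found + 1))
  else
    echo "$sample: MISSING"
    missing=$((missing + 1))
  fi
done < "$workdir/samples.txt"

echo "Found $found, missing $missing"
rm -r "$workdir"
```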

Combining Conditionals and Loops

Real workflows combine these tools:

Quality Filter Multiple Samples

for fastq in *.fastq; do
  if [ -f "$fastq" ]; then
    lines=$(wc -l < "$fastq")
    reads=$((lines / 4))
    
    if [ $reads -gt 1000000 ]; then
      echo "$fastq: PASS ($reads reads)"
    else
      echo "$fastq: FAIL ($reads reads - too few)"
    fi
  fi
done
Output
Sample_01.fastq: PASS (1234567 reads)
Sample_02.fastq: FAIL (456789 reads - too few)
Sample_03.fastq: PASS (2345678 reads)

Practical Bioinformatics Example

Batch Sequence Length Analysis

for fasta in *.fasta; do
  echo "Analyzing $fasta"
  
  total=$(grep -v "^>" "$fasta" | tr -d '\n' | wc -c)
  seqs=$(grep -c "^>" "$fasta")
  
  if [ $seqs -gt 0 ]; then
    avg=$((total / seqs))
    echo "  Sequences: $seqs"
    echo "  Average length: $avg bp"
  else
    echo "  WARNING: No sequences found"
  fi
  echo ""
done
Output
Analyzing genes.fasta
  Sequences: 27655
  Average length: 1543 bp

Analyzing proteins.fasta
  Sequences: 35386
  Average length: 387 bp

Analyzing empty.fasta
  WARNING: No sequences found
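Shell arithmetic with $(( )) works on integers only, so the average length above is rounded down. When decimals are wanted, awk can do the division. A sketch with made-up totals:

```shell
total=1000   # made-up total number of bases
seqs=3       # made-up number of sequences

# Integer division: the fractional part is discarded
avg_int=$((total / seqs))

# awk performs floating-point division; %.1f keeps one decimal place
avg_dec=$(awk -v t="$total" -v s="$seqs" 'BEGIN { printf "%.1f", t / s }')

echo "Integer average: $avg_int bp"
echo "Decimal average: $avg_dec bp"
```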

Common Patterns

Check Required Files Exist

Input
required_files="sample.fastq reference.fasta annotation.gff"

for file in $required_files; do
  if [ ! -f "$file" ]; then
    echo "ERROR: Missing required file: $file"
    exit 1
  fi
done

echo "All required files present"
Output
All required files present

Validate inputs before analysis. Exit early if anything is missing.
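The space-separated string works as long as no file name contains spaces. A bash array is one alternative that handles such names. This sketch uses hypothetical file names in a scratch directory and counts the missing files instead of exiting, so that every problem is reported in one pass.

```shell
# Scratch directory with two of the three required files present
workdir=$(mktemp -d)
touch "$workdir/sample.fastq" "$workdir/reference genome.fasta"

# Bash array: each element stays intact, even with a space in the name
required_files=(
  "$workdir/sample.fastq"
  "$workdir/reference genome.fasta"
  "$workdir/annotation.gff"
)

missing=0
for file in "${required_files[@]}"; do
  if [ ! -f "$file" ]; then
    echo "ERROR: Missing required file: $file" >&2
    missing=$((missing + 1))
  fi
done

# A real script would typically run "exit 1" here when missing is above 0
echo "$missing of ${#required_files[@]} required files missing"

rm -r "$workdir"
```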

Create Output Directories

Input
for dir in results logs temp; do
  if [ ! -d "$dir" ]; then
    mkdir "$dir"
    echo "Created $dir/"
  fi
done
Output
Created results/
Created logs/
Created temp/

Ensure output directories exist. Create only if needed.
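When no message is needed for each new directory, mkdir -p does the check and the creation in one step. A sketch in a scratch directory:

```shell
workdir=$(mktemp -d)

# -p creates any missing directories and is silent if they already exist
mkdir -p "$workdir/results" "$workdir/logs" "$workdir/temp"

# Running it again is harmless
mkdir -p "$workdir/results"

created=$(ls "$workdir" | wc -l | tr -d ' ')
echo "Directories present: $created"

rm -r "$workdir"
```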

Process Paired-End Reads

Input
for r1 in *_R1.fastq; do
  r2=${r1/_R1.fastq/_R2.fastq}
  sample=${r1/_R1.fastq/}
  
  if [ -f "$r2" ]; then
    echo "Processing pair: $sample"
    echo "  R1: $r1"
    echo "  R2: $r2"
  else
    echo "WARNING: Missing R2 for $sample"
  fi
done
Output
Processing pair: Sample_01
  R1: Sample_01_R1.fastq
  R2: Sample_01_R2.fastq
Processing pair: Sample_02
  R1: Sample_02_R1.fastq
  R2: Sample_02_R2.fastq

Match R1 and R2 files. Use string substitution to find pairs.
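The ${var/pattern/replacement} form replaces the first match wherever it occurs in the value. Suffix removal with % is an alternative that only ever touches the end of the name. A minimal sketch with one hypothetical file name:

```shell
r1=Sample_01_R1.fastq

# ${var%pattern} removes the shortest match of pattern from the END of the value
sample=${r1%_R1.fastq}

# Rebuild the mate's file name from the sample name
r2=${sample}_R2.fastq

echo "Sample: $sample"
echo "R2:     $r2"
```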

Variable Naming Best Practices

Naming Conventions
  1. Use descriptive names - sample_id not s
  2. Lowercase for custom variables - output_dir not OUTPUT_DIR
  3. UPPERCASE for environment variables - PATH, HOME
  4. Underscores for readability - gene_count not genecount
  5. Avoid names of shell keywords and commands - test, if, for, and while are legal variable names, but they make scripts confusing to read

Quick Reference

Variable Operations

var=value          # Assign variable (no spaces!)
$var or ${var}     # Use variable
"$var"             # Expand in double quotes
'$var'             # Literal in single quotes
$(command)         # Command substitution
${var:-default}    # Use default if var is unset or empty
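The default-value form is the only entry in this list that is not demonstrated earlier on the page. A brief sketch, using a hypothetical threads variable:

```shell
unset threads

# threads is unset, so the default after :- is used
threads_default=${threads:-4}

# Once threads has a value, the default is ignored
threads=16
threads_set=${threads:-4}

echo "Unset: $threads_default"
echo "Set:   $threads_set"
```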

Conditionals

if [ condition ]; then      # If statement
  commands
elif [ condition ]; then    # Else if
  commands
else                        # Else
  commands
fi                          # End if

File Tests

[ -f file ]    # Regular file exists
[ -d dir ]     # Directory exists
[ -s file ]    # File exists and is not empty
[ -r file ]    # File readable
[ -w file ]    # File writable
[ -x file ]    # File executable

Numeric Comparisons

[ $a -eq $b ]    # Equal
[ $a -ne $b ]    # Not equal
[ $a -lt $b ]    # Less than
[ $a -le $b ]    # Less or equal
[ $a -gt $b ]    # Greater than
[ $a -ge $b ]    # Greater or equal

Loops

for var in list; do         # For loop
  commands
done

while [ condition ]; do     # While loop
  commands
done

while read -r line; do      # Read file
  commands
done < file

Next Steps

You now have the building blocks for automation:

  • Variables to store data
  • Conditionals to make decisions
  • Loops to process multiple items

The next page covers turning these commands into reusable shell scripts with proper structure, arguments, and error handling.
