Introduction to UNIX for Biologists

Why This Matters for Biology

Modern biology generates massive datasets. A single RNA-seq experiment can produce hundreds of millions of sequencing reads, and whole-genome sequencing routinely generates many gigabytes of data per sample. You cannot point-and-click your way through terabytes of genomic data.

The UNIX command line is the universal interface for bioinformatics because it:

Scales effortlessly - Process one file or ten thousand with the same commands
Automates repetitive tasks - Script once, run on every sample
Handles massive files - Work with files too large to open in Excel
Integrates tools - Chain together specialized bioinformatics programs
Runs anywhere - From your laptop to HPC clusters to cloud platforms

Most bioinformatics tools are command-line only. Learning UNIX is not optional for modern biological research.
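
As a rough illustration of the "one file or ten thousand" point, the sketch below counts the reads in every gzipped FASTQ file in a directory with a single loop. The helper name count_reads, the demo directory, and the sample file names are hypothetical, and the sketch assumes standard four-line FASTQ records.

```shell
# Sketch: count reads in every gzipped FASTQ file in a directory.
# Assumes standard 4-line FASTQ records; count_reads is a hypothetical helper name.
count_reads() {
    dir="$1"
    for fq in "$dir"/*.fastq.gz; do
        [ -e "$fq" ] || continue              # skip if the glob matched nothing
        lines=$(gzip -dc "$fq" | wc -l)       # decompress to stdout, count lines
        printf '%s\t%d\n' "$(basename "$fq")" "$((lines / 4))"
    done
}

# Tiny made-up demo data so the sketch runs anywhere (two reads, then one read)
demo_dir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' | gzip > "$demo_dir/sampleA_R1.fastq.gz"
printf '@read1\nGGCC\n+\nIIII\n' | gzip > "$demo_dir/sampleB_R1.fastq.gz"

count_reads "$demo_dir"
```

The loop body is identical whether the directory holds two files or two thousand; only the glob match changes.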

Real-World Example: RNA-seq Analysis

Consider a typical RNA-seq workflow. Each step is normally run with command-line tools, because graphical interfaces struggle to handle data at this scale efficiently.

RNA-seq Quality Control and Alignment

fastqc -t 4 -o qc_reports/ sample_R1.fastq.gz sample_R2.fastq.gz

Output

Analysis complete for sample_R1.fastq.gz
Analysis complete for sample_R2.fastq.gz

Per base sequence quality: PASS
Per sequence quality scores: PASS
Adapter Content: WARN

The UNIX Philosophy

UNIX tools follow a simple philosophy:

Do one thing well, and make tools work together.

Instead of one massive program that does everything, UNIX provides small, focused tools that you combine. Each tool (grep, awk, cut) does one thing well. Combined together, they solve complex problems.
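
A minimal sketch of this idea, assuming a tab-delimited GFF3 annotation file: four small tools are chained to count how many features of each type the file contains. The demo file and its coordinates are made up for illustration.

```shell
# Sketch: count feature types in a GFF3 file by chaining small tools.
# The demo annotation below is made up; coordinates are illustrative only.
demo_gff=$(mktemp)
printf '##gff-version 3\n' > "$demo_gff"
printf 'Chr1\tdemo\tgene\t100\t900\t.\t+\t.\tID=geneA\n' >> "$demo_gff"
printf 'Chr1\tdemo\texon\t100\t400\t.\t+\t.\tParent=geneA\n' >> "$demo_gff"
printf 'Chr1\tdemo\texon\t600\t900\t.\t+\t.\tParent=geneA\n' >> "$demo_gff"

# grep drops comment lines, cut keeps column 3 (feature type),
# sort groups identical values, uniq -c counts each group,
# and awk tidies the output into "type<TAB>count"
feature_counts=$(grep -v '^#' "$demo_gff" | cut -f3 | sort | uniq -c | awk '{print $2 "\t" $1}')
printf '%s\n' "$feature_counts"
```

None of these tools knows anything about genomics; the pipeline works because each one reads and writes plain text.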

grep -c '^>' TAIR10_genes.fasta

Output

27,655 sequences (12.4 MB file size)

Count the sequences in a FASTA file by counting the header lines, which start with '>'. This Arabidopsis TAIR10 gene file contains 27,655 sequences.

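awk 'BEGIN{RS=">"; FS="\n"} NR>1 {len=0; for(i=2;i<=NF;i++) len+=length($i); if(len>1000) printf ">%s", $0}' sequences.fasta | grep -c '^>'

Output

15,432 sequences > 1000bp

Find sequences longer than 1000 bp using awk. The pipeline sets the record separator to '>' so that each FASTA entry is one record, and the field separator to a newline so that each line is one field. For each record it sums the lengths of the sequence lines (fields 2 onward), prints the records longer than 1000 bp, and counts them with grep.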
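
To see what a record-based approach like this computes, it can help to run it on a tiny file first. The sketch below prints the length of each sequence in a made-up two-entry FASTA file; the file contents and variable names are assumptions for illustration only.

```shell
# Sketch: print the length of each sequence in a multi-line FASTA file.
# The two demo entries are made up so the lengths are easy to check by eye.
demo_fasta=$(mktemp)
printf '>seqA demo\nACGTACGT\nACGT\n>seqB demo\nACG\n' > "$demo_fasta"

# RS=">" makes each FASTA entry one record; FS="\n" makes each line one field.
# Field 1 is the header line; fields 2..NF are the sequence lines.
# NR > 1 skips the empty record that precedes the first ">".
seq_lengths=$(awk 'BEGIN { RS=">"; FS="\n" }
    NR > 1 {
        len = 0
        for (i = 2; i <= NF; i++) len += length($i)
        split($1, header, " ")          # first word of the header is the ID
        print header[1] "\t" len
    }' "$demo_fasta")
printf '%s\n' "$seq_lengths"
```

Here seqA spans two lines of 8 and 4 bases, so its reported length is 12, which confirms that line breaks inside a sequence are not counted.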

What You Will Learn

This comprehensive guide takes you from complete beginner to advanced user:

Getting Started (You Are Here)

Why UNIX for bioinformatics
Terminal fundamentals
File system navigation
Basic commands

Text Processing

grep - Pattern matching in biological files
sed - Stream editing and text transformation
awk - Data processing and analysis
cut, sort, uniq - Data extraction and manipulation

Working with Biological Data

FASTA and FASTQ file processing
VCF variant file manipulation
GFF/GTF annotation files
SAM/BAM alignment files

Shell Scripting

Variables and control flow
Loops and conditionals
Functions and reusable code
Workflow automation

Advanced Topics

HPC cluster computing
Parallel processing
Performance optimization
Best practices

Prerequisites

None. This guide assumes no prior experience with UNIX or programming.

If you have used UNIX before, you can skip ahead to the sections that interest you. Each page is self-contained with clear examples.

For hands-on practice with immediate feedback, check out our interactive platform evomics-learn. This documentation provides comprehensive reference and advanced topics.

How to Use This Guide

Progressive Complexity

Each topic starts simple and builds to advanced usage. You can:

Read straight through for comprehensive learning
Jump to specific topics when you need them
Use as reference when you forget syntax

Code Examples

All examples use real biological data and are tested to work exactly as shown. You can copy and paste them directly into your terminal.

grep -c '^>' TAIR10_genes.fasta

Output

27,655 sequences

Count how many sequences are in a FASTA file. Every example in this guide uses real genomics data.

Biological Context

We never show generic examples with file.txt. Every example relates to real genomics workflows using standard file formats:

Common Biological File Formats

Example FASTA file (2 records):

>AT1G01010.1 | NAC001 | NAC domain protein | chr1:3631-5899 REVERSE
ATGGAGGATCAAGTTGGGTTTGGGTTCCGTCCGAACGACGAGGAGCTCGTTGGTCACTAT
CTCCGTAACAAAATCGAAGGAAACACTAGCCGCGACGTTGAAGTAGCCATCAGCGAGGTA
GCTCACGGCTTTGTCGGGCAGATCATTGAGCTAGTAGGAGGTTTCACGGGCATCAACCAA
>AT1G01020.1 | ARV1 | ARV1 family protein | chr1:6788-9130 FORWARD
ATGAACACGAAGGACCACCAGATCACCCAAGTACCACCGCCCCACCTCTCTTCCCACCAA
AATCCAATCAAAACGATAGTTTCTCCAACCAACCCATCTCCAACAACTTTAACTTCTTCT

Format Details

Header: Starts with '>'. Contains gene ID, symbol, description, and genomic location

Sequence: DNA/RNA/protein sequence in single-letter code, can span multiple lines
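
Because the header fields described above are separated by " | ", they can be pulled apart with standard tools. The following is a minimal sketch, assuming every header follows the ">ID | symbol | description | location" layout; real FASTA files vary, so the field positions are an assumption to check against your own data.

```shell
# Sketch: extract gene ID, symbol, and location from FASTA headers laid out as
# ">ID | symbol | description | location". The sequence line is shortened.
demo_fasta=$(mktemp)
printf '>AT1G01010.1 | NAC001 | NAC domain protein | chr1:3631-5899 REVERSE\nATGGAGGATCAAGTTGGG\n' > "$demo_fasta"

# grep keeps only header lines; awk splits them on " | ",
# strips the leading ">" from the ID, and prints fields 1, 2, and 4
header_fields=$(grep '^>' "$demo_fasta" | awk -F' [|] ' '{ sub(/^>/, "", $1); print $1 "\t" $2 "\t" $4 }')
printf '%s\n' "$header_fields"
```

The result is a tab-separated table that sorts and filters easily with cut, sort, and uniq.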

Practice While You Learn

The evomics-learn platform provides:

Hands-on exercises with real genomics data
Instant validation of your commands
Progressive difficulty
No installation required (runs in browser)

Use this documentation for comprehensive reference, then practice on evomics-learn to solidify your understanding.

Getting Help

Throughout this guide you will find:

Tips - Best practices and shortcuts
Warnings - Common mistakes to avoid
Definitions - Clear explanations of technical terms
Examples - Working code you can copy

UNIX commands and file names are case-sensitive on most systems: grep works, Grep typically does not.

Ready to Begin?

The next section covers terminal fundamentals: what the terminal is, how to navigate, and essential commands every bioinformatician needs.

Let's start your journey to command-line mastery!