Introduction to UNIX for Biologists

Why This Matters for Biology

Modern biology generates massive datasets. A single RNA-seq experiment produces hundreds of millions of sequencing reads. Whole genome sequencing generates gigabytes of data. You cannot point-and-click your way through terabytes of genomic data.

The UNIX command line is the universal interface for bioinformatics because it:

  • Scales effortlessly - Process one file or ten thousand with the same commands
  • Automates repetitive tasks - Script once, run on every sample
  • Handles massive files - Work with files too large to open in Excel
  • Integrates tools - Chain together specialized bioinformatics programs
  • Runs anywhere - From your laptop to HPC clusters to cloud platforms

Most bioinformatics tools are command-line only. Learning UNIX is not optional for modern biological research.
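As a rough illustration of the scaling and automation points above, the sketch below runs one unchanged command over every FASTA file in a directory. The file names and sequences are made up for the example; the loop would look the same for two files or ten thousand.

```shell
# Create two tiny, made-up FASTA files in a temporary directory.
workdir=$(mktemp -d)
printf '>g1\nACGT\n>g2\nGGCC\n' > "$workdir/sampleA.fasta"
printf '>g1\nACGT\n>g2\nGGCC\n>g3\nTTAA\n' > "$workdir/sampleB.fasta"

# The same grep command runs on every file matched by the glob.
report=$(for f in "$workdir"/*.fasta; do
  printf '%s\t%s\n' "$(basename "$f")" "$(grep -c '^>' "$f")"
done)

# Print one line per file: file name, then number of sequences.
printf '%s\n' "$report"

# Clean up the temporary files.
rm -r "$workdir"
```

Each output line pairs a file name with its sequence count, so the result can be passed straight to other tools such as sort or awk.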

Real-World Example: RNA-seq Analysis

Consider a typical RNA-seq workflow. In practice, each step is run with command-line tools, because graphical interfaces rarely handle data at this scale efficiently.

RNA-seq Quality Control and Alignment

fastqc -t 4 -o qc_reports/ sample_R1.fastq.gz sample_R2.fastq.gz
Output
Analysis complete for sample_R1.fastq.gz
Analysis complete for sample_R2.fastq.gz

Per base sequence quality: PASS
Per sequence quality scores: PASS
Adapter Content: WARN
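One practical detail when running the command above: FastQC expects the directory given to -o to exist already and will not create it. The sketch below is one cautious way to wrap the call. The guard around fastqc is an assumption added so the snippet does nothing harmful when the tool or the sample files are missing; the file and directory names are the ones from the example.

```shell
outdir="qc_reports"

# Create the report directory first; FastQC does not create it for you.
mkdir -p "$outdir"

# Only run FastQC if it is installed and both input files are present.
if command -v fastqc >/dev/null 2>&1 \
   && [ -f sample_R1.fastq.gz ] && [ -f sample_R2.fastq.gz ]; then
  fastqc -t 4 -o "$outdir" sample_R1.fastq.gz sample_R2.fastq.gz
else
  echo "fastqc or the input files are missing; nothing was run" >&2
fi
```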

The UNIX Philosophy

UNIX tools follow a simple philosophy:

UNIX Philosophy

Do one thing well, and make tools work together.

Instead of one massive program that does everything, UNIX provides small, focused tools that you combine. Each tool (grep, awk, cut) does one thing well. Combined together, they solve complex problems.

Input
grep -c '^>' TAIR10_genes.fasta
Output
27655

Count the sequences in a FASTA file by counting the header lines, which start with '>'. In this example, TAIR10_genes.fasta contains 27,655 sequences.

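Input
awk 'BEGIN{RS=">"; FS="\n"} NR>1 {len=0; for(i=2;i<=NF;i++) len+=length($i); if(len>1000) printf ">%s", $0}' sequences.fasta | grep -c '^>'
Output
15432

Find sequences longer than 1000bp using awk. This pipeline sets the record separator to '>' so that each FASTA entry becomes one record, and the field separator to a newline so that the header is field 1 and each sequence line is a later field. It sums the lengths of the sequence lines, keeps the records over 1000bp, and counts them with grep.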
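To see how a per-record length filter behaves, it can help to try it on a file small enough to check by hand. The following is a minimal sketch, assuming a made-up two-record FASTA with wrapped sequence lines and a threshold of 20bp instead of 1000bp. The sequence names, the threshold, and the variable names are illustrative only.

```shell
# Build a tiny FASTA file with wrapped sequence lines (made-up data).
tmpdir=$(mktemp -d)
fasta="$tmpdir/toy.fasta"
printf '>seqA short example\nACGTACGT\nACGT\n>seqB longer example\nACGTACGTACGT\nACGTACGTAC\n' > "$fasta"

# One line per record: sequence ID, then total length summed over all
# sequence lines. NR>1 skips the empty record before the first '>'.
lengths=$(awk 'BEGIN{RS=">"; FS="\n"}
  NR>1 {len=0; for(i=2;i<=NF;i++) len+=length($i);
        split($1, h, " "); print h[1] "\t" len}' "$fasta")
printf '%s\n' "$lengths"

# Count the records whose total length exceeds the threshold.
n_long=$(awk -v min=20 'BEGIN{RS=">"; FS="\n"}
  NR>1 {len=0; for(i=2;i<=NF;i++) len+=length($i); if (len>min) n++}
  END{print n+0}' "$fasta")
echo "$n_long"

rm -r "$tmpdir"
```

Here seqA totals 12bp and seqB totals 22bp, so only seqB passes the 20bp threshold. Summing every sequence line matters because FASTA sequences are often wrapped across several lines.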

What You Will Learn

This comprehensive guide takes you from complete beginner to advanced user:

Getting Started (You Are Here)

  • Why UNIX for bioinformatics
  • Terminal fundamentals
  • File system navigation
  • Basic commands

Text Processing

  • grep - Pattern matching in biological files
  • sed - Stream editing and text transformation
  • awk - Data processing and analysis
  • cut, sort, uniq - Data extraction and manipulation

Working with Biological Data

  • FASTA and FASTQ file processing
  • VCF variant file manipulation
  • GFF/GTF annotation files
  • SAM/BAM alignment files

Shell Scripting

  • Variables and control flow
  • Loops and conditionals
  • Functions and reusable code
  • Workflow automation

Advanced Topics

  • HPC cluster computing
  • Parallel processing
  • Performance optimization
  • Best practices

Prerequisites

None. This guide assumes no prior experience with UNIX or programming.

If you have used UNIX before, you can skip ahead to the sections that interest you. Each page is self-contained with clear examples.

For hands-on practice with immediate feedback, check out our interactive platform evomics-learn. This documentation provides comprehensive reference and advanced topics.

How to Use This Guide

Progressive Complexity

Each topic starts simple and builds to advanced usage. You can:

  • Read straight through for comprehensive learning
  • Jump to specific topics when you need them
  • Use as reference when you forget syntax

Code Examples

All examples use real biological data and are tested to work exactly as shown. You can copy and paste them directly into your terminal.

Input
grep -c '^>' TAIR10_genes.fasta
Output
27655

Count how many sequences are in a FASTA file. Every example in this guide uses real genomics data.

Biological Context

We never show generic examples with file.txt. Every example relates to real genomics workflows using standard file formats:

Common Biological File Formats

Examples of FASTA, FASTQ, and VCF formats with format-specific highlighting

FASTA, 2 records
>AT1G01010.1 | NAC001 | NAC domain protein | chr1:3631-5899 REVERSE
ATGGAGGATCAAGTTGGGTTTGGGTTCCGTCCGAACGACGAGGAGCTCGTTGGTCACTAT
CTCCGTAACAAAATCGAAGGAAACACTAGCCGCGACGTTGAAGTAGCCATCAGCGAGGTA
GCTCACGGCTTTGTCGGGCAGATCATTGAGCTAGTAGGAGGTTTCACGGGCATCAACCAA
>AT1G01020.1 | ARV1 | ARV1 family protein | chr1:6788-9130 FORWARD
ATGAACACGAAGGACCACCAGATCACCCAAGTACCACCGCCCCACCTCTCTTCCCACCAA
AATCCAATCAAAACGATAGTTTCTCCAACCAACCCATCTCCAACAACTTTAACTTCTTCT

Format Details

  • Header: Starts with '>'. Contains gene ID, symbol, description, and genomic location
  • Sequence: DNA/RNA/protein sequence in single-letter code, can span multiple lines
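Because the header fields are separated by " | ", they can be pulled apart with standard tools. The sketch below is one possible approach, using the two headers from the example with shortened sequences. It assumes every header follows the same "ID | symbol | description | location" layout, which may not hold for FASTA files from other sources.

```shell
# Write the two example records to a temporary file (sequences shortened).
tmpdir=$(mktemp -d)
fasta="$tmpdir/example.fasta"
cat > "$fasta" <<'EOF'
>AT1G01010.1 | NAC001 | NAC domain protein | chr1:3631-5899 REVERSE
ATGGAGGATCAAGTTGGGTTTGGGTTCCGTCCGAACGACGAGGAGCTCGTTGGTCACTAT
>AT1G01020.1 | ARV1 | ARV1 family protein | chr1:6788-9130 FORWARD
ATGAACACGAAGGACCACCAGATCACCCAAGTACCACCGCCCCACCTCTCTTCCCACCAA
EOF

# Keep header lines only, split them on " | ", strip the leading '>',
# and print the gene ID and symbol as two tab-separated columns.
table=$(grep '^>' "$fasta" \
  | awk -F' [|] ' '{sub(/^>/, "", $1); print $1 "\t" $2}')
printf '%s\n' "$table"

rm -r "$tmpdir"
```

The result is a two-column table of gene IDs and symbols, a handy starting point for joining sequence files with annotation tables.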

Practice While You Learn

The evomics-learn platform provides:

  • Hands-on exercises with real genomics data
  • Instant validation of your commands
  • Progressive difficulty
  • No installation required (runs in browser)

Use this documentation for comprehensive reference, then practice on evomics-learn to solidify your understanding.

Getting Help

Throughout this guide you will find:

  • Tips - Best practices and shortcuts
  • Warnings - Common mistakes to avoid
  • Definitions - Clear explanations of technical terms
  • Examples - Working code you can copy

UNIX commands are case-sensitive. grep works; Grep usually fails with a "command not found" error.

Ready to Begin?

The next section covers terminal fundamentals: what the terminal is, how to navigate, and essential commands every bioinformatician needs.

Let's start your journey to command-line mastery!

Further Reading