awk - Data Processing and Analysis

awk is a powerful text processing language built around fields (columns) and records (lines). For bioinformatics, awk excels at analyzing GFF files, filtering VCF variants, calculating statistics, and transforming any column-based data.

If your data has columns (tab-separated, comma-separated, or space-separated), awk is usually a good first tool to reach for. That covers a large share of everyday bioinformatics work.

Why awk for Bioinformatics?

Most genomics file formats are tabular:

  • GFF/GTF - Gene annotations with 9 columns
  • VCF - Variants with fixed columns plus samples
  • BED - Genomic coordinates with 3-12 columns
  • SAM - Alignments with 11+ columns
  • Count tables - Gene expression matrices

awk processes these files efficiently, performing calculations, filtering, and transformations that would otherwise require a custom script in another language.

Field Variables

Input
echo "Chr1 1000 2000 gene1" | awk '{print $1}'
Output
Chr1

$1 refers to the first field. By default, awk splits each line on runs of spaces and tabs.
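To illustrate default splitting, here is a minimal sketch using a made-up record that mixes leading spaces, a tab, and repeated spaces. The variable names are only for illustration. Any run of blanks counts as one separator, so the fields come out the same as in the tidy example above.

```shell
# Made-up record with leading spaces, a tab, and repeated spaces.
line="$(printf '  Chr1\t1000   2000 gene1')"

# Default splitting: leading blanks are ignored, and any run of
# spaces/tabs acts as a single separator.
first=$(printf '%s\n' "$line" | awk '{print $1}')
last=$(printf '%s\n' "$line" | awk '{print $4}')

echo "$first"   # Chr1
echo "$last"    # gene1
```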

Input
echo "Chr1 1000 2000 gene1" | awk '{print $4}'
Output
gene1

$4 is the fourth field, here the gene name in a BED-like record.

Input
echo "Chr1 1000 2000 gene1" | awk '{print $2, $3}'
Output
1000 2000

Print multiple fields. The comma inserts the output field separator (OFS), which is a single space by default.
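The role of the comma is easiest to see by comparison. The sketch below reuses the sample record and shows print with a comma, print without one, and the separator changed to a tab with -v OFS. The variable names are illustrative.

```shell
rec="Chr1 1000 2000 gene1"

# With a comma, the values are joined by OFS (a single space by default).
with_comma=$(echo "$rec" | awk '{print $2, $3}')

# Without a comma, the two values are concatenated with nothing between them.
no_comma=$(echo "$rec" | awk '{print $2 $3}')

# Setting OFS with -v changes the joiner, here to a tab for BED-style output.
tabbed=$(echo "$rec" | awk -v OFS='\t' '{print $1, $2, $3}')

echo "$with_comma"   # 1000 2000
echo "$no_comma"     # 10002000
echo "$tabbed"       # Chr1, 1000, 2000 separated by tabs
```

Tab-separated output is usually what downstream genomics tools expect, so setting OFS is worth the extra few characters when the result feeds another program.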

Input
echo "Chr1 1000 2000 gene1" | awk '{print $0}'
Output
Chr1 1000 2000 gene1

$0 refers to the entire input line, with all of its fields.

Pattern Matching

Input
awk '/Chr1/' genes.bed
Output
Chr1	1000	2000	gene1
Chr1	5000	6000	gene2

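Print every line that matches the regular expression /Chr1/. This works like grep, but with field processing available. The match can occur anywhere on the line, so /Chr1/ would also match Chr10; use $1 == "Chr1" to test the first column exactly.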
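A regex pattern and an exact field comparison can give different answers on chromosome names. The sketch below uses a made-up three-line BED-like file (not the genes.bed from the example) that includes Chr10.

```shell
# Made-up BED-like data; Chr10 is included to show the difference.
bed=$(mktemp)
printf 'Chr1\t1000\t2000\tgene1\nChr10\t300\t900\tgene7\nChr2\t7000\t8000\tgene4\n' > "$bed"

# Regex match anywhere on the line: Chr10 also contains "Chr1".
regex_hits=$(awk '/Chr1/' "$bed" | wc -l | tr -d ' ')

# Exact string comparison on field 1 only.
exact_hits=$(awk '$1 == "Chr1"' "$bed" | wc -l | tr -d ' ')

echo "regex: $regex_hits, exact: $exact_hits"   # regex: 2, exact: 1
rm -f "$bed"
```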

Input
awk '$3 > 5000' genes.bed
Output
Chr1	5000	6000	gene2
Chr2	7000	8000	gene4

Numeric comparison. Keep lines where column 3 (the end coordinate) is greater than 5000.
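A condition can also be paired with an action block, and several conditions can be joined with &&. The sketch below uses made-up rows modeled on the outputs shown above; the temporary file stands in for genes.bed.

```shell
bed=$(mktemp)
printf 'Chr1\t1000\t2000\tgene1\nChr1\t5000\t6000\tgene2\nChr2\t7000\t8000\tgene4\n' > "$bed"

# Pattern plus action: print only the name (column 4) of rows whose end exceeds 5000.
names=$(awk '$3 > 5000 {print $4}' "$bed")

# Two conditions joined with &&: on Chr1 and with an end above 5000.
chr1_only=$(awk '$1 == "Chr1" && $3 > 5000 {print $4}' "$bed")

echo "$names"       # gene2 and gene4, one per line
echo "$chr1_only"   # gene2
rm -f "$bed"
```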

Input
awk '$3 == "exon"' annotations.gff | head -n 3
Output
Chr1	TAIR10	exon	3631	3913	.	+	.	Parent=AT1G01010.1
Chr1	TAIR10	exon	3996	4276	.	+	.	Parent=AT1G01010.1

String match. Extract only exon features from a GFF file, where column 3 holds the feature type.
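GFF files typically start with header lines beginning with #, and column 9 can contain spaces, so splitting on tabs only is often the safer choice. The sketch below assumes a toy file written to a temporary path; its records are illustrative, in the style of the output above.

```shell
gff=$(mktemp)
# Toy GFF: one header line, one gene, one exon.
printf '##gff-version 3\n' > "$gff"
printf 'Chr1\tTAIR10\tgene\t3631\t5899\t.\t+\t.\tID=AT1G01010;Note=toy gene\n' >> "$gff"
printf 'Chr1\tTAIR10\texon\t3631\t3913\t.\t+\t.\tParent=AT1G01010.1\n' >> "$gff"

# -F "\t" splits on tabs only. The header line has no third field, so it never matches.
exon_coords=$(awk -F "\t" '$3 == "exon" {print $4, $5}' "$gff")

# With tab splitting, column 9 keeps its embedded space intact.
gene_attr=$(awk -F "\t" '$3 == "gene" {print $9}' "$gff")

echo "$exon_coords"   # 3631 3913
echo "$gene_attr"     # ID=AT1G01010;Note=toy gene
rm -f "$gff"
```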

Calculations

Input
awk '{print $3 - $2}' genes.bed | head -n 3
Output
1000
1000
2343

Calculate gene length: end position minus start position. BED coordinates are 0-based and half-open, so this difference is the feature length.
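The length formula depends on the coordinate convention. As a hedged sketch with made-up single records: BED intervals are 0-based and half-open, so the length is end minus start, while GFF intervals are 1-based and inclusive, so one is added.

```shell
# BED-like record: start in column 2, end in column 3.
bed_len=$(printf 'Chr1\t1000\t2000\tgene1\n' | awk '{print $3 - $2}')

# GFF-like record: start in column 4, end in column 5, both inclusive.
gff_len=$(printf 'Chr1\tTAIR10\texon\t3631\t3913\t.\t+\t.\tParent=AT1G01010.1\n' |
  awk -F "\t" '{print $5 - $4 + 1}')

echo "$bed_len"   # 1000
echo "$gff_len"   # 283
```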

Input
awk '{sum += $2} END {print sum}' read_counts.txt
Output
52345678

Sum column 2. The END block runs once, after all lines have been read, and prints the final total.
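To see what the END block changes, compare printing inside the main block with printing in END. The sketch below uses a made-up three-gene count table in place of read_counts.txt.

```shell
counts=$(mktemp)
printf 'geneA\t10\ngeneB\t25\ngeneC\t7\n' > "$counts"

# The main block runs once per line, so printing there shows a running total.
running=$(awk '{sum += $2; print sum}' "$counts")

# The END block runs once after the last line, so only the final total appears.
total=$(awk '{sum += $2} END {print sum}' "$counts")

echo "$running"   # 10, 35, 42 on separate lines
echo "$total"     # 42
rm -f "$counts"
```

Note that sum needs no initialization: an unset awk variable behaves as 0 in arithmetic.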

Input
awk '{sum += $5; count++} END {print sum/count}' quality_scores.txt
Output
36.7

Calculate the mean of column 5. Track the sum and the line count, then divide in the END block.
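One practical wrinkle is that dividing by count fails when the input is empty. The sketch below, using made-up scores in column 5, guards the division and uses printf to fix the number of decimals. The file layout is an assumption, not the real quality_scores.txt.

```shell
scores=$(mktemp)
printf 'r1 a b c 30\nr2 a b c 40\nr3 a b c 37\n' > "$scores"

# Guard against empty input, and format the mean to two decimals.
mean=$(awk '{sum += $5; count++} END {if (count > 0) printf "%.2f\n", sum/count}' "$scores")

# The same program on empty input prints nothing instead of raising a division error.
empty_mean=$(awk '{sum += $5; count++} END {if (count > 0) printf "%.2f\n", sum/count}' /dev/null)

echo "$mean"                   # 35.67
echo "empty: [$empty_mean]"    # empty: []
rm -f "$scores"
```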

Quick Reference

Common awk patterns for bioinformatics:

# Print columns
awk '{print $1}' file
awk '{print $1, $3}' file

# Pattern matching
awk '/pattern/' file
awk '$3 > 1000' file
awk '$3 == "exon"' file

# Calculations
awk '{sum += $2} END {print sum}' file
awk '{print $3 - $2}' file

# Field separator
awk -F "," '{print $1}' file.csv
awk -F "\t" '{print $1}' file.tsv
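As a short illustration of the field separator option, the sketch below applies -F "," to a made-up two-row count table. It assumes a simple CSV with no quoted fields, because plain -F "," does not handle commas inside quotes.

```shell
csv=$(mktemp)
# Made-up counts: gene id, then two sample columns.
printf 'geneX,10,25\ngeneY,0,7\n' > "$csv"

# Split on commas and print the first column.
ids=$(awk -F "," '{print $1}' "$csv")

# Fields split by -F can be used in arithmetic like any other field.
totals=$(awk -F "," '{print $1, $2 + $3}' "$csv")

echo "$ids"      # geneX and geneY, one per line
echo "$totals"   # geneX 35 and geneY 7, one per line
rm -f "$csv"
```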

Next Steps

This page provides a foundation for awk. For comprehensive coverage including associative arrays, BEGIN blocks, multiple file processing, and complex bioinformatics examples, see the evomics-learn interactive tutorials.

Master awk alongside grep and sed for complete text processing capabilities.

Further Reading