awk - Data Processing and Analysis
awk is a powerful text processing language built around fields (columns) and records (lines). For bioinformatics, awk excels at analyzing GFF files, filtering VCF variants, calculating statistics, and transforming any column-based data.
If your data is organized in columns (tab-separated, comma-separated, or space-separated), awk is usually a good first choice, which is why it is a staple of bioinformatics workflows.
Why awk for Bioinformatics?
Most genomics file formats are tabular:
- GFF/GTF - Gene annotations with 9 columns
- VCF - Variants with fixed columns plus samples
- BED - Genomic coordinates with 3-12 columns
- SAM - Alignments with 11+ columns
- Count tables - Gene expression matrices
awk processes these files efficiently, performing calculations, filtering, and transformations that would require custom scripts in other languages.
Field Variables
echo "Chr1 1000 2000 gene1" | awk '{print $1}'
Chr1
$1 refers to the first field. awk splits on whitespace by default.

echo "Chr1 1000 2000 gene1" | awk '{print $4}'
gene1
$4 is the fourth field. Here it prints the gene name from a BED-like line.

echo "Chr1 1000 2000 gene1" | awk '{print $2, $3}'
1000 2000
Print multiple fields. The comma inserts the output field separator, which is a single space by default.

echo "Chr1 1000 2000 gene1" | awk '{print $0}'
Chr1 1000 2000 gene1
$0 refers to the entire line, including all fields.
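
As a small sketch of how field splitting behaves, the snippet below runs one made-up record through awk with the default separator and again with an explicit tab separator. The sample line and the shell variable names are assumptions for illustration only. NF is awk's built-in count of fields in the current record, and OFS is the output field separator that the comma in print inserts.

```shell
# Hypothetical tab-separated record whose last column contains a space.
line="$(printf 'Chr1\t1000\t2000\tgene one')"

# Default splitting: any run of spaces or tabs separates fields,
# so "gene one" is split into two fields.
default_nf="$(printf '%s\n' "$line" | awk '{print NF}')"

# Tab splitting: only tabs separate fields, so "gene one" stays intact.
tab_nf="$(printf '%s\n' "$line" | awk -F '\t' '{print NF}')"
tab_name="$(printf '%s\n' "$line" | awk -F '\t' '{print $4}')"

# Setting OFS to a tab keeps the output tab-separated as well.
tab_pair="$(printf '%s\n' "$line" | awk -F '\t' -v OFS='\t' '{print $2, $3}')"

echo "fields with default splitting: $default_nf"
echo "fields with tab splitting: $tab_nf"
echo "column 4 with tab splitting: $tab_name"
echo "columns 2 and 3, tab-joined: $tab_pair"
```

Setting -F '\t' explicitly is usually the safer choice for formats such as GFF, where the last column may itself contain spaces.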
Pattern Matching
awk '/Chr1/' genes.bed
Chr1 1000 2000 gene1
Chr1 5000 6000 gene2
Match lines containing a pattern. Like grep, but with field processing available.

awk '$3 > 5000' genes.bed
Chr1 5000 6000 gene2
Chr2 7000 8000 gene4
Numeric comparison. Keep lines whose column 3 is greater than 5000.

awk '$3 == "exon"' annotations.gff | head -n 3
Chr1 TAIR10 exon 3631 3913 . + . Parent=AT1G01010.1
Chr1 TAIR10 exon 3996 4276 . + . Parent=AT1G01010.1
String match. Extract only exon features from a GFF file.
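
The pattern styles above can also be paired with an action in the form pattern {action}, and several conditions can be joined. The following is a minimal sketch using a small made-up BED-like file written to a temporary directory; the file contents and variable names are hypothetical. It also shows one caveat of whole-line regular expressions: /Chr1/ matches Chr10 too, whereas an exact comparison on $1 does not.

```shell
# Write a tiny, made-up BED-like file to a temporary directory.
tmpdir="$(mktemp -d)"
bed="$tmpdir/genes.bed"
printf '%s\n' \
  'Chr1 1000 2000 gene1' \
  'Chr1 5000 6000 gene2' \
  'Chr10 100 900 gene3' \
  'Chr2 7000 8000 gene4' > "$bed"

# /Chr1/ is a regular expression tested against the whole line,
# so it also matches the Chr10 record.
regex_hits="$(awk '/Chr1/ {print $4}' "$bed" | paste -sd ' ' -)"

# $1 == "Chr1" compares the first field as an exact string.
exact_hits="$(awk '$1 == "Chr1" {print $4}' "$bed" | paste -sd ' ' -)"

# Conditions can be joined with && (and) or || (or).
combined_hits="$(awk '$1 == "Chr1" && $3 > 5000 {print $4}' "$bed" | paste -sd ' ' -)"

echo "regex:    $regex_hits"
echo "exact:    $exact_hits"
echo "combined: $combined_hits"

# Remove the temporary directory.
rm -rf "$tmpdir"
```

When the chromosome name matters, comparing the field directly tends to be more predictable than matching the whole line.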
Calculations
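awk '{print $3 - $2}' genes.bed | head -n 3
1000
1000
2343
Calculate gene length: end position minus start position. This works for BED, whose coordinates are 0-based and half-open; for 1-based inclusive formats such as GFF, the length is end minus start plus 1.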
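
A computed value can be stored in a variable and reused within the same action. The sketch below, run on made-up BED-like records, prints each gene name next to its length and then keeps only genes longer than 1000 bases. The records, the threshold, and the variable names are assumptions for illustration.

```shell
# Made-up BED-like records: chromosome, start, end, name.
records="$(printf '%s\n' \
  'Chr1 1000 2000 gene1' \
  'Chr1 5000 6000 gene2' \
  'Chr2 7000 9343 gene4')"

# Print the gene name alongside its length (end minus start).
all_lengths="$(printf '%s\n' "$records" | awk '{len = $3 - $2; print $4, len}')"

# Keep only genes longer than 1000 bases.
long_genes="$(printf '%s\n' "$records" | awk '{len = $3 - $2; if (len > 1000) print $4, len}')"

echo "$all_lengths"
echo "long genes: $long_genes"
```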

awk '{sum += $2} END {print sum}' read_counts.txt
52345678
Sum column 2. The END block runs after all lines have been read and prints the final total.

awk '{sum += $5; count++} END {print sum/count}' quality_scores.txt
36.7
Calculate the mean: track the sum and the count, then divide in the END block.
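
One caveat with the mean calculation: if the input is empty, count stays at zero and the division fails. A minimal sketch of a guarded version follows; mean_col is a hypothetical helper name, the sample records are made up, and printing "NA" for empty input is just one possible convention.

```shell
# mean_col is a hypothetical helper: it prints the mean of the given
# column read from standard input, or "NA" when there are no records.
mean_col() {
  awk -v col="$1" '
    { sum += $col; count++ }
    END {
      if (count > 0) {
        printf "%.1f\n", sum / count
      } else {
        print "NA"
      }
    }'
}

# Three made-up records with a score in column 5.
scores_mean="$(printf '%s\n' 'r1 a b c 30' 'r2 a b c 40' 'r3 a b c 41' | mean_col 5)"

# Empty input: no records are read, so the guard prints NA.
empty_mean="$(printf '' | mean_col 5)"

echo "mean: $scores_mean"
echo "empty input: $empty_mean"
```

The -v option passes the column number into awk as a variable, so the same program can be reused for any column.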
Quick Reference
Common awk patterns for bioinformatics:
# Print columns
awk '{print $1}' file
awk '{print $1, $3}' file
# Pattern matching
awk '/pattern/' file
awk '$3 > 1000' file
awk '$3 == "exon"' file
# Calculations
awk '{sum += $2} END {print sum}' file
awk '{print $3 - $2}' file
# Field separator
awk -F "," '{print $1}' file.csv
awk -F "\t" '{print $1}' file.tsv
Next Steps
This page provides a foundation for awk. For comprehensive coverage including associative arrays, BEGIN blocks, multiple file processing, and complex bioinformatics examples, see the evomics-learn interactive tutorials.
Master awk alongside grep and sed for complete text processing capabilities.
Further Reading
- Data Manipulation Tools - Next topic
- GNU awk Manual
- awk One-Liners