awk - Data Processing and Analysis

awk is a powerful text processing language built around fields (columns) and records (lines). For bioinformatics, awk excels at analyzing GFF files, filtering VCF variants, calculating statistics, and transforming any column-based data.

If your data has columns (tab-separated, comma-separated, or space-separated), awk is usually a good first choice: it splits every line into fields for you, so filters and calculations can refer to columns directly.

Why awk for Bioinformatics?

Most genomics file formats are tabular:

GFF/GTF - Gene annotations with 9 columns
VCF - Variants with 8 fixed columns, plus FORMAT and per-sample columns
BED - Genomic coordinates with 3-12 columns
SAM - Alignments with 11+ columns
Count tables - Gene expression matrices

awk processes these files efficiently, performing calculations, filtering, and transformations that would require custom scripts in other languages.

Field Variables

Input

echo "Chr1 1000 2000 gene1" | awk '{print $1}'

Output

Chr1

$1 refers to the first field. By default awk splits each line on any run of spaces or tabs.
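Default splitting can surprise you when a tab-separated field itself contains a space. The sketch below uses a made-up record (the "protein kinase" description is purely illustrative) and NF, awk's built-in count of fields on the current line, to compare default splitting with tab-only splitting via -F.

```shell
# Hypothetical tab-separated record whose fourth field contains a space.
line=$(printf 'Chr1\t1000\t2000\tprotein kinase')

# Default splitting: any run of spaces or tabs separates fields,
# so "protein kinase" is broken into $4 and $5.
printf '%s\n' "$line" | awk '{print NF, $4}'           # 5 protein

# Tab-only splitting keeps the fourth field intact.
printf '%s\n' "$line" | awk -F '\t' '{print NF, $4}'   # 4 protein kinase
```

For real GFF, VCF, BED, or SAM files, passing -F '\t' is usually the safer habit, since free-text columns may contain spaces.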

Input

echo "Chr1 1000 2000 gene1" | awk '{print $4}'

Output

gene1

$4 is the fourth field, which holds the gene name in this BED-like line.

Input

echo "Chr1 1000 2000 gene1" | awk '{print $2, $3}'

Output

1000 2000

Print multiple fields. The comma inserts the output field separator (OFS) between them, which is a single space by default.
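A short sketch of how the comma behaves, reusing the same sample line. The -v option sets an awk variable before the program runs; setting OFS to a tab here is just one illustrative choice.

```shell
# With a comma, print joins the values using OFS (a space by default).
echo "Chr1 1000 2000 gene1" | awk '{print $2, $3}'                # 1000 2000

# Without the comma, the two values are concatenated directly.
echo "Chr1 1000 2000 gene1" | awk '{print $2 $3}'                 # 10002000

# Setting OFS to a tab produces tab-separated output.
echo "Chr1 1000 2000 gene1" | awk -v OFS='\t' '{print $2, $3}'    # 1000<TAB>2000
```

Tab-separated output is usually what downstream genomics tools expect, so setting OFS is worth doing whenever the result feeds another program.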

Input

echo "Chr1 1000 2000 gene1" | awk '{print $0}'

Output

Chr1 1000 2000 gene1

$0 refers to the entire line. Includes all fields.

Pattern Matching

Input

awk '/Chr1/' genes.bed

Output

Chr1	1000	2000	gene1
Chr1	5000	6000	gene2

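Match lines containing a pattern anywhere on the line, like grep, but with field processing available. Note that /Chr1/ also matches Chr10 or Chr11; use $1 == "Chr1" to match the chromosome name exactly.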
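The difference between a line-wide regex and a test on a single field is easiest to see on a tiny file. The file name demo.bed and its three records are made up for this sketch.

```shell
# Hypothetical BED-like demo file; names and coordinates are invented.
printf 'Chr1\t1000\t2000\tgene1\nChr10\t300\t900\tgene7\nChr2\t100\t500\tChr1_like\n' > demo.bed

# Regex anywhere on the line: all three records match
# (Chr10 contains "Chr1", and so does the name Chr1_like).
awk '/Chr1/' demo.bed

# Exact string comparison on field 1: only the Chr1 record.
awk -F '\t' '$1 == "Chr1"' demo.bed

# Regex restricted to field 1 and anchored at both ends: same result.
awk -F '\t' '$1 ~ /^Chr1$/' demo.bed
```

The ~ operator applies a regex to one field instead of the whole line, which is handy when you need a pattern rather than an exact string.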

Input

awk '$3 > 5000' genes.bed

Output

Chr1	5000	6000	gene2
Chr2	7000	8000	gene4

Numeric comparison. Keep lines whose column 3 is greater than 5000; a pattern with no action prints the whole matching line.

Input

awk '$3 == "exon"' annotations.gff | head -n 3

Output

Chr1	TAIR10	exon	3631	3913	.	+	.	Parent=AT1G01010.1
Chr1	TAIR10	exon	3996	4276	.	+	.	Parent=AT1G01010.1

String match. Extract only exon features from a GFF file, where column 3 holds the feature type.
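Conditions can be combined with && (and) or || (or), and a pattern can be followed by an action in braces that prints only selected fields. A minimal sketch, assuming a small demo file named demo.gff; apart from the first exon line, which is taken from the output above, the rows are illustrative.

```shell
# Hypothetical GFF demo file: one header line, one gene, two exons.
printf '##gff-version 3\n' > demo.gff
printf 'Chr1\tTAIR10\tgene\t3631\t5899\t.\t+\t.\tID=AT1G01010\n' >> demo.gff
printf 'Chr1\tTAIR10\texon\t3631\t3913\t.\t+\t.\tParent=AT1G01010.1\n' >> demo.gff
printf 'Chr1\tTAIR10\texon\t6788\t7069\t.\t-\t.\tParent=AT1G01020.1\n' >> demo.gff

# Exons on the plus strand (column 7): print sequence name, start, end.
awk -F '\t' '$3 == "exon" && $7 == "+" {print $1, $4, $5}' demo.gff
# Chr1 3631 3913
```

The ## header line is skipped automatically here because its third field is empty and so never equals "exon".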

Calculations

Input

awk '{print $3 - $2}' genes.bed | head -n 3

Output

1000
1000
2343

Calculate gene length: end position minus start position. This is exact for BED, where the start is 0-based and the end is exclusive.
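The length formula depends on the coordinate convention. BED uses 0-based starts with an exclusive end, while GFF uses 1-based coordinates with an inclusive end, so GFF lengths need a +1. A brief sketch using the sample records from earlier on this page:

```shell
# BED (0-based start, exclusive end): length = end - start
printf 'Chr1\t1000\t2000\tgene1\n' |
  awk -F '\t' '{print $4, $3 - $2}'
# gene1 1000

# GFF (1-based start, inclusive end): length = end - start + 1
printf 'Chr1\tTAIR10\texon\t3631\t3913\t.\t+\t.\tParent=AT1G01010.1\n' |
  awk -F '\t' '{print $5 - $4 + 1}'
# 283
```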

Input

awk '{sum += $2} END {print sum}' read_counts.txt

Output

52345678

Sum column 2. END block prints the final total after all lines.
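To make the accumulator pattern concrete, here is a sketch on a three-line file. The file name demo_counts.txt and the gene names and counts are invented for illustration.

```shell
# Hypothetical two-column count table: gene name, read count.
printf 'geneA\t120\ngeneB\t30\ngeneC\t50\n' > demo_counts.txt

# sum starts out empty (treated as 0) and grows on every line;
# END runs once after the last line has been read.
awk -F '\t' '{sum += $2} END {print sum}' demo_counts.txt      # 200

# Printing inside the main block shows the running total instead.
awk -F '\t' '{sum += $2; print $1, sum}' demo_counts.txt
# geneA 120
# geneB 150
# geneC 200

# On empty input sum is never set, so "print sum" would print a blank line;
# adding 0 forces a numeric result.
awk '{sum += $2} END {print sum + 0}' /dev/null                # 0
```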

Input

awk '{sum += $5; count++} END {print sum/count}' quality_scores.txt

Output

36.7

Calculate the mean of column 5. Track sum and count on every line, then divide in the END block.
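Two practical refinements are worth sketching: guarding against empty input, where count would be zero and the division would fail, and using printf to control the number of decimals. The file demo_scores.txt and its values are made up for this example.

```shell
# Hypothetical whitespace-separated file with a score in column 5.
printf 'r1 a b c 30\nr2 a b c 40\nr3 a b c 41\n' > demo_scores.txt

# Divide only when at least one line was read; print two decimals.
awk '{sum += $5; count++} END {if (count > 0) printf "%.2f\n", sum / count}' demo_scores.txt
# 37.00

# On empty input the guard skips the division and nothing is printed.
awk '{sum += $5; count++} END {if (count > 0) printf "%.2f\n", sum / count}' /dev/null
```

printf is also the safer choice for large non-integer results, because plain print formats them with about six significant digits by default.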

Quick Reference

Common awk patterns for bioinformatics:

# Print columns
awk '{print $1}' file
awk '{print $1, $3}' file
 
# Pattern matching
awk '/pattern/' file
awk '$3 > 1000' file
awk '$3 == "exon"' file
 
# Calculations
awk '{sum += $2} END {print sum}' file
awk '{print $3 - $2}' file
 
# Field separator
awk -F "," '{print $1}' file.csv
awk -F "\t" '{print $1}' file.tsv

Next Steps

This page provides a foundation for awk. For comprehensive coverage including associative arrays, BEGIN blocks, multiple file processing, and complex bioinformatics examples, see the evomics-learn interactive tutorials.

Master awk alongside grep and sed for complete text processing capabilities.