Pattern Matching with grep
The grep command searches files for patterns. For bioinformatics, grep is essential - it lets you extract specific sequences, filter variants, find genes of interest, and search through massive files without loading them into memory.
grep can search through gigabytes of data in seconds. Learn grep well, and you'll use it daily.
Basic grep
At its simplest, grep finds lines containing a pattern:
grep 'ATGGCG' sequences.fasta
ATGGCGATGGCTTCACCTTCTTCTCCCCTCGCCGCCTTCCTCTCCAACTCCTTCAGAGAG
Find all lines containing the pattern 'ATGGCG'. grep prints only the matching lines: the FASTA header above a matching sequence line is not printed unless it contains the pattern too.
By default grep treats the pattern as a basic regular expression, so a pattern made only of letters and digits matches itself literally. It prints every line that contains a match.
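To see this behavior on something small, here is a minimal sketch. The two-record FASTA file and its gene names are made up for illustration.

```shell
# Build a tiny two-record FASTA file (hypothetical data) in a temp directory.
tmp=$(mktemp -d)
printf '%s\n' \
  '>gene1 demo record' \
  'ATGGAGGATCAAGTT' \
  '>gene2 demo record' \
  'ATGGCGATGGCTTCA' > "$tmp/demo.fasta"

# Print only the lines that contain ATGGCG.
matches=$(grep 'ATGGCG' "$tmp/demo.fasta")
echo "$matches"

rm -rf "$tmp"
```

Only the second sequence line is printed. To get the header as well, you need context options such as -B, covered below.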
Syntax
grep 'pattern' filename
Basic grep syntax: the pattern (in quotes) followed by the name of the file to search.
Always quote your patterns. Special characters like $, *, and | have meaning to the shell. Quotes protect your pattern from shell interpretation.
Essential grep Options
-i: Case-Insensitive Search
grep -i 'kinase' gene_annotations.gff
Chr1 TAIR10 gene 3631 5899 . + . ID=AT1G01010;Name=NAC001;Note=protein kinase activity
Chr1 TAIR10 gene 6788 9130 . + . ID=AT1G01020;Name=ARV1;Note=serine/threonine kinase
Chr2 TAIR10 gene 1025 3455 . - . ID=AT2G01010;Name=PDK1;Note=Pyruvate dehydrogenase kinase
The -i flag ignores case. Finds 'kinase', 'Kinase', 'KINASE', etc.
Without -i, grep is case-sensitive:
grep 'Kinase' gene_annotations.gff
No output: every instance in this file is lowercase 'kinase', and grep is case-sensitive by default.
-v: Invert Match (Exclude Lines)
grep -v '^#' variants.vcf | head -n 5
Chr1 12345 rs123 A G 99 PASS DP=52 GT:DP 0/1:52
Chr1 23456 . C T 87 PASS DP=48 GT:DP 1/1:48
Chr2 34567 rs456 G A 95 PASS DP=65 GT:DP 0/1:65
Chr2 45678 . T C 92 PASS DP=58 GT:DP 0/1:58
Chr3 56789 rs789 C G 88 PASS DP=45 GT:DP 0/1:45
The -v flag inverts the match and shows the lines that DON'T match. Here we exclude header lines (starting with #) from a VCF file.
Use grep -v to remove unwanted lines. Common use: grep -v '^#' removes comments, grep -v '^$' removes blank lines.
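The two common uses can be chained. The sketch below assumes a made-up, space-separated VCF-like file with two comment lines and one blank line.

```shell
# Hypothetical VCF-like file: two header lines, one blank line, two records.
tmp=$(mktemp -d)
printf '%s\n' \
  '##fileformat=VCFv4.2' \
  '#CHROM POS ID' \
  'Chr1 12345 rs123' \
  '' \
  'Chr2 34567 rs456' > "$tmp/demo.vcf"

# First drop comment lines, then drop empty lines.
data=$(grep -v '^#' "$tmp/demo.vcf" | grep -v '^$')
echo "$data"

rm -rf "$tmp"
```

Only the two data records remain.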
-c: Count Matches
grep -c '^>' sequences.fasta
27655
The -c flag counts matching lines instead of printing them. Count FASTA sequences by counting headers (lines starting with >).
grep -c 'high_quality' alignment_summary.txt
42567890
Count the lines tagged 'high_quality' in an alignment report. Note that -c counts matching lines, not individual occurrences: a line with two matches is counted once.
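The difference between counting lines and counting occurrences is easy to check on a toy file. This sketch uses made-up sequences and the -o option (print each match on its own line), which GNU and BSD grep provide but POSIX does not require.

```shell
# Hypothetical three-record FASTA file.
tmp=$(mktemp -d)
printf '%s\n' '>seq1' 'ATGATGCC' '>seq2' 'GGATGTTT' '>seq3' 'CCCCCCCC' > "$tmp/demo.fasta"

# Number of records: one header line per record.
n_records=$(grep -c '^>' "$tmp/demo.fasta")

# Number of LINES that contain ATG at least once.
n_lines=$(grep -c 'ATG' "$tmp/demo.fasta")

# Number of (non-overlapping) OCCURRENCES of ATG: one output line per match.
n_hits=$(grep -o 'ATG' "$tmp/demo.fasta" | wc -l | tr -d ' ')

echo "records=$n_records lines=$n_lines hits=$n_hits"

rm -rf "$tmp"
```

Here the first sequence contains ATG twice, so the occurrence count is higher than the line count.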
-n: Show Line Numbers
grep -n 'stop_codon' genes.gtf | head -n 3
1234:Chr1 TAIR10 stop_codon 5897 5899 . - 0 gene_id=AT1G01010
5678:Chr1 TAIR10 stop_codon 9128 9130 . + 0 gene_id=AT1G01020
9012:Chr1 TAIR10 stop_codon 13712 13714 . - 0 gene_id=AT1G01030
The -n flag adds line numbers before each match. Useful for referencing specific locations in large files.
-A, -B, -C: Context Lines
Get context around matches:
grep -A 2 '^>AT1G01010' arabidopsis.fasta
>AT1G01010.1 | NAC001 | NAC domain protein | chr1:3631-5899 REVERSE
ATGGAGGATCAAGTTGGGTTTGGGTTCCGTCCGAACGACGAGGAGCTCGTTGGTCACTAT
CTCCGTAACAAAATCGAAGGAAACACTAGCCGCGACGTTGAAGTAGCCATCAGCGAGGTA
The -A flag shows lines After the match. -A 2 shows the matching line plus 2 lines after it. Here that gives the header plus the first 2 sequence lines.
grep -B 1 'MAPQ:60' alignments.sam | head -n 6
SRR001666.1234 0 Chr1 3631 60 72M * 0 0 ATCGATCG...
SRR001666.1234 0 Chr1 3631 60 72M * 0 0 ... MAPQ:60
--
SRR001666.5678 0 Chr1 8456 60 72M * 0 0 GCTAGCTA...
SRR001666.5678 0 Chr1 8456 60 72M * 0 0 ... MAPQ:60
The -B flag shows lines Before the match. Useful for seeing what precedes interesting patterns. The -- separates match groups. Note that standard SAM stores mapping quality as a number in the fifth column rather than as 'MAPQ:60' text, so this pattern only applies to files that carry such a literal tag.
grep -C 1 'ERROR' pipeline.log
[2025-11-20 15:23:45] Processing sample_03
[2025-11-20 15:23:48] ERROR: Failed to open input file
[2025-11-20 15:23:48] Attempting retry with alternative path
The -C flag shows Context (lines before AND after). -C 1 shows 1 line before and 1 line after each match.
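The three context options can be compared side by side on a toy log. The log lines below are invented for the example.

```shell
# Hypothetical five-line log with one error in the middle.
tmp=$(mktemp -d)
printf '%s\n' \
  'step 1 ok' \
  'step 2 ok' \
  'ERROR: step 3 failed' \
  'retrying step 3' \
  'step 4 ok' > "$tmp/demo.log"

after=$(grep -A 1 'ERROR' "$tmp/demo.log")    # match + 1 line after
before=$(grep -B 1 'ERROR' "$tmp/demo.log")   # 1 line before + match
around=$(grep -C 1 'ERROR' "$tmp/demo.log")   # 1 line before + match + 1 line after

printf 'after:\n%s\n\nbefore:\n%s\n\naround:\n%s\n' "$after" "$before" "$around"

rm -rf "$tmp"
```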
Regular Expressions
Regular expressions (regex) make grep incredibly powerful. They let you match patterns, not just exact text.
Basic Regex Characters
Essential Regular Expression Patterns
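As a quick orientation, the most common basic regex characters are ^ (start of line), $ (end of line), . (any single character), and [...] (any one character from a set, or [^...] for any character not in the set). The sketch below tries each on a small made-up file; the sequences are illustrative only.

```shell
# Hypothetical file: four sequence lines and one header line.
tmp=$(mktemp -d)
printf '%s\n' \
  'ATGCCC' \
  'CCATGA' \
  'ATGAAATAA' \
  'GGGTAA' \
  '>header ATG' > "$tmp/demo.txt"

starts_atg=$(grep -c '^ATG' "$tmp/demo.txt")   # ^ anchors to line start
ends_taa=$(grep -c 'TAA$' "$tmp/demo.txt")     # $ anchors to line end
a_any_g=$(grep -c 'A.G' "$tmp/demo.txt")       # . matches any single character
non_header=$(grep -c '^[^>]' "$tmp/demo.txt")  # [^>] matches any character except >

echo "starts_atg=$starts_atg ends_taa=$ends_taa a_any_g=$a_any_g non_header=$non_header"

rm -rf "$tmp"
```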
Find Start Codons
grep '^ATG' coding_sequences.fasta
ATGGAGGATCAAGTTGGGTTTGGGTTCCGTCCGAACGACGAGGAGCTCGTTGGTCACTAT
ATGAACACGAAGGACCACCAGATCACCCAAGTACCACCGCCCCACCTCTCTTCCCACCAA
ATGGCGGATGCTTCACCTTCTTCTCCCCTCGCCGCCTTCCTCTCCAACTCCTTCAGAGAG
The ^ anchor matches the start of a line. Find sequence lines beginning with ATG (start codon). Headers start with > and so never match. In a multi-line FASTA, a continuation line that happens to begin with ATG matches as well.
Find Stop Codons at Sequence End
grep -E "(TAA|TAG|TGA)$" coding_sequences.fasta
GCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAG
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATAA
CCGGCCGGCCGGCCGGCCGGCCGGCCGGCCGGCCGGCCGGTGA
The $ anchor matches the end of a line. -E enables extended regex for | (OR). Find sequence lines ending with any stop codon.
Match Quality Scores
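grep '[I-J]' sample.fastq | wc -l
198756432
The character class [I-J] matches the Phred+33 quality characters I and J (Q40 and Q41, the highest scores in many Illumina files). Piping to wc -l counts the lines that contain at least one such character, not the individual high-quality bases, and any header line containing I or J is counted too.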
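To count individual high-quality bases rather than lines, one option is to isolate the quality lines first and then count each matching character. This sketch assumes strict 4-line FASTQ records (no wrapped sequences) and uses awk to keep every fourth line; the reads are made up.

```shell
# Hypothetical FASTQ with two 4-line records. The first header contains I and J.
tmp=$(mktemp -d)
printf '%s\n' \
  '@read1 JOB_ID' \
  'ACGT' \
  '+' \
  'IIJ5' \
  '@read2' \
  'GGCC' \
  '+' \
  '55II' > "$tmp/demo.fastq"

# Naive count: LINES containing I or J ([IJ] is equivalent to [I-J]).
# The header '@read1 JOB_ID' is counted by mistake.
naive=$(grep -c '[IJ]' "$tmp/demo.fastq")

# Keep only quality lines (every 4th line), then count each I or J character.
bases=$(awk 'NR % 4 == 0' "$tmp/demo.fastq" | grep -o '[IJ]' | wc -l | tr -d ' ')

echo "naive line count=$naive, high-quality bases=$bases"

rm -rf "$tmp"
```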
Practical Bioinformatics Examples
Extract Specific Genes from FASTA
Extract Gene Sequences by ID
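One way to pull several genes at once is to keep the wanted IDs in a file and pass it to grep with -f. The sketch below is an illustration under assumptions: the FASTA is single-line (one sequence line per record), the ID file name ids.txt is hypothetical, and -A is available (GNU and BSD grep).

```shell
# Hypothetical single-line FASTA with three records.
tmp=$(mktemp -d)
printf '%s\n' \
  '>AT1G01010.1 NAC001' 'ATGGAGGATC' \
  '>AT1G01020.1 ARV1'   'ATGGCGGCGA' \
  '>AT1G01030.1 NGA3'   'ATGGATCTAT' > "$tmp/genes.fasta"

# One wanted header prefix per line.
printf '%s\n' '>AT1G01010.1' '>AT1G01030.1' > "$tmp/ids.txt"

# -F: treat IDs literally, -f: read patterns from a file,
# -A 1: print the sequence line after each header.
# The second grep removes the "--" separators printed between match groups.
picked=$(grep -F -A 1 -f "$tmp/ids.txt" "$tmp/genes.fasta" | grep -v '^--$')
echo "$picked"

rm -rf "$tmp"
```

Because -F matches substrings, an ID such as >AT1G01010.1 would also match a header starting with >AT1G01010.10. For multi-line FASTA files, a dedicated tool is usually a safer choice than -A.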
Filter VCF by Chromosome
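grep -c '^Chr1[[:space:]]' variants.vcf
245678
Count variants on chromosome 1. The ^ anchors the match to the start of the line, and [[:space:]] requires whitespace (the tab after the first column) right after Chr1, so Chr10 and Chr11 are not counted.
grep -v '^#' variants.vcf | grep '^Chr1[[:space:]]' | head -n 5 > chr1_variants.vcf
Extract the first 5 Chr1 variants, excluding header lines, into a new file. Combine grep commands to filter precisely. The result has no VCF header, so treat it as a plain-text excerpt rather than a complete VCF.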
Find High-Quality Alignments
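grep -c 'MAPQ:60' alignments.sam
38234567
Count the lines containing 'MAPQ:60'. A MAPQ of 60 is the highest value some aligners (for example BWA-MEM) report and indicates a confident, typically unique placement. This pattern assumes the file carries a literal MAPQ:60 tag; standard SAM stores MAPQ as a number in the fifth column.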
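For a standard tab-separated SAM file, where MAPQ is the fifth column, a plain grep for 60 would also hit positions, read names, and other fields. A minimal sketch of a column-aware filter using awk (not grep) follows; the three alignment records are invented.

```shell
# Hypothetical SAM records: read1 and read3 have MAPQ 60, read2 has MAPQ 13.
# read3 also has POS 60, which a naive grep for "60" could not tell apart.
tmp=$(mktemp -d)
printf 'read1\t0\tChr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n'  > "$tmp/demo.sam"
printf 'read2\t0\tChr1\t200\t13\t4M\t*\t0\t0\tACGT\tIIII\n' >> "$tmp/demo.sam"
printf 'read3\t16\tChr2\t60\t60\t4M\t*\t0\t0\tACGT\tIIII\n' >> "$tmp/demo.sam"

# Skip @ header lines, then keep records whose 5th column (MAPQ) equals 60.
n_q60=$(awk -F '\t' '$1 !~ /^@/ && $5 == 60' "$tmp/demo.sam" | wc -l | tr -d ' ')
echo "reads with MAPQ 60: $n_q60"

rm -rf "$tmp"
```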
Search Gene Annotations
grep -i "zinc finger" gene_annotations.txt | cut -f1,2
AT1G10480 C2H2-type zinc finger family protein
AT1G27730 C3HC4-type RING finger protein
AT1G51700 Zinc finger (C3HC4-type RING finger) protein
AT2G19130 B-box type zinc finger protein
AT3G46620 LSD1-like zinc finger protein
Find all genes with 'zinc finger' anywhere on their line, then keep the first two columns. The case-insensitive search finds Zinc, zinc, ZINC, etc.
Extract FASTQ Reads by ID Pattern
grep -A 3 '^@SRR001666\.1[0-9][0-9][0-9] ' reads.fastq | head -n 20
@SRR001666.1000 071112_SLXA-EAS1_s_7:5:1:817:345 length=72
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACCAAGTT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@SRR001666.1001 071112_SLXA-EAS1_s_7:5:1:818:346 length=72
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
Extract FASTQ reads with IDs from 1000 to 1999. -A 3 gets the full 4-line FASTQ record. The pattern [0-9] matches any digit, and the escaped \. matches a literal period (an unescaped . would match any character). Quality lines can also begin with @, so a very specific header pattern like this one is safer than a bare '^@'.
Combining grep with Pipes
grep's real power emerges when combined with other commands:
Count Sequences by Type
grep '^>' sequences.fasta | grep -c 'mRNA'
15234
The first grep extracts all headers; the second counts those mentioning mRNA. The pipe connects the two commands.
Find and Sort Gene Names
grep "^>" proteins.fasta | cut -d"|" -f2 | sort -u | head -n 10
ARV1
DCL1
NAC001
NGA3
PPA1
WRKY45
ZFP1
ZFP2
Extract the headers, cut out the gene names (second field when splitting on |), sort them uniquely, and show the first 10. Chain multiple tools to process data.
Filter and Count Variants
Count High-Quality Variants per Chromosome
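A possible pipeline for this task is sketched below. It assumes that "high-quality" means the word PASS appears on the record line (normally in the FILTER column), and it uses a made-up tab-separated VCF. Because grep -w looks for PASS anywhere on the line, a column-aware tool would be stricter.

```shell
# Hypothetical VCF: three PASS records and one LowQual record.
tmp=$(mktemp -d)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'Chr1\t100\t.\tA\tG\t99\tPASS\tDP=52\n'
  printf 'Chr1\t200\t.\tC\tT\t20\tLowQual\tDP=8\n'
  printf 'Chr1\t300\t.\tG\tA\t95\tPASS\tDP=65\n'
  printf 'Chr2\t150\t.\tT\tC\t92\tPASS\tDP=58\n'
} > "$tmp/demo.vcf"

# Step 1: drop header lines and keep records containing the whole word PASS.
# Step 2: take the chromosome column, then count how often each value occurs.
counts=$(grep -v '^#' "$tmp/demo.vcf" | grep -w 'PASS' \
  | cut -f1 | sort | uniq -c | awk '{print $2 "=" $1}')
echo "$counts"

rm -rf "$tmp"
```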
Extended Regex with grep -E
For more complex patterns, use grep -E (extended grep) which supports +, ?, |, and grouping:
grep -E "^(ATG|GTG|TTG)" start_codons.fasta
ATGGAGGATCAAGTTGGGTTTGGGTTCCGTCCGAACGAC
GTGAACACGAAGGACCACCAGATCACCCAAGTACCACCGC
TTGATGGCGGATGCTTCACCTTCTTCTCCCCTCGCCGCCTT
Find sequences starting with any of three alternative start codons. The | means OR, and the parentheses group the alternatives.
grep -E "A{10,}" long_poly_a.fasta
AAAAAAAAAAAAAATCGATCG
GCTAGCTAAAAAAAAAAAAAAAAGCTAGC
TTTTAAAAAAAAAAAAAAAAAAAAAAAGGG
Find sequences with 10 or more consecutive A's. {10,} means 10 or more repetitions of the preceding character. Useful for finding poly-A tails.
grep -E "[GC]{3,}[AT]{3,}" gc_rich.fasta
GGGCCCATTATCG
GCGCGCAAAATTT
CCCGGGTTAATTAA
Find sequences with 3+ G/C bases followed by 3+ A/T bases. Demonstrates complex pattern matching.
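The extended operators are easy to check on short made-up strings. This sketch lowers the repeat threshold to 5 so the runs are easy to count by eye, and uses -o (GNU and BSD grep) to print only the matched run.

```shell
# Hypothetical sequences with A-runs of length 6, 3, and 5.
tmp=$(mktemp -d)
printf '%s\n' 'GCAAAAAAT' 'GCAAAT' 'TTGAAAAAC' > "$tmp/demo.txt"

# Lines with at least 5 consecutive A's.
n_poly_a=$(grep -c -E 'A{5,}' "$tmp/demo.txt")

# Print only the matched runs, one per line.
runs=$(grep -o -E 'A{5,}' "$tmp/demo.txt")

# Alternation with grouping: lines starting with ATG, GTG, or TTG.
n_start=$(grep -c -E '^(ATG|GTG|TTG)' "$tmp/demo.txt")

echo "poly-A lines=$n_poly_a, start-codon lines=$n_start"
echo "$runs"

rm -rf "$tmp"
```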
Recursive grep
Search through multiple files in a directory tree:
grep -r 'kinase' annotations/
annotations/chromosome1.gff:Chr1 TAIR10 gene 3631 5899 . + . protein kinase
annotations/chromosome2.gff:Chr2 TAIR10 gene 8456 12345 . - . tyrosine kinase
annotations/functions.txt:Protein kinase activity is essential for signaling
The -r flag searches recursively through all files in the annotations/ directory and its subdirectories. Each match is prefixed with the name of the file it came from.
grep -r --include='*.gff' 'exon' annotations/ | wc -l
456789
Search recursively but only in .gff files. --include filters which files to search.
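The effect of --include can be seen with a small throwaway directory tree. The file names below are hypothetical; -l prints only the names of files that contain a match, and --include is a GNU grep option that BSD grep also accepts.

```shell
# Hypothetical tree: two .gff files (one in a subdirectory) and one .txt file.
tmp=$(mktemp -d)
mkdir -p "$tmp/annotations/sub"
printf 'Chr1 gene protein kinase\n'  > "$tmp/annotations/chr1.gff"
printf 'Chr2 gene tyrosine kinase\n' > "$tmp/annotations/sub/chr2.gff"
printf 'kinase notes\n'              > "$tmp/annotations/notes.txt"

# All files under annotations/ that mention kinase.
all_files=$(grep -r -l 'kinase' "$tmp/annotations" | wc -l | tr -d ' ')

# Same search restricted to files whose name ends in .gff.
gff_files=$(grep -r -l --include='*.gff' 'kinase' "$tmp/annotations" | wc -l | tr -d ' ')

echo "all=$all_files gff_only=$gff_files"

rm -rf "$tmp"
```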
Performance Tips for Large Files
Use -F for Fixed Strings
grep -F 'ATCGATCGATCG' huge_file.fastq
The -F flag treats the pattern as a fixed string, not a regex. This is often faster for exact matches in large files, and it guarantees that no character in the pattern is interpreted as a regex operator.
Without -F, grep interprets the pattern as a regex, which can be slower.
Limit Output Early
grep 'MAPQ:60' alignments.sam | head -n 1000 > high_quality.sam
Pipe to head to stop after finding enough matches. Don't process the entire file if you only need a subset.
Use -m to Stop After Matches
grep -m 1000 '^>' sequences.fasta
The -m flag stops after N matching lines. It avoids the extra head process, and grep itself stops reading the file as soon as the limit is reached.
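Both approaches return the same lines, which a toy file can confirm. The sequences are made up; -m is provided by GNU and BSD grep.

```shell
# Hypothetical FASTA with three records.
tmp=$(mktemp -d)
printf '%s\n' '>s1' 'ACGT' '>s2' 'GGCC' '>s3' 'TTAA' > "$tmp/demo.fasta"

# Stop inside grep after two matching lines.
first_two=$(grep -m 2 '^>' "$tmp/demo.fasta")

# Same result using a second process to truncate the output.
via_head=$(grep '^>' "$tmp/demo.fasta" | head -n 2)

echo "$first_two"

rm -rf "$tmp"
```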
Common Mistakes
Forgetting to Quote Patterns
grep protein kinase genes.txt
grep: kinase: No such file or directory
Without quotes, the shell passes 'protein', 'kinase', and 'genes.txt' as three separate arguments. grep searches for 'protein' and treats both kinase and genes.txt as filenames, so it complains that the file kinase does not exist and prints any lines of genes.txt containing 'protein', prefixed with the file name.
Solution:
grep 'protein kinase' genes.txt
AT1G01010 NAC domain protein kinase
AT2G03400 Serine/threonine protein kinase
Quotes make 'protein kinase' a single pattern to search for.
Not Escaping Special Characters
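grep 'Chr1:3631-5899' annotations.gff
Outside square brackets the - is an ordinary character, so this particular pattern happens to match literally. The characters to watch for are ., *, [, ], ^, $, and \, which grep interprets as regex operators. In a pattern such as 'AT1G01010.1', the . matches any single character, not just a period.
Solution:
grep -F 'Chr1:3631-5899' annotations.gff
Use -F to treat the whole pattern as a fixed string, so every character is matched literally. The line must then contain exactly this text; a GFF record that stores the coordinates in separate columns (3631 and 5899) will not match.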
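The effect of an unescaped dot is easy to demonstrate. In this sketch the three IDs are made up so that only one contains a real period.

```shell
# Hypothetical ID list: only the first line contains the literal text AT1G01010.1
tmp=$(mktemp -d)
printf '%s\n' '>AT1G01010.1' '>AT1G01010_1' '>AT1G0101021' > "$tmp/ids.txt"

# As a regex, the . matches any character, so all three lines match.
regex_hits=$(grep -c 'AT1G01010.1' "$tmp/ids.txt")

# With -F the pattern is a fixed string and only the exact text matches.
fixed_hits=$(grep -c -F 'AT1G01010.1' "$tmp/ids.txt")

# Escaping the dot gives the same result without -F.
escaped_hits=$(grep -c 'AT1G01010\.1' "$tmp/ids.txt")

echo "regex=$regex_hits fixed=$fixed_hits escaped=$escaped_hits"

rm -rf "$tmp"
```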
Forgetting Line Start/End Anchors
grep 'Chr1' variants.vcf | head -n 3
##reference=hg38/Chr1.fasta
Chr1 12345 rs123 A G 99 PASS DP=52
Chr10 23456 . C T 87 PASS DP=48
Without ^, this matches Chr1 anywhere in the line, including Chr10, Chr11, and so on, as well as header comments.
Solution:
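grep '^Chr1[[:space:]]' variants.vcf | head -n 3
Chr1 12345 rs123 A G 99 PASS DP=52
Chr1 23456 . C T 87 PASS DP=48
Chr1 34567 rs456 G A 95 PASS DP=65
Use ^ for the line start and [[:space:]] for the tab that follows the first column. Plain grep patterns do not understand \t as a tab; with GNU grep you can write grep -P '^Chr1\t' instead. Now this matches only Chr1 as the first column, not Chr10 or headers.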
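The three levels of strictness can be compared on a small tab-separated file. The records are invented, and the literal tab is produced with printf so the example does not depend on any grep-specific escape syntax.

```shell
# Hypothetical VCF-like file: one header mentioning Chr1, two Chr1 records, one Chr10 record.
tmp=$(mktemp -d)
printf '##reference=Chr1.fasta\n'  > "$tmp/demo.vcf"
printf 'Chr1\t12345\trs123\n'     >> "$tmp/demo.vcf"
printf 'Chr10\t23456\t.\n'        >> "$tmp/demo.vcf"
printf 'Chr1\t34567\trs456\n'     >> "$tmp/demo.vcf"

tab=$(printf '\t')   # a literal tab character

loose=$(grep -c 'Chr1' "$tmp/demo.vcf")           # anywhere on the line
anchored=$(grep -c '^Chr1' "$tmp/demo.vcf")       # line start, still matches Chr10
strict=$(grep -c "^Chr1${tab}" "$tmp/demo.vcf")   # line start followed by a tab

echo "loose=$loose anchored=$anchored strict=$strict"

rm -rf "$tmp"
```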
Quick Reference
grep Commands Cheat Sheet
Best Practices
- Quote your patterns - Prevents shell interpretation of special characters
- Use -F for exact matches - Much faster when you don't need regex
- Anchor your patterns - Use ^ and $ to be specific about position
- Combine with other tools - grep | cut | sort | uniq is powerful
- Use -c for counting - Don't pipe to wc -l unnecessarily
- Test on small files first - Verify patterns work before processing huge files
- Consider memory - grep doesn't load files into memory, safe for huge files
- Learn basic regex - Time invested in regex pays off forever
Practice Exercises
Practice grep commands with real genomics data
Try these exercises on evomics-learn:
- Extract genes by functional annotation
- Filter VCF files by quality and chromosome
- Find sequences with specific motifs
- Count features in GFF files
- Search logs for errors and warnings
Next Steps
Now that you can find patterns in files, the next section covers sed - stream editing. sed lets you transform text, replace patterns, and modify files without opening them in editors.
You'll learn:
- Find and replace patterns
- Delete specific lines
- Transform file formats
- Edit files in place
- Chain sed commands for complex transformations