Evomics Docs

Pattern Matching with grep

The grep command searches files for lines that match a pattern. In bioinformatics, grep is essential: it lets you extract specific sequences, filter variants, find genes of interest, and search through massive files without loading them into memory.

grep can search through gigabytes of data in seconds. Learn grep well, and you'll use it daily.

Basic grep

At its simplest, grep finds lines containing a pattern:

Input
grep 'ATGGCG' sequences.fasta
Output
ATGGCGATGGCTTCACCTTCTTCTCCCCTCGCCGCCTTCCTCTCCAACTCCTTCAGAGAG

Find all lines containing the pattern 'ATGGCG'. grep prints each line of the file that contains a match.

By default grep treats the pattern as a basic regular expression. A pattern with no special characters, such as 'ATGGCG', matches that exact text.

Syntax

Input
grep 'pattern' filename

Basic grep syntax: pattern (in quotes) followed by the filename to search.

Always quote your patterns. Special characters like $, *, and | have meaning to the shell. Quotes protect your pattern from shell interpretation.
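
The effect of quoting can be seen without running grep at all. The sketch below uses made-up file names in a temporary directory and counts how many arguments the shell would hand to a command when a pattern containing * is left unquoted versus quoted.

```shell
# Illustrative only: AT1.txt and AT2.txt are hypothetical files in a scratch directory.
workdir=$(mktemp -d)
touch "$workdir/AT1.txt" "$workdir/AT2.txt"

# Unquoted *: the shell expands the glob into matching file names first.
set -- "$workdir"/AT*
unquoted_count=$#

# Quoted *: the pattern is passed through as one untouched argument.
set -- "$workdir/AT*"
quoted_count=$#

echo "unquoted: $unquoted_count arguments, quoted: $quoted_count argument"
rm -rf "$workdir"
```

With grep, the unquoted form would silently turn one pattern into a list of file names.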

Essential grep Options

-i: Ignore Case

Input
grep -i 'kinase' gene_annotations.gff
Output
Chr1	TAIR10	gene	3631	5899	.	+	.	ID=AT1G01010;Name=NAC001;Note=protein kinase activity
Chr1	TAIR10	gene	6788	9130	.	+	.	ID=AT1G01020;Name=ARV1;Note=serine/threonine kinase
Chr2	TAIR10	gene	1025	3455	.	-	.	ID=AT2G01010;Name=PDK1;Note=Pyruvate dehydrogenase kinase

The -i flag ignores case. Finds 'kinase', 'Kinase', 'KINASE', etc.

Without -i, grep is case-sensitive:

Input
grep 'Kinase' gene_annotations.gff

No matches because all instances were lowercase 'kinase'. Case matters by default.

-v: Invert Match (Exclude Lines)

Input
grep -v '^#' variants.vcf | head -n 5
Output
Chr1	12345	rs123	A	G	99	PASS	DP=52	GT:DP	0/1:52
Chr1	23456	.	C	T	87	PASS	DP=48	GT:DP	1/1:48
Chr2	34567	rs456	G	A	95	PASS	DP=65	GT:DP	0/1:65
Chr2	45678	.	T	C	92	PASS	DP=58	GT:DP	0/1:58
Chr3	56789	rs789	C	G	88	PASS	DP=45	GT:DP	0/1:45

The -v flag inverts the match: it shows lines that DON'T match. Here we exclude header lines (starting with #) from a VCF file.

Use grep -v to remove unwanted lines. Common use: grep -v '^#' removes comments, grep -v '^$' removes blank lines.
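
The two common uses above can be combined in one call by repeating -e. This is a minimal sketch on a made-up five-line file, not a real VCF; the variable names are illustrative.

```shell
# Hypothetical demo data: two header lines, one blank line, two data lines.
demo=$(mktemp)
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\n\nChr1\t12345\nChr2\t34567\n' > "$demo"

# -e may be given more than once; with -v, a line is dropped if it matches ANY pattern.
data_lines=$(grep -v -e '^#' -e '^$' "$demo")
printf '%s\n' "$data_lines"

rm -f "$demo"
```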

-c: Count Matches

Input
grep -c '^>' sequences.fasta
Output
27655

The -c flag counts matching lines instead of printing them. Count FASTA sequences by counting headers (lines starting with >).

Input
grep -c 'high_quality' alignment_summary.txt
Output
42567890

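Count how many reads passed quality filtering in an alignment report. -c counts matching lines, not individual occurrences, so this assumes each read is listed on its own line.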
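
Since -c counts lines, a line with several matches still counts once. The following sketch, on made-up sequence lines, contrasts -c with counting every occurrence via -o (print each match on its own line) piped to wc -l.

```shell
# Hypothetical demo data: the first line contains ATG twice.
demo=$(mktemp)
printf 'ATGATGCC\nGGATGTTT\nCCCCCCCC\n' > "$demo"

# Number of lines that contain ATG at least once.
line_count=$(grep -c 'ATG' "$demo")

# Number of non-overlapping ATG occurrences across the whole file.
hit_count=$(grep -o 'ATG' "$demo" | wc -l | tr -d ' ')

echo "lines: $line_count, occurrences: $hit_count"
rm -f "$demo"
```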

-n: Show Line Numbers

Input
grep -n 'stop_codon' genes.gtf | head -n 3
Output
1234:Chr1	TAIR10	stop_codon	5897	5899	.	-	0	gene_id=AT1G01010
5678:Chr1	TAIR10	stop_codon	9128	9130	.	+	0	gene_id=AT1G01020
9012:Chr1	TAIR10	stop_codon	13712	13714	.	-	0	gene_id=AT1G01030

The -n flag adds line numbers before each match. Useful for referencing specific locations in large files.

-A, -B, -C: Context Lines

Get context around matches:

Input
grep -A 2 '^>AT1G01010' arabidopsis.fasta
Output
>AT1G01010.1 | NAC001 | NAC domain protein | chr1:3631-5899 REVERSE
ATGGAGGATCAAGTTGGGTTTGGGTTCCGTCCGAACGACGAGGAGCTCGTTGGTCACTAT
CTCCGTAACAAAATCGAAGGAAACACTAGCCGCGACGTTGAAGTAGCCATCAGCGAGGTA

The -A flag shows lines After the match. -A 2 shows the matching line plus 2 lines after. Get header plus first 2 sequence lines.

Input
grep -B 1 'MAPQ:60' alignments.sam | head -n 6
Output
SRR001666.1234	0	Chr1	3631	60	72M	*	0	0	ATCGATCG...
SRR001666.1234	0	Chr1	3631	60	72M	*	0	0	...	MAPQ:60
--
SRR001666.5678	0	Chr1	8456	60	72M	*	0	0	GCTAGCTA...
SRR001666.5678	0	Chr1	8456	60	72M	*	0	0	...	MAPQ:60

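The -B flag shows lines Before the match. Useful for seeing what precedes interesting patterns. The -- separates match groups. Note that standard SAM records store the mapping quality as the fifth column rather than as a 'MAPQ:60' tag, so this pattern only matches files that contain that literal text.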

Input
grep -C 1 'ERROR' pipeline.log
Output
[2025-11-20 15:23:45] Processing sample_03
[2025-11-20 15:23:48] ERROR: Failed to open input file
[2025-11-20 15:23:48] Attempting retry with alternative path

The -C flag shows Context (lines before AND after). -C 1 shows 1 line before and 1 line after each match.

Regular Expressions

Regular expressions (regex) make grep incredibly powerful. They let you match patterns, not just exact text.

Basic Regex Characters

Essential Regular Expression Patterns

# Anchors
^pattern # Match at start of line
pattern$ # Match at end of line
^pattern$ # Match entire line exactly

# Character classes
. # Any single character
[ABC] # Any one character: A, B, or C
[A-Z] # Any uppercase letter
[0-9] # Any digit
[^ABC] # Any character EXCEPT A, B, or C

# Quantifiers
* # Zero or more of previous character
+ # One or more of previous character (use grep -E)
? # Zero or one of previous character (use grep -E)
{n} # Exactly n occurrences (use grep -E)
{n,m} # Between n and m occurrences (use grep -E)

# Special sequences
\t # Tab character (use grep -P; plain grep and grep -E do not interpret \t)
\s # Whitespace (use grep -P for Perl regex)
\w # Word character (use grep -P)
\d # Digit (use grep -P)

# Grouping
(pattern) # Group patterns (use grep -E)
pattern1|pattern2 # Match pattern1 OR pattern2 (use grep -E)

Format Details
Anchors: Specify position in the line
Characters: Match specific characters or ranges
Quantifiers: Specify how many times to match
Special: Match special character types
Logic: Group and combine patterns
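
To see why some quantifiers need grep -E, the sketch below runs the same pattern in both modes on three made-up lines. In basic mode + is an ordinary character; in extended mode it means "one or more".

```shell
# Hypothetical demo data.
demo=$(mktemp)
printf 'AAAT\nAT\nT\n' > "$demo"

# Basic regex: 'A+T' looks for the literal text A+T, which is absent here.
# (|| true keeps a zero-match exit status from stopping a strict script.)
bre_hits=$(grep -c 'A+T' "$demo" || true)

# Extended regex: one or more A followed by T.
ere_hits=$(grep -c -E 'A+T' "$demo")

echo "basic: $bre_hits, extended: $ere_hits"
rm -f "$demo"
```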

Find Start Codons

Input
grep '^ATG' coding_sequences.fasta
Output
15,432 sequences with ATG start
ATGGAGGATCAAGTTGGGTTTGGGTTCCGTCCGAACGACGAGGAGCTCGTTGGTCACTAT
ATGAACACGAAGGACCACCAGATCACCCAAGTACCACCGCCCCACCTCTCTTCCCACCAA
ATGGCGGATGCTTCACCTTCTTCTCCCCTCGCCGCCTTCCTCTCCAACTCCTTCAGAGAG

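The ^ anchor matches the start of a line. Find sequence lines beginning with ATG (start codon); headers are not matched because they begin with >. If the FASTA file wraps sequences across several lines, every wrapped line is tested, so a match may come from the middle of a sequence rather than its true start.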
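
One way to test only the true start of each sequence is to join wrapped lines first. The sketch below is an assumption-laden illustration: linearize is a hypothetical helper written here in awk, and the two records are made up.

```shell
# Hypothetical demo data: gene1 has ATG at the start of its SECOND line only.
demo=$(mktemp)
printf '>gene1\nCCATGAAA\nATGCCCTT\n>gene2\nATGGCGAA\nTTTTTTGA\n' > "$demo"

# Join each record's sequence lines into one line; headers stay on their own line.
linearize() {
  awk '/^>/ { if (seq != "") print seq; print; seq = ""; next }
       { seq = seq $0 }
       END { if (seq != "") print seq }' "$1"
}

# Per-line test: counts any wrapped line that starts with ATG.
wrapped_hits=$(grep -c '^ATG' "$demo")

# Per-sequence test: counts only sequences whose first bases are ATG.
true_hits=$(linearize "$demo" | grep -c '^ATG')

echo "wrapped lines: $wrapped_hits, sequences: $true_hits"
rm -f "$demo"
```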

Find Stop Codons at Sequence End

Input
grep -E '(TAA|TAG|TGA)$' coding_sequences.fasta
Output
GCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAG
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATAA
CCGGCCGGCCGGCCGGCCGGCCGGCCGGCCGGCCGGCCGGTGA

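The $ anchor matches line end. -E enables extended regex for | (OR). This finds lines ending with any stop codon; in a FASTA file wrapped across several lines, only the last line of each record holds the true sequence end.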

Match Quality Scores

Input
grep '[I-J]' sample.fastq | wc -l
Output
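198756432

The character class [I-J] matches the Phred+33 quality characters I and J (Q40 and Q41). Because wc -l counts lines, this pipeline counts lines that contain at least one such character, not individual bases, and it also counts header lines that happen to contain an I or J.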
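
A sketch of counting individual Q40+ bases instead, assuming strict four-line FASTQ records and Phred+33 encoding. The two reads are made up; awk selects every fourth line (the quality string) and grep -o emits one line per matching character.

```shell
# Hypothetical demo data: headers contain the letter I (in "ILLUMINA").
demo=$(mktemp)
printf '@read1 ILLUMINA\nACGT\n+\nIIJJ\n@read2 ILLUMINA\nGGCC\n+\n##I#\n' > "$demo"

# Line-based count: includes the two headers as well as the two quality lines.
naive=$(grep '[I-J]' "$demo" | wc -l | tr -d ' ')

# Base-based count: quality lines only, one output line per I or J.
bases=$(awk 'NR % 4 == 0' "$demo" | grep -o '[IJ]' | wc -l | tr -d ' ')

echo "matching lines: $naive, Q40+ bases: $bases"
rm -f "$demo"
```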

Practical Bioinformatics Examples

Extract Specific Genes from FASTA

Extract Gene Sequences by ID

grep '^>AT1G0101' arabidopsis.fasta
Output
>AT1G01010.1 | NAC001 | NAC domain protein
>AT1G01015.1 | Unknown protein
>AT1G01018.1 | Short protein
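
The command above lists matching headers only. To pull out the sequence lines as well, one option is a small awk filter; extract_record below is a hypothetical helper, shown on made-up records, that keeps a header starting with the given ID plus every line up to the next header.

```shell
# Hypothetical demo data with two records.
demo=$(mktemp)
printf '>AT1G01010.1 | NAC001\nATGGAGGA\nTCAAGTTG\n>AT1G01020.1 | ARV1\nATGGCGGC\n' > "$demo"

# On each header, decide whether it begins with ">" followed by the wanted ID;
# print lines while that decision holds.
extract_record() {
  awk -v id="$1" '/^>/ { keep = (index($0, ">" id) == 1) } keep' "$2"
}

record=$(extract_record 'AT1G01010' "$demo")
printf '%s\n' "$record"
rm -f "$demo"
```

The ID is compared as plain text, so a dot in it is not treated as a regex wildcard. It is still a prefix match, so a shorter ID also selects any longer IDs that begin with it.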

Filter VCF by Chromosome

Input
grep '^Chr1' variants.vcf | wc -l
Output
245678

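Count variants on chromosome 1. The ^ ensures we match Chr1 at the start of the line, not in the middle. Note that '^Chr1' also matches Chr10, Chr11, and so on, so this count is only correct for genomes with fewer than ten chromosomes, such as Arabidopsis.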

Input
grep -v '^#' variants.vcf | grep '^Chr1' | head -n 5 > chr1_variants.vcf

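Extract the first 5 Chr1 variants, excluding header lines. Combining grep commands lets you filter in stages, although here '^Chr1' alone already excludes lines starting with #. The resulting file has no header, so it is a plain table rather than a valid VCF.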

Find High-Quality Alignments

Input
grep 'MAPQ:60' alignments.sam | wc -l
Output
38234567

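Count lines containing the text 'MAPQ:60' in a SAM file. In standard SAM output the mapping quality is the fifth tab-separated column rather than a 'MAPQ:' tag, so this works only for files that carry that literal text. A MAPQ of 60 is the top score reported by some aligners and usually indicates a confidently, uniquely placed read; the exact meaning depends on the aligner.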
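
For SAM files in the standard layout, where MAPQ is the fifth tab-separated column, a column comparison is more reliable than a text search. The sketch below uses awk on three made-up alignment records; samtools view with its -q threshold option is another common route when samtools is installed.

```shell
# Hypothetical demo data: one header line and three alignments (MAPQ 60, 12, 60).
demo=$(mktemp)
printf '@HD\tVN:1.6\n' > "$demo"
printf 'r1\t0\tChr1\t3631\t60\t4M\t*\t0\t0\tACGT\tIIII\n' >> "$demo"
printf 'r2\t0\tChr1\t8456\t12\t4M\t*\t0\t0\tACGT\tIIII\n' >> "$demo"
printf 'r3\t16\tChr2\t1025\t60\t4M\t*\t0\t0\tACGT\tIIII\n' >> "$demo"

# Skip @ header lines, then keep records whose fifth field equals 60.
mapq60=$(awk -F'\t' '!/^@/ && $5 == 60' "$demo" | wc -l | tr -d ' ')

echo "alignments with MAPQ 60: $mapq60"
rm -f "$demo"
```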

Search Gene Annotations

Input
grep -i "zinc finger" gene_annotations.txt | cut -f1,2
Output
234 zinc finger genes
AT1G10480	C2H2-type zinc finger family protein
AT1G27730	C3HC4-type RING finger protein
AT1G51700	Zinc finger (C3HC4-type RING finger) protein
AT2G19130	B-box type zinc finger protein
AT3G46620	LSD1-like zinc finger protein

Find all genes with 'zinc finger' in their description. Case-insensitive search finds Zinc, zinc, ZINC, etc.

Extract FASTQ Reads by ID Pattern

Input
grep -A 3 '^@SRR001666\.1[0-9][0-9][0-9] ' reads.fastq | head -n 20
Output
@SRR001666.1000 071112_SLXA-EAS1_s_7:5:1:817:345 length=72
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACCAAGTT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@SRR001666.1001 071112_SLXA-EAS1_s_7:5:1:818:346 length=72
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

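Extract FASTQ reads with IDs from 1000-1999. -A 3 gets the full 4-line FASTQ record. Each [0-9] matches any single digit. In a regex, an unescaped . matches any character, so write \. to match the literal dot in the read ID. If the matching reads are not adjacent in the file, grep prints a -- separator line between groups, which must be removed before the output is used as FASTQ.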

Combining grep with Pipes

grep's real power emerges when combined with other commands:

Count Sequences by Type

Input
grep '^>' sequences.fasta | grep -c 'mRNA'
Output
15234

First grep extracts all headers, second grep counts those mentioning mRNA. Pipe connects the commands.

Find and Sort Gene Names

Input
grep "^>" proteins.fasta | cut -d"|" -f2 | sort -u | head -n 10
Output
ARV1
DCL1
NAC001
NGA3
PPA1
WRKY45
ZFP1
ZFP2

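Extract headers, take the second |-delimited field (the gene name), sort uniquely, and show up to 10 names. cut keeps any spaces around the field, so names may carry leading or trailing whitespace. Chain multiple tools to process data.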

Filter and Count Variants

Count High-Quality Variants per Chromosome

grep -v '^#' high_confidence.vcf | grep 'PASS' | cut -f1 | sort | uniq -c
Output
  245678 Chr1
  198234 Chr2
  234567 Chr3
  189123 Chr4
  176234 Chr5
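
The pipeline above keeps any non-header line that contains the text PASS anywhere. A stricter variant, sketched here on made-up records, compares the FILTER column (field 7 in VCF) exactly before counting per chromosome. The trailing awk only reformats uniq -c output as name=count.

```shell
# Hypothetical demo data: the second record is LowQual but mentions PASSED in INFO.
demo=$(mktemp)
printf '##source=PASSfilter\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > "$demo"
printf 'Chr1\t100\t.\tA\tG\t99\tPASS\tDP=52\n' >> "$demo"
printf 'Chr1\t200\t.\tC\tT\t20\tLowQual\tNOTE=PASSED_earlier\n' >> "$demo"
printf 'Chr2\t300\t.\tG\tA\t95\tPASS\tDP=65\n' >> "$demo"

# Text search: counts the LowQual record too.
naive=$(grep -v '^#' "$demo" | grep -c 'PASS')

# Column comparison: only records whose FILTER field is exactly PASS.
per_chrom=$(awk -F'\t' '!/^#/ && $7 == "PASS" { print $1 }' "$demo" \
  | sort | uniq -c | awk '{ print $2 "=" $1 }')

echo "text search: $naive records"
printf '%s\n' "$per_chrom"
rm -f "$demo"
```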

Extended Regex with grep -E

For more complex patterns, use grep -E (extended grep) which supports +, ?, |, and grouping:

Input
grep -E "^(ATG|GTG|TTG)" start_codons.fasta
Output
ATGGAGGATCAAGTTGGGTTTGGGTTCCGTCCGAACGAC
GTGAACACGAAGGACCACCAGATCACCCAAGTACCACCGC
TTGATGGCGGATGCTTCACCTTCTTCTCCCCTCGCCGCCTT

Find sequences starting with any of three alternative start codons. The | means OR, parentheses group the alternatives.

Input
grep -E "A{10,}" long_poly_a.fasta
Output
1,234 poly-A stretches
AAAAAAAAAAAAAATCGATCG
GCTAGCTAAAAAAAAAAAAAAAAGCTAGC
TTTTAAAAAAAAAAAAAAAAAAAAAAAGGG

Find sequences with 10 or more consecutive A's. The pattern means 10 or more repetitions. Useful for finding poly-A tails.

Input
grep -E "[GC]{3,}[AT]{3,}" gc_rich.fasta
Output
GGGCCCATTATCG
GCGCGCAAAATTT
CCCGGGTTAATTAA

Find sequences with 3+ G/C bases followed by 3+ A/T bases. Demonstrates complex pattern matching.

Recursive grep

Search through multiple files in a directory tree:

Input
grep -r 'kinase' annotations/
Output
annotations/chromosome1.gff:Chr1	TAIR10	gene	3631	5899	.	+	.	protein kinase
annotations/chromosome2.gff:Chr2	TAIR10	gene	8456	12345	.	-	.	tyrosine kinase
annotations/functions.txt:Protein kinase activity is essential for signaling

The -r flag searches recursively through all files in the annotations/ directory and its subdirectories.

Input
grep -r --include='*.gff' 'exon' annotations/ | wc -l
Output
456789

Search recursively but only in .gff files. --include filters which files to search.
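
When only the names of matching files are needed, -l (list file names) combines with -r and --include. This is a small sketch on a made-up directory tree; the file names are illustrative, and --include is assumed to be available as in GNU grep.

```shell
# Hypothetical directory tree with two .gff files and one .txt file.
dir=$(mktemp -d)
mkdir -p "$dir/sub"
printf 'Chr1\tgene\tkinase\n' > "$dir/a.gff"
printf 'kinase notes\n' > "$dir/sub/notes.txt"
printf 'Chr2\texon\n' > "$dir/sub/b.gff"

# Every file under $dir that contains 'kinase'.
matching_files=$(grep -rl 'kinase' "$dir" | sort)

# The same search restricted to .gff files.
gff_only=$(grep -rl --include='*.gff' 'kinase' "$dir")

printf '%s\n' "$matching_files"
rm -rf "$dir"
```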

Performance Tips for Large Files

Use -F for Fixed Strings

Input
grep -F 'ATCGATCGATCG' huge_file.fastq

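The -F flag treats the pattern as a fixed string, not a regex. This can be faster for exact matches in large files, and it guarantees that characters such as . and * are matched literally.

Without -F, grep interprets the pattern as a regex. For a plain string the speed difference can be small with modern grep implementations, so measure on your own data.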

Limit Output Early

Input
grep 'MAPQ:60' alignments.sam | head -n 1000 > high_quality.sam

Pipe to head to stop after finding enough matches. Don't process the entire file if you only need a subset.

Use -m to Stop After Matches

Input
grep -m 1000 '^>' sequences.fasta

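The -m flag stops after N matching lines, so grep stops reading the file without needing a second command. Piping to head also ends grep early once head exits, but -m makes the limit explicit and avoids the extra process.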
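
A minimal illustration of -m on a made-up three-record FASTA file: only the first two matching lines are returned.

```shell
# Hypothetical demo data with three headers.
demo=$(mktemp)
printf '>seq1\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n' > "$demo"

# Stop after the second matching line; the third header is never printed.
first_two=$(grep -m 2 '^>' "$demo")

printf '%s\n' "$first_two"
rm -f "$demo"
```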

Common Mistakes

Forgetting to Quote Patterns

Input
grep protein kinase genes.txt
Output
grep: kinase: No such file or directory

Without quotes, the shell passes 'protein', 'kinase', and 'genes.txt' as three separate arguments. grep takes 'protein' as the pattern and treats kinase and genes.txt as filenames: it reports that kinase does not exist and searches genes.txt for 'protein' alone.

Solution:

Input
grep 'protein kinase' genes.txt
Output
AT1G01010	NAC domain protein kinase
AT2G03400	Serine/threonine protein kinase

Quotes treat 'protein kinase' as a single pattern to search for.

Not Escaping Special Characters

Input
grep 'AT1G01010.1' arabidopsis.fasta

The . is a regex metacharacter that matches any single character, so this pattern also matches IDs such as 'AT1G01010_1' or 'AT1G0101011'. (A - is special only inside brackets, as in [0-9]; elsewhere it is literal.)

Solution:

Input
grep -F 'AT1G01010.1' arabidopsis.fasta
Output
>AT1G01010.1 | NAC001 | NAC domain protein

Use -F to treat the pattern as a fixed string, or escape the character as \. to keep regex matching. Now the . is literal.
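
How a regex metacharacter changes what matches can be checked on a tiny file. The sketch below uses the . character and two made-up headers, one of which differs from the target ID only at the position of the dot.

```shell
# Hypothetical demo data: the second ID has an underscore where the first has a dot.
demo=$(mktemp)
printf '>AT1G01010.1 | NAC001\n>AT1G01010_1 | hypothetical\n' > "$demo"

# As a regex, . matches any character, so both lines match.
regex_hits=$(grep -c 'AT1G01010.1' "$demo")

# With -F the whole pattern is literal text.
fixed_hits=$(grep -c -F 'AT1G01010.1' "$demo")

# Escaping the dot keeps regex mode but makes that one character literal.
escaped_hits=$(grep -c 'AT1G01010\.1' "$demo")

echo "regex: $regex_hits, fixed: $fixed_hits, escaped: $escaped_hits"
rm -f "$demo"
```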

Forgetting Line Start/End Anchors

Input
grep 'Chr1' variants.vcf | head -n 3
Output
##reference=hg38/Chr1.fasta
Chr1	12345	rs123	A	G	99	PASS	DP=52
Chr10	23456	.	C	T	87	PASS	DP=48

Without ^, this matches Chr1 anywhere in the line, including Chr10, Chr11, etc. and header comments.

Solution:

Input
grep -P '^Chr1\t' variants.vcf | head -n 3
Output
Chr1	12345	rs123	A	G	99	PASS	DP=52
Chr1	23456	.	C	T	87	PASS	DP=48
Chr1	34567	rs456	G	A	95	PASS	DP=65

Use ^ for line start and \t for tab. grep's basic and extended syntaxes do not interpret \t, so the -P flag (Perl-compatible regex, available in GNU grep) is needed here. Now this matches only Chr1 as the first column, not Chr10 or headers.
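
Two portable ways to match Chr1 as a whole first column, sketched on made-up data: placing a real tab character in the pattern, and comparing the column in awk. The variable names are illustrative.

```shell
# Hypothetical demo data: one header mentioning Chr1, two Chr1 records, one Chr10 record.
demo=$(mktemp)
printf '##reference=Chr1.fasta\nChr1\t12345\nChr10\t23456\nChr1\t34567\n' > "$demo"

# Unanchored search: matches the header and Chr10 as well.
loose=$(grep -c 'Chr1' "$demo")

# Put an actual tab character into the pattern via a shell variable.
tab=$(printf '\t')
anchored=$(grep -c "^Chr1${tab}" "$demo")

# Or compare the first tab-separated field exactly.
by_column=$(awk -F'\t' '$1 == "Chr1"' "$demo" | wc -l | tr -d ' ')

echo "loose: $loose, anchored: $anchored, by column: $by_column"
rm -f "$demo"
```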

Quick Reference

grep Commands Cheat Sheet

# Basic usage
grep 'pattern' file # Search for pattern in file
grep 'pattern' file1 file2 # Search multiple files
grep -r 'pattern' directory/ # Recursive search

# Common flags
grep -i 'pattern' file # Case-insensitive
grep -v 'pattern' file # Invert match (exclude)
grep -c 'pattern' file # Count matching lines
grep -n 'pattern' file # Show line numbers
grep -A 2 'pattern' file # Show 2 lines after match
grep -B 2 'pattern' file # Show 2 lines before match
grep -C 2 'pattern' file # Show 2 lines context

# Performance
grep -F 'fixed_string' file # Fixed string, no regex interpretation
grep -m 100 'pattern' file # Stop after 100 matches

# Extended regex
grep -E 'pat1|pat2' file # OR patterns
grep -E 'A{5,}' file # Quantifiers
grep -E '^(ATG|GTG)' file # Grouping

# Bioinformatics examples
grep '^>' file.fasta # FASTA headers
grep -c '^@' file.fastq # Count FASTQ reads (approx; quality lines can also start with @)
grep -v '^#' file.vcf # Remove VCF headers
grep 'MAPQ:60' file.sam # Lines containing the literal text MAPQ:60

Format Details
Basic: Fundamental grep usage
Flags: Most commonly used options
Speed: Optimization for large files
Regex: Extended regular expressions with -E
Genomics: Common bioinformatics patterns

Best Practices

grep Best Practices
  1. Quote your patterns - Prevents shell interpretation of special characters
  2. Use -F for exact matches - Avoids regex surprises and can be faster when you don't need regex
  3. Anchor your patterns - Use ^ and $ to be specific about position
  4. Combine with other tools - grep | cut | sort | uniq is powerful
  5. Use -c for counting - Don't pipe to wc -l unnecessarily
  6. Test on small files first - Verify patterns work before processing huge files
  7. Consider memory - grep doesn't load files into memory, safe for huge files
  8. Learn basic regex - Time invested in regex pays off forever

Practice Exercises

Try these exercises on evomics-learn:

  1. Extract genes by functional annotation
  2. Filter VCF files by quality and chromosome
  3. Find sequences with specific motifs
  4. Count features in GFF files
  5. Search logs for errors and warnings

Next Steps

Now that you can find patterns in files, the next section covers sed - stream editing. sed lets you transform text, replace patterns, and modify files without opening them in editors.

You'll learn:

  • Find and replace patterns
  • Delete specific lines
  • Transform file formats
  • Edit files in place
  • Chain sed commands for complex transformations

Further Reading