Pattern Matching with grep

The grep command searches files for patterns. For bioinformatics, grep is essential - it lets you extract specific sequences, filter variants, find genes of interest, and search through massive files without loading them into memory.

grep can search through gigabytes of data in seconds. Learn grep well, and you'll use it daily.

Basic grep

At its simplest, grep finds lines containing a pattern:

Input

grep 'ATGGCG' sequences.fasta

Output

1 match

ATGGCGATGGCTTCACCTTCTTCTCCCCTCGCCGCCTTCCTCTCCAACTCCTTCAGAGAG

Find all lines containing the pattern 'ATGGCG'. grep prints only the lines that match: the FASTA header above a matching sequence line is not shown, because the header itself does not contain the pattern.

By default, grep treats the pattern as a basic regular expression, so plain text such as ATGGCG simply matches itself. It prints every line that contains your pattern.
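To see which lines grep actually prints, here is a minimal sketch on a made-up two-record FASTA (the file name and record names are hypothetical). It also uses -B 1, covered under Context Lines below, as one way to recover the header when each sequence sits on a single line.

```shell
# Hypothetical two-record FASTA with one sequence line per record.
tmpdir=$(mktemp -d)
printf '>seq1 demo\nATGGAGGATCAAG\n>seq2 demo\nATGGCGATGGCTT\n' > "$tmpdir/demo.fasta"

# Plain grep prints only the line that contains the pattern.
seq_only=$(grep 'ATGGCG' "$tmpdir/demo.fasta")

# -B 1 also prints the line before each match (here, the header).
with_header=$(grep -B 1 'ATGGCG' "$tmpdir/demo.fasta")

printf '%s\n' "$seq_only"
printf '%s\n' "$with_header"
rm -r "$tmpdir"
```

The -B 1 trick relies on the header being directly above the matching line, so it does not carry over to wrapped sequences.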

Syntax

Input

grep 'pattern' filename

Basic grep syntax: pattern (in quotes) followed by the filename to search.

Always quote your patterns. Special characters like $, *, and | have meaning to the shell. Quotes protect your pattern from shell interpretation.

Essential grep Options

-i: Case-Insensitive Search

Input

grep -i 'kinase' gene_annotations.gff

Output

3 matches

Chr1	TAIR10	gene	3631	5899	.	+	.	ID=AT1G01010;Name=NAC001;Note=protein kinase activity
Chr1	TAIR10	gene	6788	9130	.	+	.	ID=AT1G01020;Name=ARV1;Note=serine/threonine kinase
Chr2	TAIR10	gene	1025	3455	.	-	.	ID=AT2G01010;Name=PDK1;Note=Pyruvate dehydrogenase kinase

The -i flag ignores case. Finds 'kinase', 'Kinase', 'KINASE', etc.

Without -i, grep is case-sensitive:

Input

grep 'Kinase' gene_annotations.gff

No matches because all instances were lowercase 'kinase'. Case matters by default.

-v: Invert Match (Exclude Lines)

Input

grep -v '^#' variants.vcf | head -n 5

Output

Chr1	12345	rs123	A	G	99	PASS	DP=52	GT:DP	0/1:52
Chr1	23456	.	C	T	87	PASS	DP=48	GT:DP	1/1:48
Chr2	34567	rs456	G	A	95	PASS	DP=65	GT:DP	0/1:65
Chr2	45678	.	T	C	92	PASS	DP=58	GT:DP	0/1:58
Chr3	56789	rs789	C	G	88	PASS	DP=45	GT:DP	0/1:45

The -v flag inverts the match - shows lines that DON'T match. Here we exclude header lines (starting with #) from a VCF file.

Use grep -v to remove unwanted lines. Common use: grep -v '^#' removes comments, grep -v '^$' removes blank lines.
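Both clean-ups can be combined in one command by giving -v several patterns with -e. The sketch below assumes a tiny made-up VCF-like file (demo.vcf is hypothetical) containing header lines and blank lines.

```shell
# Hypothetical VCF-like file with two header lines and two blank lines.
tmpdir=$(mktemp -d)
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\n\nChr1\t12345\n\nChr2\t34567\n' > "$tmpdir/demo.vcf"

# Each -e adds a pattern; with -v, a line is dropped if it matches any of them.
data_lines=$(grep -v -e '^#' -e '^$' "$tmpdir/demo.vcf")
n_data=$(printf '%s\n' "$data_lines" | wc -l)

printf '%s\n' "$data_lines"
echo "data lines kept: $n_data"
rm -r "$tmpdir"
```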

-c: Count Matches

Input

grep -c '^>' sequences.fasta

Output

27,655 sequences

27655

The -c flag counts matching lines instead of printing them. Count FASTA sequences by counting headers (lines starting with >).

Input

grep -c 'high_quality' alignment_summary.txt

Output

42,567,890 high-quality reads

42567890

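Count the lines that contain 'high_quality' in an alignment report. -c counts matching lines, not individual occurrences, so the result equals the number of reads only if each read is reported on its own line.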
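The difference between counting lines and counting occurrences is easy to check on a small example. This sketch uses a made-up three-line file and the -o flag (supported by GNU and BSD grep), which prints each match on its own line.

```shell
# Hypothetical file: line 1 holds ATG twice, line 3 holds it once.
tmpdir=$(mktemp -d)
printf 'ATGATGCC\nCCCCGGGG\nTTATGTTT\n' > "$tmpdir/demo.txt"

# -c counts lines that contain at least one match.
line_count=$(grep -c 'ATG' "$tmpdir/demo.txt")

# -o prints every (non-overlapping) match separately, so wc -l counts occurrences.
hit_count=$(grep -o 'ATG' "$tmpdir/demo.txt" | wc -l)

echo "matching lines: $line_count, occurrences: $hit_count"
rm -r "$tmpdir"
```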

-n: Show Line Numbers

Input

grep -n 'stop_codon' genes.gtf | head -n 3

Output

3 matches shown

1234:Chr1	TAIR10	stop_codon	5897	5899	.	-	0	gene_id=AT1G01010
5678:Chr1	TAIR10	stop_codon	9128	9130	.	+	0	gene_id=AT1G01020
9012:Chr1	TAIR10	stop_codon	13712	13714	.	-	0	gene_id=AT1G01030

The -n flag adds line numbers before each match. Useful for referencing specific locations in large files.

-A, -B, -C: Context Lines

Get context around matches:

Input

grep -A 2 '^>AT1G01010' arabidopsis.fasta

Output

>AT1G01010.1 | NAC001 | NAC domain protein | chr1:3631-5899 REVERSE
ATGGAGGATCAAGTTGGGTTTGGGTTCCGTCCGAACGACGAGGAGCTCGTTGGTCACTAT
CTCCGTAACAAAATCGAAGGAAACACTAGCCGCGACGTTGAAGTAGCCATCAGCGAGGTA

The -A flag shows lines After the match. -A 2 shows the matching line plus 2 lines after. Get header plus first 2 sequence lines.

Input

grep -B 1 'MAPQ:60' alignments.sam | head -n 6

Output

SRR001666.1234	0	Chr1	3631	60	72M	*	0	0	ATCGATCG...
SRR001666.1234	0	Chr1	3631	60	72M	*	0	0	...	MAPQ:60
--
SRR001666.5678	0	Chr1	8456	60	72M	*	0	0	GCTAGCTA...
SRR001666.5678	0	Chr1	8456	60	72M	*	0	0	...	MAPQ:60

The -B flag shows lines Before the match. Useful for seeing what precedes interesting patterns. The -- separates match groups.

Input

grep -C 1 'ERROR' pipeline.log

Output

[2025-11-20 15:23:45] Processing sample_03
[2025-11-20 15:23:48] ERROR: Failed to open input file
[2025-11-20 15:23:48] Attempting retry with alternative path

The -C flag shows Context (lines before AND after). -C 1 shows 1 line before and 1 line after each match.
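When matches are far apart, grep separates the context groups with a -- line. A small sketch on a made-up log (the messages are hypothetical) shows how many lines each flag produces:

```shell
# Hypothetical log with two ERROR lines that are not adjacent.
tmpdir=$(mktemp -d)
printf 'load sample_01\nERROR: missing index\nretry sample_01\nload sample_02\nload sample_03\nERROR: truncated file\nskip sample_03\n' > "$tmpdir/demo.log"

# -A 1: each match plus one following line, groups separated by "--".
after=$(grep -A 1 'ERROR' "$tmpdir/demo.log")

# -C 1: one line before and one line after each match.
around=$(grep -C 1 'ERROR' "$tmpdir/demo.log")

printf '%s\n' "$after"
rm -r "$tmpdir"
```

If the separator gets in the way of downstream tools, filtering it out with grep -v '^--$' is a common workaround, as long as no real line consists of exactly two hyphens.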

Regular Expressions

Regular expressions (regex) make grep incredibly powerful. They let you match patterns, not just exact text.

Basic Regex Characters

Essential Regular Expression Patterns

# Anchors
^pattern            # Match at start of line
pattern$            # Match at end of line
^pattern$           # Match entire line exactly

# Character classes
.                   # Any single character
[ABC]               # Any one character: A, B, or C
[A-Z]               # Any uppercase letter
[0-9]               # Any digit
[^ABC]              # Any character EXCEPT A, B, or C

# Quantifiers
*                   # Zero or more of previous character
+                   # One or more of previous character (use grep -E)
?                   # Zero or one of previous character (use grep -E)
{n}                 # Exactly n occurrences (use grep -E)
{n,m}               # Between n and m occurrences (use grep -E)

# Special sequences
\t                  # Tab character (use grep -P; plain grep does not interpret \t)
\s                  # Whitespace (GNU grep extension; also works with grep -P)
\w                  # Word character (GNU grep extension; also works with grep -P)
\d                  # Digit (use grep -P; otherwise write [0-9])

# Grouping
(pattern)           # Group patterns (use grep -E)
pattern1|pattern2   # Match pattern1 OR pattern2 (use grep -E)

Format Details

Anchors: Specify position in the line

Characters: Match specific characters or ranges

Quantifiers: Specify how many times to match

Special: Match special character types

Logic: Group and combine patterns
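A few of these building blocks can be tried on a small file. The sketch below uses four made-up sequence lines and counts how many lines each pattern matches.

```shell
# Hypothetical sequence lines, one per line.
tmpdir=$(mktemp -d)
printf 'ATGAAATAA\nCCATGAAA\nATGNNNTGA\nATG\n' > "$tmpdir/demo.txt"

# Anchor: lines that start with ATG.
starts_atg=$(grep -c '^ATG' "$tmpdir/demo.txt")

# Both anchors: lines that are exactly ATG.
exactly_atg=$(grep -c '^ATG$' "$tmpdir/demo.txt")

# Character class plus + quantifier: lines made only of A, C, G, T.
acgt_only=$(grep -c -E '^[ACGT]+$' "$tmpdir/demo.txt")

# Grouping, alternation and end anchor: lines ending in a stop codon.
stop_end=$(grep -c -E '(TAA|TAG|TGA)$' "$tmpdir/demo.txt")

echo "$starts_atg $exactly_atg $acgt_only $stop_end"
rm -r "$tmpdir"
```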

Find Start Codons

Input

grep '^ATG' coding_sequences.fasta

Output

15,432 sequences with ATG start

ATGGAGGATCAAGTTGGGTTTGGGTTCCGTCCGAACGACGAGGAGCTCGTTGGTCACTAT
ATGAACACGAAGGACCACCAGATCACCCAAGTACCACCGCCCCACCTCTCTTCCCACCAA
ATGGCGGATGCTTCACCTTCTTCTCCCCTCGCCGCCTTCCTCTCCAACTCCTTCAGAGAG

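The ^ anchor matches the start of a line, so header lines (which start with >) never match. Find lines beginning with ATG (start codon). If sequences are wrapped across several lines, any wrapped line that happens to begin with ATG matches too, so this finds true sequence starts only when each sequence is on a single line.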
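One way to handle wrapped records is to join each sequence onto one line before anchoring. This is a minimal sketch, assuming a small made-up FASTA (gene1 and gene2 are hypothetical) and using awk for the joining step:

```shell
# Hypothetical wrapped FASTA: gene1 really starts with CCGG, gene2 with ATGC.
tmpdir=$(mktemp -d)
printf '>gene1\nCCGG\nATGA\n>gene2\nATGC\nTTAA\n' > "$tmpdir/wrapped.fasta"

# On the wrapped file, ^ATG also hits the second line of gene1.
raw_hits=$(grep -c '^ATG' "$tmpdir/wrapped.fasta")

# Join the sequence lines of each record, then anchor again.
awk '/^>/ { if (seq != "") print seq; print; seq = ""; next }
     { seq = seq $0 }
     END { if (seq != "") print seq }' "$tmpdir/wrapped.fasta" > "$tmpdir/linear.fasta"
linear_hits=$(grep -c '^ATG' "$tmpdir/linear.fasta")

echo "wrapped: $raw_hits, linearized: $linear_hits"
rm -r "$tmpdir"
```

After joining, $ is equally trustworthy for checking how a sequence ends.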

Find Stop Codons at Sequence End

Input

grep -E "(TAA|TAG|TGA)$" coding_sequences.fasta

Output

GCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAG
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATAA
CCGGCCGGCCGGCCGGCCGGCCGGCCGGCCGGCCGGCCGGTGA

The $ anchor matches the end of a line. -E enables extended regex for | (OR). Find lines ending with any stop codon. In a wrapped FASTA, $ matches the end of every line, not only the end of the sequence.

Match Quality Scores

Input

grep '[I-J]' sample.fastq | wc -l

Output

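198,756,432 lines containing I or J

198756432

The character class [I-J] matches the quality characters I and J, which encode Phred scores 40 and 41 in the Phred+33 encoding. Note that grep | wc -l counts lines containing at least one such character, not individual bases, and that header lines containing an I or J are counted as well.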
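To count bases rather than lines, one option is to keep only the quality line of each record and then count individual characters with -o. The sketch assumes Phred+33 encoding and records of exactly four lines; the two reads are made up.

```shell
# Hypothetical FASTQ: read1 has three Q40+ bases, read2 has none.
tmpdir=$(mktemp -d)
printf '@read1 ILLUMINA\nACGT\n+\nIIJ5\n@read2\nACGT\n+\n5555\n' > "$tmpdir/demo.fastq"

# Line-based count: also picks up the header, which contains the letter I.
line_count=$(grep -c '[IJ]' "$tmpdir/demo.fastq")

# Base-based count: every 4th line is a quality string; -o emits one match per line.
base_count=$(awk 'NR % 4 == 0' "$tmpdir/demo.fastq" | grep -o '[IJ]' | wc -l)

echo "lines: $line_count, Q40+ bases: $base_count"
rm -r "$tmpdir"
```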

Practical Bioinformatics Examples

Extract Specific Genes from FASTA

Extract Gene Sequences by ID

grep '^>AT1G0101' arabidopsis.fasta

Output

>AT1G01010.1 | NAC001 | NAC domain protein
>AT1G01015.1 | Unknown protein
>AT1G01018.1 | Short protein

Filter VCF by Chromosome

Input

grep '^Chr1' variants.vcf | wc -l

Output

245,678 Chr1 variants

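Count variants on chromosome 1. The ^ anchors the match to the start of the line, so header lines that merely mention Chr1 are skipped. If the file also contains Chr10 to Chr19, '^Chr1' matches those too; follow the name with a tab to match the first column exactly.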

Input

grep -v '^#' variants.vcf | grep '^Chr1' | head -n 5 > chr1_variants.vcf

Extract first 5 Chr1 variants, excluding header lines. Combine grep commands to filter precisely.

Find High-Quality Alignments

Input

grep 'MAPQ:60' alignments.sam | wc -l

Output

38,234,567 perfect alignments

38234567

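Count lines containing the literal text 'MAPQ:60'. This works only if that text actually appears in the file: in a standard SAM file, mapping quality is a plain number in the fifth column, so filter on that column instead. A MAPQ of 60 is the highest value reported by aligners such as BWA-MEM and indicates a confidently placed read; the exact scale depends on the aligner.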
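For a standard SAM file, a column-based filter is a safer way to count such reads. The sketch below uses awk on a made-up three-read SAM (read names and positions are hypothetical) and compares it with the text search.

```shell
# Hypothetical SAM: two header lines, three reads with MAPQ 60, 0 and 60.
tmpdir=$(mktemp -d)
printf '@HD\tVN:1.6\n@SQ\tSN:Chr1\tLN:1000\n' > "$tmpdir/demo.sam"
printf 'r1\t0\tChr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n' >> "$tmpdir/demo.sam"
printf 'r2\t0\tChr1\t200\t0\t4M\t*\t0\t0\tACGT\tIIII\n' >> "$tmpdir/demo.sam"
printf 'r3\t16\tChr1\t300\t60\t4M\t*\t0\t0\tACGT\tIIII\n' >> "$tmpdir/demo.sam"

# Text search: finds nothing, because the string MAPQ:60 is not in the file.
# grep exits non-zero on zero matches, hence the "|| true".
by_text=$(grep -c 'MAPQ:60' "$tmpdir/demo.sam" || true)

# Column filter: skip header lines, keep reads whose 5th column equals 60.
by_column=$(awk -F '\t' '!/^@/ && $5 == 60' "$tmpdir/demo.sam" | wc -l)

echo "text search: $by_text, column filter: $by_column"
rm -r "$tmpdir"
```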

Search Gene Annotations

Input

grep -i "zinc finger" gene_annotations.txt | cut -f1,2

Output

234 zinc finger genes

AT1G10480	C2H2-type zinc finger family protein
AT1G27730	C3HC4-type RING finger protein
AT1G51700	Zinc finger (C3HC4-type RING finger) protein
AT2G19130	B-box type zinc finger protein
AT3G46620	LSD1-like zinc finger protein

Find all genes with 'zinc finger' in their description. Case-insensitive search finds Zinc, zinc, ZINC, etc.

Extract FASTQ Reads by ID Pattern

Input

grep -A 3 '^@SRR001666\.1[0-9][0-9][0-9] ' reads.fastq | head -n 20

Output

@SRR001666.1000 071112_SLXA-EAS1_s_7:5:1:817:345 length=72
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACCAAGTT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@SRR001666.1001 071112_SLXA-EAS1_s_7:5:1:818:346 length=72
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

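Extract FASTQ reads with IDs 1000-1999. Each [0-9] matches one digit, and the trailing space ends the ID so that longer numbers such as 10000 do not match. -A 3 prints the three lines after each header, which gives the full record when records are exactly four lines long.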

Combining grep with Pipes

grep's real power emerges when combined with other commands:

Count Sequences by Type

Input

grep '^>' sequences.fasta | grep -c 'mRNA'

Output

15,234 mRNA sequences

First grep extracts all headers, second grep counts those mentioning mRNA. Pipe connects the commands.

Find and Sort Gene Names

Input

grep "^>" proteins.fasta | cut -d"|" -f2 | sort -u | head -n 10

Output

ARV1
DCL1
NAC001
NGA3
PPA1
WRKY45
ZFP1
ZFP2

Extract headers, extract gene names (second field after |), sort uniquely, show first 10. Chain multiple tools to process data.

Filter and Count Variants

Count High-Quality Variants per Chromosome

grep -v '^#' high_confidence.vcf | grep 'PASS' | cut -f1 | sort | uniq -c

Output

  245678 Chr1
  198234 Chr2
  234567 Chr3
  189123 Chr4
  176234 Chr5
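Note that grep 'PASS' in the pipeline above matches PASS anywhere on the line, not just in the FILTER column. A stricter variant, sketched here with awk on a made-up VCF (the NOTE=PASSED_MANUAL annotation is hypothetical), tests column 7 directly:

```shell
# Hypothetical VCF: the second record failed filtering but mentions PASSED in INFO.
tmpdir=$(mktemp -d)
vcf="$tmpdir/demo.vcf"
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > "$vcf"
printf 'Chr1\t10\t.\tA\tG\t99\tPASS\tDP=5\n' >> "$vcf"
printf 'Chr1\t20\t.\tC\tT\t10\tLowQual\tNOTE=PASSED_MANUAL\n' >> "$vcf"
printf 'Chr2\t30\t.\tG\tA\t90\tPASS\tDP=7\n' >> "$vcf"

# Substring match: counts the LowQual record too.
loose_total=$(grep -v '^#' "$vcf" | grep -c 'PASS')

# Column match: FILTER (column 7) must be exactly PASS.
strict_total=$(awk -F '\t' '!/^#/ && $7 == "PASS"' "$vcf" | wc -l)

# Per-chromosome counts using the strict filter.
awk -F '\t' '!/^#/ && $7 == "PASS" { print $1 }' "$vcf" | sort | uniq -c
rm -r "$tmpdir"
```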

Extended Regex with grep -E

For more complex patterns, use grep -E (extended grep) which supports +, ?, |, and grouping:

Input

grep -E "^(ATG|GTG|TTG)" start_codons.fasta

Output

ATGGAGGATCAAGTTGGGTTTGGGTTCCGTCCGAACGAC
GTGAACACGAAGGACCACCAGATCACCCAAGTACCACCGC
TTGATGGCGGATGCTTCACCTTCTTCTCCCCTCGCCGCCTT

Find sequences starting with any of three alternative start codons. The | means OR, parentheses group the alternatives.

Input

grep -E "A{10,}" long_poly_a.fasta

Output

1,234 poly-A stretches

AAAAAAAAAAAAAATCGATCG
GCTAGCTAAAAAAAAAAAAAAAAGCTAGC
TTTTAAAAAAAAAAAAAAAAAAAAAAAGGG

Find sequences with 10 or more consecutive A's. The quantifier {10,} means 10 or more repetitions of the preceding character. Useful for finding poly-A tails.

Input

grep -E "[GC]{3,}[AT]{3,}" gc_rich.fasta

Output

GGGCCCATTATCG
GCGCGCAAAATTT
CCCGGGTTAATTAA

Find sequences with 3+ G/C bases followed by 3+ A/T bases. Demonstrates complex pattern matching.
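Extended patterns combine well with -o when you want the matched stretch itself rather than the whole line. This sketch extracts poly-A runs from three made-up lines and reports the length of each run.

```shell
# Hypothetical sequences containing A-runs of length 12, 5 and 10.
tmpdir=$(mktemp -d)
printf 'AAAAAAAAAAAAGGG\nCCCAAAAACCC\nTTAAAAAAAAAAT\n' > "$tmpdir/demo.txt"

# -o prints only the matched text; awk then prints the length of each run.
run_lengths=$(grep -oE 'A{10,}' "$tmpdir/demo.txt" | awk '{ print length($0) }')

printf '%s\n' "$run_lengths"
rm -r "$tmpdir"
```

The run of five A's on the second line is skipped because it is shorter than the minimum of 10.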

Recursive grep

Search through multiple files in a directory tree:

Input

grep -r 'kinase' annotations/

Output

annotations/chromosome1.gff:Chr1	TAIR10	gene	3631	5899	.	+	.	protein kinase
annotations/chromosome2.gff:Chr2	TAIR10	gene	8456	12345	.	-	.	tyrosine kinase
annotations/functions.txt:Protein kinase activity is essential for signaling

The -r flag searches recursively through all files in the annotations/ directory and its subdirectories.

Input

grep -r --include='*.gff' 'exon' annotations/ | wc -l

Output

456,789 exon annotations

Search recursively but only in .gff files. --include filters which files to search.
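The effect of --include can be checked on a small throwaway directory tree. The sketch below builds one (all file names are hypothetical) and also uses -l, which lists matching file names instead of matching lines.

```shell
# Hypothetical tree: two .gff files (one in a subdirectory) and one .txt file.
tmpdir=$(mktemp -d)
mkdir -p "$tmpdir/annotations/sub"
printf 'Chr1\tTAIR10\texon\t1\t5\n' > "$tmpdir/annotations/a.gff"
printf 'Chr2\tTAIR10\texon\t7\t9\n' > "$tmpdir/annotations/sub/b.gff"
printf 'exon notes\n' > "$tmpdir/annotations/notes.txt"

# All files under annotations/.
all_hits=$(grep -r 'exon' "$tmpdir/annotations" | wc -l)

# Only files whose names end in .gff.
gff_hits=$(grep -r --include='*.gff' 'exon' "$tmpdir/annotations" | wc -l)

# -l prints each matching file name once.
gff_files=$(grep -rl --include='*.gff' 'exon' "$tmpdir/annotations" | wc -l)

echo "all: $all_hits, gff lines: $gff_hits, gff files: $gff_files"
rm -r "$tmpdir"
```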

Performance Tips for Large Files

Use -F for Fixed Strings

Input

grep -F 'ATCGATCGATCG' huge_file.fastq

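The -F flag treats the pattern as a fixed string, not a regex, so characters such as . and * lose their special meaning. It can also be faster, most noticeably when searching for many strings at once with -f.

For a single plain string like this one, GNU grep is usually about as fast with or without -F, so measure on your own data before relying on a speedup.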
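A short sketch of -F together with -f, which reads one pattern per line from a file. The gene table and ID list are made up; the decoy line is there to show what a regex interpretation of the dot would let through.

```shell
# Hypothetical gene table and list of wanted IDs.
tmpdir=$(mktemp -d)
printf 'AT1G01010.1\tNAC001\nAT1G01010x1\tdecoy\nAT1G01020.1\tARV1\nAT1G01030.1\tNGA3\n' > "$tmpdir/genes.txt"
printf 'AT1G01010.1\nAT1G01030.1\n' > "$tmpdir/ids.txt"

# As regexes, the dot in each ID matches any character, so the decoy is counted.
regex_hits=$(grep -c -f "$tmpdir/ids.txt" "$tmpdir/genes.txt")

# As fixed strings, only the literal IDs match.
fixed_hits=$(grep -c -F -f "$tmpdir/ids.txt" "$tmpdir/genes.txt")

echo "regex: $regex_hits, fixed: $fixed_hits"
rm -r "$tmpdir"
```

Fixed strings still match anywhere on the line, so an ID such as AT1G01010.1 would also match a longer ID that contains it.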

Limit Output Early

Input

grep 'MAPQ:60' alignments.sam | head -n 1000 > high_quality.sam

Pipe to head to stop after finding enough matches. Don't process the entire file if you only need a subset.

Use -m to Stop After Matches

Input

grep -m 1000 '^>' sequences.fasta

The -m flag makes grep stop reading after N matching lines. With grep | head, grep also stops soon after head exits, but -m avoids the extra process and makes the limit explicit.
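A quick check of -m on a made-up three-record FASTA (-m is available in GNU and BSD grep):

```shell
# Hypothetical FASTA with three records.
tmpdir=$(mktemp -d)
printf '>s1\nAAA\n>s2\nCCC\n>s3\nGGG\n' > "$tmpdir/demo.fasta"

# Stop after the first two header lines; the third record is never reported.
first_two=$(grep -m 2 '^>' "$tmpdir/demo.fasta")

printf '%s\n' "$first_two"
rm -r "$tmpdir"
```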

Common Mistakes

Forgetting to Quote Patterns

Input

grep protein kinase genes.txt

Output

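grep: kinase: No such file or directory

Without quotes, the shell passes 'protein', 'kinase', and 'genes.txt' as three separate arguments. grep takes 'protein' as the pattern and treats kinase and genes.txt as files to search. The file kinase does not exist, hence the error, and genes.txt is searched for 'protein' alone, so any line containing 'protein' is printed whether or not it mentions kinase.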

Solution:

Input

grep 'protein kinase' genes.txt

Output

AT1G01010	NAC domain protein kinase
AT2G03400	Serine/threonine protein kinase

Quotes treat 'protein kinase' as a single pattern to search for.

Not Escaping Special Characters

Input

grep 'Chr1:3631-5899' annotations.gff

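This particular pattern is safe: outside square brackets, - is an ordinary character, and the pattern contains no regex metacharacters. Problems arise with characters such as ., *, [ and ]. For example, the . in 'AT1G01010.1' matches any single character, so that pattern also matches 'AT1G01010x1'.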

Solution:

Input

grep -F 'Chr1:3631-5899' annotations.gff

Output

Chr1	TAIR10	gene	3631	5899	.	+	.	ID=AT1G01010

Use -F to treat the pattern as a fixed string. Every character, including . and *, is then matched literally.

Forgetting Line Start/End Anchors

Input

grep 'Chr1' variants.vcf | head -n 3

Output

##reference=hg38/Chr1.fasta
Chr1	12345	rs123	A	G	99	PASS	DP=52
Chr10	23456	.	C	T	87	PASS	DP=48

Without ^, this matches Chr1 anywhere in the line, including header comments. And because nothing is required after the 1, it also matches Chr10, Chr11, and so on.

Solution:

Input

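grep -P '^Chr1\t' variants.vcf | head -n 3

Output

Chr1	12345	rs123	A	G	99	PASS	DP=52
Chr1	23456	.	C	T	87	PASS	DP=48
Chr1	34567	rs456	G	A	95	PASS	DP=65

Use ^ for the line start and \t for a tab. Plain grep does not interpret \t, so this uses -P (Perl-compatible regex, available in GNU grep); in bash, grep $'^Chr1\t' is an alternative. Now the pattern matches only Chr1 in the first column, not Chr10 or headers.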
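Where -P is not available, a literal tab can be placed in the pattern through a shell variable. This sketch compares three patterns on a made-up VCF-like file; the variable name tab is just an illustration.

```shell
# Hypothetical file: one header mentioning Chr1, two Chr1 records, one Chr10 record.
tmpdir=$(mktemp -d)
printf '##reference=Chr1.fasta\nChr1\t12345\tA\tG\nChr10\t23456\tC\tT\nChr1\t34567\tG\tA\n' > "$tmpdir/demo.vcf"

# Store a literal tab character for use inside the pattern.
tab=$(printf '\t')

unanchored=$(grep -c 'Chr1' "$tmpdir/demo.vcf")        # header, Chr1 and Chr10 lines
anchored=$(grep -c '^Chr1' "$tmpdir/demo.vcf")         # still includes Chr10
exact=$(grep -c "^Chr1${tab}" "$tmpdir/demo.vcf")      # first column is exactly Chr1

echo "unanchored: $unanchored, anchored: $anchored, exact: $exact"
rm -r "$tmpdir"
```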

Quick Reference

grep Commands Cheat Sheet

# Basic usage
grep 'pattern' file              # Search for pattern in file
grep 'pattern' file1 file2       # Search multiple files
grep -r 'pattern' directory/     # Recursive search

# Common flags
grep -i 'pattern' file           # Case-insensitive
grep -v 'pattern' file           # Invert match (exclude)
grep -c 'pattern' file           # Count matching lines
grep -n 'pattern' file           # Show line numbers
grep -A 2 'pattern' file         # Show 2 lines after match
grep -B 2 'pattern' file         # Show 2 lines before match
grep -C 2 'pattern' file         # Show 2 lines context

# Performance
grep -F 'fixed_string' file      # Fixed string, no regex interpretation
grep -m 100 'pattern' file       # Stop after 100 matches

# Extended regex
grep -E 'pat1|pat2' file         # OR patterns
grep -E 'A{5,}' file             # Quantifiers
grep -E '^(ATG|GTG)' file        # Grouping

# Bioinformatics examples
grep '^>' file.fasta             # FASTA headers
grep -c '^@' file.fastq          # Count FASTQ reads (approx)
grep -v '^#' file.vcf            # Remove VCF headers
grep 'MAPQ:60' file.sam          # High-quality alignments (only if the text MAPQ:60 appears in the file)

Format Details

Basic: Fundamental grep usage

Flags: Most commonly used options

Speed: Optimization for large files

Regex: Extended regular expressions with -E

Genomics: Common bioinformatics patterns

Best Practices

grep Best Practices

Quote your patterns - Prevents shell interpretation of special characters
Use -F for exact matches - Much faster when you don't need regex
Anchor your patterns - Use ^ and $ to be specific about position
Combine with other tools - grep | cut | sort | uniq is powerful
Use -c for counting - Don't pipe to wc -l unnecessarily
Test on small files first - Verify patterns work before processing huge files
Consider memory - grep doesn't load files into memory, safe for huge files
Learn basic regex - Time invested in regex pays off forever

Practice Exercises

Practice in evomics-learn

Practice grep commands with real genomics data

Try these exercises on evomics-learn:

Extract genes by functional annotation
Filter VCF files by quality and chromosome
Find sequences with specific motifs
Count features in GFF files
Search logs for errors and warnings

Next Steps

Now that you can find patterns in files, the next section covers sed - stream editing. sed lets you transform text, replace patterns, and modify files without opening them in editors.

You'll learn:

Find and replace patterns
Delete specific lines
Transform file formats
Edit files in place
Chain sed commands for complex transformations