Stream Editing with sed
The sed command (stream editor) transforms text as it flows through. Unlike text editors that load files into memory, sed processes files line by line, making it perfect for editing multi-gigabyte genomics files.
sed reads input line by line, applies transformations, and writes output. It never loads the entire file into memory, so it can process files of any size.
Why sed for Bioinformatics?
- Transform file formats - Convert between naming conventions
- Clean headers - Simplify FASTA/FASTQ identifiers
- Batch editing - Modify thousands of files consistently
- Fix formatting - Correct spacing, line endings, delimiters
- Memory efficient - Process 100 GB files on a laptop
If you need to change text in a file, sed is probably the right tool. Learn the basics, and you'll use it constantly.
Basic Substitution
The most common sed operation: find and replace.
Syntax: s/old/new/
echo 'sample_01_R1.fastq' | sed 's/fastq/fq/'

sample_01_R1.fq

Basic substitution: s/pattern/replacement/. Reads from echo, replaces 'fastq' with 'fq', and prints the result.
sed 's/Chr/chr/' input.vcf

chr1 12345 rs123 A G 99 PASS DP=52
chr1 23456 . C T 87 PASS DP=48
chr2 34567 rs456 G A 95 PASS DP=65

Replace 'Chr' with 'chr' in every line. By default, sed replaces only the first occurrence per line.
Replace All Occurrences: /g Flag
echo 'ATGATGATG' | sed 's/ATG/---/'

---ATGATG

Without /g, only the first ATG is replaced.

echo 'ATGATGATG' | sed 's/ATG/---/g'

---------

The /g flag means 'global': it replaces ALL occurrences on each line.
Practical Example: Rename Samples in Metadata
sed 's/Sample_/S/g' sample_metadata.txt

S01 Control Replicate1
S02 Control Replicate2
S03 Treatment Replicate1
S04 Treatment Replicate2

Replace the Sample_ prefix with just S in all sample names, producing shorter identifiers for downstream analysis.
In-Place Editing
By default, sed prints to stdout. Use -i to edit files directly:
sed -i 's/Chr/chr/g' variants.vcf

The -i flag edits the file in place. The original file is modified. DANGER: no undo!
sed -i overwrites the original file. There is no undo. Always test your sed command first without -i, or make a backup.
Safe In-Place Editing with Backup
sed -i.bak 's/Chr/chr/g' variants.vcf

The -i.bak option creates a backup as variants.vcf.bak before editing. The original content is preserved with the .bak extension.
Verify the changes:
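For example, assuming the chromosome renaming above, a quick spot check is:

# Count remaining old-style 'Chr' prefixes; 0 means the substitution took effect everywhere
grep -c '^Chr' variants.vcf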
Safe sed Workflow
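The original interactive steps are not reproduced here; a minimal sketch of one such four-step workflow, reusing the chromosome-renaming example, might look like this:

# 1. Test the command on stdout only (no -i); nothing on disk changes
sed 's/Chr/chr/g' variants.vcf | head

# 2. If the preview looks right, edit in place with a backup
sed -i.bak 's/Chr/chr/g' variants.vcf

# 3. Verify the result against the backup
diff variants.vcf.bak variants.vcf | head

# 4. Remove the backup only once you are satisfied
rm variants.vcf.bak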
Deleting Lines
Remove lines matching a pattern:
sed '/^#/d' variants.vcf | head -n 5

Chr1 12345 rs123 A G 99 PASS DP=52 GT:DP 0/1:52
Chr1 23456 . C T 87 PASS DP=48 GT:DP 1/1:48
Chr2 34567 rs456 G A 95 PASS DP=65 GT:DP 0/1:65
Chr2 45678 . T C 92 PASS DP=58 GT:DP 0/1:58
Chr3 56789 rs789 C G 88 PASS DP=45 GT:DP 0/1:45

The /pattern/d syntax deletes lines matching the pattern. Here we delete VCF header lines starting with #.
sed '/^$/d' file_with_blanks.txt

Line 1
Line 2
Line 3

Delete blank lines. ^$ matches lines with nothing between the start (^) and end ($) of the line.
Delete Ranges
sed '1,3d' sequences.fasta

>AT1G01020.1 | ARV1 | ARV1 family protein
ATGAACACGAAGGACCACCAGATCACCCAAGTACCACCGC
>AT1G01030.1 | NGA3 | AP2 domain protein

Delete lines 1-3. Useful for removing headers or unwanted entries at the start of a file.
Multiple sed Commands
Apply several transformations at once:
Method 1: Multiple -e Flags
sed -e 's/Chr/chr/g' -e 's/\.fastq/.fq/g' filenames.txt

sample01_chr1.fq
sample02_chr2.fq
sample03_chr3.fq

Multiple -e flags apply transformations in order: first Chr→chr, then .fastq→.fq.
Method 2: Semicolon Separator
sed 's/Chr/chr/g; s/\.fastq/.fq/g' filenames.txt

sample01_chr1.fq
sample02_chr2.fq
sample03_chr3.fq

Semicolons separate commands. Same result as multiple -e flags, with more compact syntax.
Addressing Specific Lines
Apply commands to specific line numbers:
sed '1d' sequences.fasta

ATGGAGGATCAAGTTGGGTTTGGGTTCCGTCCGAACGAC
>AT1G01020.1 | ARV1 | ARV1 family protein
ATGAACACGAAGGACCACCAGATCACCCAAGTACCACCGC

Delete line 1, removing the first header from a FASTA file.
sed -n '1,4p' sample.fastq

@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=72
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC

Print lines 1-4. The -n flag suppresses default output, and p prints the specified lines. This extracts exactly one FASTQ record.
Regular Expressions with sed
sed uses regular expressions just like grep:
sed "s/[Ss]ample_0*/S/g" metadata.txtS1 Control
S2 Control
S3 Treatment
S4 TreatmentPattern [Ss] matches S or s. Pattern 0* matches zero or more zeros. Handles Sample_01, sample_001, SAMPLE_1, etc.
sed 's/^chr/Chr/' genomic_coords.bed

Chr1 1000 2000 gene1
Chr2 3000 4000 gene2
Chr3 5000 6000 gene3

^ anchors the match to the start of the line. Only 'chr' at the beginning is replaced, not occurrences in the middle of lines.
Practical Bioinformatics Examples
Clean FASTA Headers
sed 's/ .*//' proteins.fasta | head -n 4

>AT1G01010.1
ATGGAGGATCAAGTTGGGTTTGGGTTCCGTCCGAACGAC
>AT1G01020.1
ATGAACACGAAGGACCACCAGATCACCCAAGTACCACCGC

Remove everything after the first space in FASTA headers. The pattern ' .*' matches a space followed by anything. Simplifies headers for downstream tools.
Before:
>AT1G01010.1 | NAC001 | NAC domain protein | chr1:3631-5899 REVERSE
After:
>AT1G01010.1
Convert FASTQ to FASTA
FASTQ to FASTA Conversion
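The original interactive steps are not reproduced here; a minimal sketch using GNU sed's first~step addressing, assuming standard 4-line FASTQ records and an illustrative reads.fastq file, is:

# Line 1 of every 4-line record is the header: turn the leading @ into > and print it.
# Line 2 of every record is the sequence: print it unchanged.
# The + and quality lines are never printed.
sed -n '1~4s/^@/>/p; 2~4p' reads.fastq > reads.fasta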
Fix Windows Line Endings
sed 's/\r$//' windows_file.txt > unix_file.txt

Remove the carriage return (\r) at the end of each line. Converts Windows (CRLF) to UNIX (LF) line endings. Essential when mixing Windows and UNIX files.
Windows uses CRLF (\r\n) for line endings, UNIX uses LF (\n). Mixing formats causes problems for many bioinformatics tools.
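To check which line endings a file uses before converting, two common options (file names here are illustrative) are:

# The 'file' utility reports "with CRLF line terminators" for Windows-formatted text
file suspect_file.txt

# GNU 'cat -A' shows carriage returns as ^M before the $ end-of-line marker
cat -A suspect_file.txt | head -n 3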
Rename Chromosomes
sed 's/^chromosome_/Chr/g' annotations.gff | head -n 3

Chr1 RefSeq gene 1000 2000 . + . ID=gene1
Chr2 RefSeq gene 3000 4000 . + . ID=gene2
Chr3 RefSeq gene 5000 6000 . + . ID=gene3

Standardize chromosome naming: change chromosome_1 to Chr1, chromosome_2 to Chr2, and so on.
Extract Sample Names and Reformat
sed -n "s/.*sample_\([0-9]*\).*/\1/p" filenames.txt01
02
03Extract just the numbers from sample_01, sample_02, etc. Parentheses \\( \\) capture, \\1 refers to captured group.
Add Prefix to IDs
sed 's/^>/>PROJ_/' sequences.fasta | head -n 2

>PROJ_AT1G01010.1 | NAC001 | NAC domain protein
ATGGAGGATCAAGTTGGGTTTGGGTTCCGTCCGAACGAC

Add a PROJ_ prefix to all FASTA headers. Useful for tracking sequences from different projects.
Case-Insensitive Substitution
sed 's/chr/Chr/Ig' mixed_case.bed

Chr1 1000 2000 gene1
Chr2 3000 4000 gene2
Chr3 5000 6000 gene3

The I flag (a GNU sed extension) makes matching case-insensitive, replacing chr, Chr, CHR, cHr, etc. with Chr.
Advanced: Capture Groups
Capture parts of patterns and reuse them:
sed "s/\(chr[0-9]*\):\([0-9]*\)-\([0-9]*\)/\1\t\2\t\3/" coords.txtchr1 3631 5899
chr2 8456 12345
chr3 15234 18976Convert chr1:3631-5899 format to tab-separated columns. \\(pattern\\) captures, \\1 \\2 \\3 are the captured groups.
This converts genomic coordinates from:
chr1:3631-5899
To BED format:
chr1 3631 5899
Swap Columns
sed "s/\([^\t]*\)\t\([^\t]*\)/\2\t\1/" sample_info.txtControl Sample_01
Control Sample_02
Treatment Sample_03Swap first two tab-separated columns. [^\\t]* matches anything except tab. Useful for reformatting data.
Multiple Files
Process many files at once:
sed -i.bak 's/old_reference/new_reference/g' *.vcf

Edit all VCF files in place, creating .bak backups. Change the reference genome identifier in all files simultaneously.
Batch Process Multiple Samples
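The original interactive steps are not reproduced here; one common shell pattern (file names are illustrative) is a loop that writes cleaned copies rather than overwriting the originals:

# Clean FASTA headers for every sample file, writing new *_clean.fasta files
for f in sample_*.fasta; do
    sed 's/ .*//' "$f" > "${f%.fasta}_clean.fasta"
done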
Common Mistakes
Forgetting the /g Flag
echo 'Chr1 Chr2 Chr3' | sed 's/Chr/chr/'

chr1 Chr2 Chr3

Without /g, only the first match per line is replaced; Chr2 and Chr3 remain unchanged.

echo 'Chr1 Chr2 Chr3' | sed 's/Chr/chr/g'

chr1 chr2 chr3

With /g, all matches on the line are replaced.
Using -i Without Testing
sed -i 's/AT/XX/g' important_sequences.fasta

DANGER: This replaces every AT in the file, including within sequences! Always test without -i first.
Better workflow:
Safe sed Testing
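A minimal sketch of that test-first habit, reusing the example above:

# 1. Preview the effect on stdout only; nothing on disk changes
sed 's/AT/XX/g' important_sequences.fasta | head

# 2. Only after inspecting the preview, apply in place with a backup
sed -i.bak 's/AT/XX/g' important_sequences.fasta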
Not Escaping Special Characters
sed 's/file.txt/file.bak/' filenames.txt

file.bak

The unescaped . in the pattern matches ANY character, not a literal period, so this accidentally matches 'fileXtxt', 'file-txt', and so on.
sed 's/file\.txt/file.bak/' filenames.txt

file.bak

Escape the period with \. to match a literal period character.
Performance Considerations
sed is very fast, but here are tips for maximum performance:
- Limit to needed lines: Use /pattern/!d to keep only matching lines (see the example below)
- Quit early: Use q to quit after finding what you need
- Combine operations: One sed call with multiple commands is faster than multiple sed calls
- Use simple patterns: Complex regex is slower than simple string matching
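For example, to keep only FASTA header lines (a small sketch reusing the sequences.fasta file from earlier):

# ! negates the address: delete every line that does NOT start with '>'
sed '/^>/!d' sequences.fasta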
sed -n '1,1000p; 1000q' huge_file.txt

Print the first 1000 lines, then quit. Without the q command, sed would keep reading (and discarding) the rest of the file; with it, processing stops immediately, which matters enormously on huge files.
Quick Reference
sed Commands Cheat Sheet
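The commands covered in this section, at a glance:

s/old/new/          Replace the first match on each line
s/old/new/g         Replace all matches on each line
s/old/new/Ig        Case-insensitive replacement (GNU sed)
/pattern/d          Delete lines matching pattern
/^$/d               Delete blank lines
1,3d                Delete lines 1-3
-n '1,4p'           Print only lines 1-4
-e cmd1 -e cmd2     Apply multiple commands in order
-i                  Edit the file in place (no undo)
-i.bak              Edit in place, keeping a .bak backup
\(...\) and \1      Capture group and backreference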
Best Practices
- Always test without -i first - Verify output before modifying files
- Use -i.bak for backups - Create backup files when editing in place
- Quote your patterns - Prevents shell interpretation issues
- Use /g for complete replacement - Remember global flag for all matches
- Escape special characters - Use \ before . * [ ] $ ^ and other regex characters
- One sed is better than many - Combine operations for better performance
- Document complex commands - Add comments explaining what sed does
- Keep it simple - If sed gets too complex, consider awk or Python
When Not to Use sed
sed is perfect for line-by-line transformations, but consider alternatives for:
- Complex parsing - Use awk for field-based processing
- Multi-line patterns - sed struggles with patterns spanning lines
- Programming logic - Use Python/Perl for if/else, loops, variables
- Binary files - sed is for text only
Practice Exercises
Practice sed commands with genomics files
Try these exercises on evomics-learn:
- Clean FASTA headers for compatibility
- Convert file formats (FASTQ to FASTA)
- Standardize chromosome names across files
- Remove headers and comments from data files
- Batch process multiple samples with consistent edits
Next Steps
Now that you can transform text with sed, the next section covers awk - the most powerful text processing tool in UNIX. awk excels at column-based data processing, calculations, and conditional operations.
You'll learn:
- Process tabular data (GFF, VCF, BED)
- Calculate statistics on columns
- Filter based on numeric thresholds
- Reformat and extract specific fields
- Combine awk with other tools for complex workflows