Stream Editing with sed

The sed command (stream editor) transforms text as it flows through. Unlike text editors that load files into memory, sed processes files line by line, making it perfect for editing multi-gigabyte genomics files.

Stream Editor

sed reads input line by line, applies transformations, and writes output. It never loads the entire file into memory, so it can process files of any size.

Why sed for Bioinformatics?

  • Transform file formats - Convert between naming conventions
  • Clean headers - Simplify FASTA/FASTQ identifiers
  • Batch editing - Modify thousands of files consistently
  • Fix formatting - Correct spacing, line endings, delimiters
  • Memory efficient - Process 100 GB files on a laptop

If you need to change text in a file, sed is probably the right tool. Learn the basics, and you'll use it constantly.

Basic Substitution

The most common sed operation: find and replace.

Syntax: s/old/new/

Input
echo 'sample_01_R1.fastq' | sed 's/fastq/fq/'
Output
sample_01_R1.fq

Basic substitution: s/pattern/replacement/. Reads from echo, replaces 'fastq' with 'fq', prints result.

Input
sed 's/Chr/chr/' input.vcf
Output
chr1	12345	rs123	A	G	99	PASS	DP=52
chr1	23456	.	C	T	87	PASS	DP=48
chr2	34567	rs456	G	A	95	PASS	DP=65

Replace 'Chr' with 'chr' in every line. By default, sed replaces only the first occurrence per line.

Replace All Occurrences: /g Flag

Input
echo 'ATGATGATG' | sed 's/ATG/---/'
Output
---ATGATG

Without /g, only the first ATG is replaced.

Input
echo 'ATGATGATG' | sed 's/ATG/---/g'
Output
---------

The /g flag means 'global' - replaces ALL occurrences on each line.
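
As a small side-by-side sketch, the three occurrence modes of s/// can be compared on the same input. The variable names are only illustrative; the numeric flag (2) is the one listed in the cheat sheet at the end of this page.

```shell
#!/bin/sh
# Compare first-match, global, and "Nth match" substitution on one line.
line='ATGATGATG'

first=$(printf '%s\n' "$line" | sed 's/ATG/---/')    # first match only
all=$(printf '%s\n' "$line" | sed 's/ATG/---/g')     # every match
second=$(printf '%s\n' "$line" | sed 's/ATG/---/2')  # second match only

printf 'first:  %s\nall:    %s\nsecond: %s\n' "$first" "$all" "$second"
# first:  ---ATGATG
# all:    ---------
# second: ATG---ATG
```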

Practical Example: Rename Samples in Metadata

Input
sed 's/Sample_/S/g' sample_metadata.txt
Output
S01	Control	Replicate1
S02	Control	Replicate2
S03	Treatment	Replicate1
S04	Treatment	Replicate2

Replace Sample_ prefix with just S in all sample names. Shorter identifiers for downstream analysis.

In-Place Editing

By default, sed prints to stdout. Use -i to edit files directly:

Input
sed -i 's/Chr/chr/g' variants.vcf

The -i flag edits the file in place: the original file is modified and there is no undo. On macOS and other BSD systems, sed requires an argument after -i (use -i '' for no backup).

In-Place Editing is Permanent

sed -i overwrites the original file. There is no undo. Always test your sed command first without -i, or make a backup.

Safe In-Place Editing with Backup

Input
sed -i.bak 's/Chr/chr/g' variants.vcf

The -i.bak form saves the original as variants.vcf.bak before editing variants.vcf.

After an in-place edit, verify the changes by comparing the edited file with its backup (for example, with diff).

Safe sed Workflow

Start by copying the file before you edit it:

cp important_file.txt important_file.txt.backup
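
One possible cautious workflow, sketched under assumptions: the file name and contents below are made up, and the steps (preview, edit with a backup, compare) are a suggestion rather than this page's original step list. The -i.bak spelling is used because it works with both GNU and BSD sed.

```shell
#!/bin/sh
# Sketch of a cautious in-place edit on a throwaway file.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'Chr1\t12345\tA\tG\nChr2\t34567\tG\tA\n' > variants.vcf   # toy data

# 1. Preview the result without touching the file.
sed 's/^Chr/chr/' variants.vcf | head -n 5

# 2. Edit in place, keeping the original as variants.vcf.bak.
sed -i.bak 's/^Chr/chr/' variants.vcf

# 3. Verify: count the lines that differ from the backup.
changed=$(diff variants.vcf.bak variants.vcf | grep -c '^>')
echo "lines changed: $changed"
```

If the count is not what you expected, the backup can simply be copied back over the edited file.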

Deleting Lines

Remove lines matching a pattern:

Input
sed '/^#/d' variants.vcf | head -n 5
Output
Chr1	12345	rs123	A	G	99	PASS	DP=52	GT:DP	0/1:52
Chr1	23456	.	C	T	87	PASS	DP=48	GT:DP	1/1:48
Chr2	34567	rs456	G	A	95	PASS	DP=65	GT:DP	0/1:65
Chr2	45678	.	T	C	92	PASS	DP=58	GT:DP	0/1:58
Chr3	56789	rs789	C	G	88	PASS	DP=45	GT:DP	0/1:45

The /pattern/d syntax deletes lines matching the pattern. Here we delete VCF header lines starting with #.

Input
sed '/^$/d' file_with_blanks.txt
Output
Line 1
Line 2
Line 3

Delete blank lines. ^$ matches lines with nothing between start (^) and end ($).
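
Note that ^$ matches only truly empty lines. A line holding just spaces or tabs looks blank but is not matched. The sketch below, using made-up input, contrasts the two patterns; [[:space:]] is a POSIX character class.

```shell
#!/bin/sh
# Six lines: three with text, one empty, one with spaces, one with a tab.
input=$(mktemp)
printf 'Line 1\n\n   \nLine 2\n\t\nLine 3\n' > "$input"

# Removes only the empty line.
strict=$(sed '/^$/d' "$input" | wc -l | tr -d ' ')

# Removes empty lines and whitespace-only lines.
loose=$(sed '/^[[:space:]]*$/d' "$input" | wc -l | tr -d ' ')

echo "lines left with /^\$/d: $strict"              # 5
echo "lines left with /^[[:space:]]*\$/d: $loose"   # 3
rm -f "$input"
```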

Delete Ranges

Input
sed '1,3d' sequences.fasta
Output
>AT1G01020.1 | ARV1 | ARV1 family protein
ATGAACACGAAGGACCACCAGATCACCCAAGTACCACCGC
>AT1G01030.1 | NGA3 | AP2 domain protein

Delete lines 1-3. Useful for removing headers or unwanted entries at file start.

Multiple sed Commands

Apply several transformations at once:

Method 1: Multiple -e Flags

Input
sed -e 's/Chr/chr/g' -e 's/\.fastq/.fq/g' filenames.txt
Output
sample01_chr1.fq
sample02_chr2.fq
sample03_chr3.fq

Multiple -e flags apply transformations in order. First Chr→chr, then .fastq→.fq.

Method 2: Semicolon Separator

Input
sed 's/Chr/chr/g; s/\.fastq/.fq/g' filenames.txt
Output
sample01_chr1.fq
sample02_chr2.fq
sample03_chr3.fq

Semicolons separate commands. Same result as multiple -e flags, more compact syntax.
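
A quick way to convince yourself the two forms are equivalent, and that order matters, is to run them on the same input. This is an illustrative sketch with made-up file names.

```shell
#!/bin/sh
names='sample01_Chr1.fastq
sample02_Chr2.fastq'

with_e=$(printf '%s\n' "$names" | sed -e 's/Chr/chr/g' -e 's/\.fastq$/.fq/')
with_semicolon=$(printf '%s\n' "$names" | sed 's/Chr/chr/g; s/\.fastq$/.fq/')

# Each command sees the output of the one before it:
# Chr becomes chr, and that chr is then turned into CHR.
chained=$(printf 'Chr\n' | sed 's/Chr/chr/; s/chr/CHR/')

printf '%s\n' "$with_e"
echo "chained: $chained"   # chained: CHR
```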

Addressing Specific Lines

Apply commands to specific line numbers:

Input
sed '1d' sequences.fasta
Output
ATGGAGGATCAAGTTGGGTTTGGGTTCCGTCCGAACGAC
>AT1G01020.1 | ARV1 | ARV1 family protein
ATGAACACGAAGGACCACCAGATCACCCAAGTACCACCGC

Delete line 1. Remove the first header from a FASTA file.

Input
sed -n '1,4p' sample.fastq
Output
@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=72
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC

Print lines 1-4. The -n flag suppresses default output, p prints specified lines. Extract exactly one FASTQ record.
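
Because every FASTQ record occupies exactly four lines, the same idea extends to any record: record n spans lines 4n-3 to 4n. The sketch below computes that range in the shell and passes it to sed; the toy reads and the variable names are assumptions for illustration.

```shell
#!/bin/sh
fq=$(mktemp)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n@read3\nGGCC\n+\nIIII\n' > "$fq"

n=2                              # which record to extract
start=$(( (n - 1) * 4 + 1 ))     # first line of record n (5)
end=$(( n * 4 ))                 # last line of record n (8)

# Double quotes let the shell expand the variables; q stops reading after the record.
record=$(sed -n "${start},${end}p;${end}q" "$fq")
printf '%s\n' "$record"
rm -f "$fq"
```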

Regular Expressions with sed

sed uses regular expressions just like grep:

Input
sed "s/[Ss]ample_0*/S/g" metadata.txt
Output
S1	Control
S2	Control
S3	Treatment
S4	Treatment

Pattern [Ss] matches S or s. Pattern 0* matches zero or more zeros. Handles Sample_01, sample_001, Sample_1, etc. An all-uppercase SAMPLE_1 is not matched, because only the first letter is allowed to vary.
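
To see exactly which names the pattern covers, it can be run against a few hand-written identifiers (made up for this sketch):

```shell
#!/bin/sh
out=$(printf 'Sample_01\nsample_001\nSAMPLE_1\nSample_10\n' | sed 's/[Ss]ample_0*/S/')
printf '%s\n' "$out"
# S1
# S1
# SAMPLE_1   (unchanged: [Ss] only varies the first letter)
# S10        (0* removes leading zeros only)
```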

Input
sed 's/^chr/Chr/' genomic_coords.bed
Output
Chr1	1000	2000	gene1
Chr2	3000	4000	gene2
Chr3	5000	6000	gene3

^ anchors to line start. Only replaces 'chr' at the beginning, not in the middle of lines.

Practical Bioinformatics Examples

Clean FASTA Headers

Input
sed 's/ .*//' proteins.fasta | head -n 4
Output
>AT1G01010.1
ATGGAGGATCAAGTTGGGTTTGGGTTCCGTCCGAACGAC
>AT1G01020.1
ATGAACACGAAGGACCACCAGATCACCCAAGTACCACCGC

Remove everything after the first space in FASTA headers. Pattern ' .*' matches space followed by anything. Simplifies headers for downstream tools.

Before:

>AT1G01010.1 | NAC001 | NAC domain protein | chr1:3631-5899 REVERSE

After:

>AT1G01010.1
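
The command above runs on every line, which is fine as long as sequence lines contain no spaces. A slightly more defensive variant, sketched here on made-up input, adds a /^>/ address so that only header lines are touched:

```shell
#!/bin/sh
# The second line deliberately contains a space to show it is left alone.
fasta='>AT1G01010.1 | NAC001 | NAC domain protein
ATGGAGGATC AAGTTGGGTT'

cleaned=$(printf '%s\n' "$fasta" | sed '/^>/ s/ .*//')
printf '%s\n' "$cleaned"
# >AT1G01010.1
# ATGGAGGATC AAGTTGGGTT
```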

Convert FASTQ to FASTA

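Input
sed -n '1~4s/^@/>/p;2~4p' sample.fastq | head -n 4
Output
>SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=72
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
>SRR001666.2 071112_SLXA-EAS1_s_7:5:1:818:346 length=72
ATCGATCGATCGATCGATCGATCGATCGATCGATCG

The address 1~4 selects line 1 of every 4-line record, where the leading @ is replaced with > and the line is printed. The address 2~4 selects line 2 of every record, the sequence, which is printed unchanged. The + and quality lines are never printed. The first~step address form is a GNU sed extension.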
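
Addressing by line number rather than by /^@/ matters because a quality line may itself begin with @. The sketch below uses two made-up reads, the first with such a quality string, and assumes GNU sed for the first~step addresses.

```shell
#!/bin/sh
# Requires GNU sed (first~step addressing).
fq=$(mktemp)
printf '@read1\nACGT\n+\n@III\n@read2\nTTGA\n+\nIIII\n' > "$fq"

# Line 1 of each record: @ becomes >. Line 2 of each record: printed as is.
fasta=$(sed -n '1~4s/^@/>/p;2~4p' "$fq")
printf '%s\n' "$fasta"
# >read1
# ACGT
# >read2
# TTGA
rm -f "$fq"
```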

Fix Windows Line Endings

Input
sed 's/\r$//' windows_file.txt > unix_file.txt

Remove the carriage return (\r) at the end of each line, converting Windows (CRLF) line endings to UNIX (LF). The \r escape is understood by GNU sed; other sed implementations may need a literal carriage return in the pattern.

Windows uses CRLF (\r\n) for line endings, UNIX uses LF (\n). Mixing formats causes problems for many bioinformatics tools.
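
Where \r is not supported, one workaround is to let the shell produce the carriage return and splice it into the pattern. This is a sketch with toy data; the variable name cr is arbitrary.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cr=$(printf '\r')   # a literal carriage return character

printf 'chr1\t100\r\nchr2\t200\r\n' > "$workdir/windows_file.txt"

# Double quotes so $cr expands; \$ keeps the end-of-line anchor literal for sed.
sed "s/${cr}\$//" "$workdir/windows_file.txt" > "$workdir/unix_file.txt"

before=$(grep -c "$cr" "$workdir/windows_file.txt")
after=$(grep -c "$cr" "$workdir/unix_file.txt")
echo "lines with CR before: $before, after: $after"   # before: 2, after: 0
```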

Rename Chromosomes

Input
sed 's/^chromosome_/Chr/g' annotations.gff | head -n 3
Output
Chr1	RefSeq	gene	1000	2000	.	+	.	ID=gene1
Chr2	RefSeq	gene	3000	4000	.	+	.	ID=gene2
Chr3	RefSeq	gene	5000	6000	.	+	.	ID=gene3

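Standardize chromosome naming. Change chromosome_1 to Chr1, chromosome_2 to Chr2, etc. Because the pattern is anchored with ^, it can match only once per line, so the g flag is harmless but redundant here.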

Extract Sample Names and Reformat

Input
sed -n "s/.*sample_\([0-9]*\).*/\1/p" filenames.txt
Output
01
02
03

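Extract just the numbers from sample_01, sample_02, etc. Escaped parentheses \( \) capture part of the match, and \1 refers to the captured group. With -n and the p flag, only lines where the substitution succeeded are printed.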
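
Captured groups can also be rearranged into a new name rather than just extracted. The sketch below assumes a hypothetical naming scheme (sample_NN_R1.fastq or sample_NN_R2.fastq) and builds a short label from two groups.

```shell
#!/bin/sh
# Group 1: the sample number. Group 2: the read number (1 or 2).
labels=$(printf 'sample_01_R1.fastq\nsample_02_R2.fastq\nnotes.txt\n' |
  sed -n 's/.*sample_\([0-9]*\)_R\([12]\)\.fastq$/S\1 read\2/p')

printf '%s\n' "$labels"
# S01 read1
# S02 read2
# (notes.txt does not match, so -n suppresses it)
```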

Add Prefix to IDs

Input
sed 's/^>/>PROJ_/' sequences.fasta | head -n 2
Output
>PROJ_AT1G01010.1 | NAC001 | NAC domain protein
ATGGAGGATCAAGTTGGGTTTGGGTTCCGTCCGAACGAC

Add PROJ_ prefix to all FASTA headers. Useful for tracking sequences from different projects.

Case-Insensitive Substitution

Input
sed 's/chr/Chr/Ig' mixed_case.bed
Output
Chr1	1000	2000	gene1
Chr2	3000	4000	gene2
Chr3	5000	6000	gene3

The I flag makes matching case-insensitive. Replaces chr, Chr, CHR, cHr, etc. with Chr. The I flag is a GNU sed extension.
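
On a sed without the I flag, the same effect can be approximated with bracket expressions that list both cases of each letter. A minimal sketch on made-up input, anchored to the line start:

```shell
#!/bin/sh
normalized=$(printf 'chr1\nCHR2\ncHr3\n' | sed 's/^[Cc][Hh][Rr]/Chr/')
printf '%s\n' "$normalized"
# Chr1
# Chr2
# Chr3
```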

Advanced: Capture Groups

Capture parts of patterns and reuse them:

Input
sed "s/\(chr[0-9]*\):\([0-9]*\)-\([0-9]*\)/\1\t\2\t\3/" coords.txt
Output
chr1	3631	5899
chr2	8456	12345
chr3	15234	18976

Convert chr1:3631-5899 format to tab-separated columns. \(pattern\) captures, and \1 \2 \3 are the captured groups. Writing \t for a tab in the replacement relies on GNU sed.

This converts genomic coordinates from:

chr1:3631-5899

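To tab-separated, BED-style columns (true BED start coordinates are 0-based, so a 1-based region would also need its start decremented by 1):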

chr1 3631 5899
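
If \t is not available in your sed, the tab can be supplied by the shell instead. The following sketch performs the same conversion; the variable name tab is arbitrary.

```shell
#!/bin/sh
tab=$(printf '\t')   # a literal tab character

# Inside double quotes the shell expands ${tab} but leaves \( \) and \1 alone.
converted=$(printf 'chr1:3631-5899\nchr2:8456-12345\n' |
  sed "s/\(chr[0-9]*\):\([0-9]*\)-\([0-9]*\)/\1${tab}\2${tab}\3/")

printf '%s\n' "$converted"
```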

Swap Columns

Input
sed "s/\([^\t]*\)\t\([^\t]*\)/\2\t\1/" sample_info.txt
Output
Control	Sample_01
Control	Sample_02
Treatment	Sample_03

Swap the first two tab-separated columns. [^\t]* matches any run of characters other than a tab (GNU sed interprets \t inside brackets). Useful for reformatting data.

Multiple Files

Process many files at once:

Input
sed -i.bak 's/old_reference/new_reference/g' *.vcf

Edit all VCF files in place, creating a .bak backup of each. Changes the reference genome identifier in every file with a single command.

Batch Process Multiple Samples

Input
for file in *.fasta; do sed 's/^>/>PROJECT_/' "$file" > "processed_$file"; echo "Created processed_$file"; done
Output
Created processed_sample01.fasta
Created processed_sample02.fasta
Created processed_sample03.fasta
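
One caveat with writing processed_*.fasta next to the inputs is that a second run of the loop would pick those outputs up as inputs too. A possible variant, sketched with made-up files and a hypothetical processed/ directory, keeps the outputs separate:

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Toy inputs for the demonstration.
for n in 01 02; do
  printf '>seq%s description\nATGC\n' "$n" > "sample$n.fasta"
done

mkdir -p processed
for file in *.fasta; do
  [ -e "$file" ] || continue                        # skip if the glob matched nothing
  sed 's/^>/>PROJECT_/' "$file" > "processed/$file"
  echo "Created processed/$file"
done
```

Quoting "$file" keeps the loop working for file names that contain spaces.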

Common Mistakes

Forgetting the /g Flag

Input
echo 'Chr1 Chr2 Chr3' | sed 's/Chr/chr/'
Output
chr1 Chr2 Chr3

Without /g, only the first match per line is replaced. Chr2 and Chr3 remain unchanged.

Input
echo 'Chr1 Chr2 Chr3' | sed 's/Chr/chr/g'
Output
chr1 chr2 chr3

With /g, all matches on the line are replaced.

Using -i Without Testing

Input
sed -i 's/AT/XX/g' important_sequences.fasta

DANGER: This replaces every AT in the file, including in sequences! Always test without -i first.

Better workflow:

Safe sed Testing

Input
sed 's/AT/XX/g' important_sequences.fasta | head -n 10

This prints the modified content without changing the file.
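
Testing first also gives you the chance to narrow the command. If the intent was to change only the headers, an address such as /^>/ restricts the substitution to those lines. A sketch on a made-up two-line record:

```shell
#!/bin/sh
fasta='>AT1G01010.1
ATGGAT'

# Unrestricted: the sequence line is altered as well.
careless=$(printf '%s\n' "$fasta" | sed 's/AT/XX/g')

# Restricted to header lines: the sequence is left intact.
targeted=$(printf '%s\n' "$fasta" | sed '/^>/ s/AT/XX/g')

printf 'careless:\n%s\ntargeted:\n%s\n' "$careless" "$targeted"
```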

Not Escaping Special Characters

Input
sed 's/file.txt/file.bak/' filenames.txt
Output
file.bak

The . in the pattern matches ANY character, not a literal period. This accidentally matches 'fileXtxt', 'file-txt', etc.

Input
sed 's/file\.txt/file.bak/' filenames.txt
Output
file.bak

Escape the period with \. to match a literal period character.
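
The difference only shows up on input that contains a near miss, so here is a small sketch with one made-up name of each kind:

```shell
#!/bin/sh
names='file.txt
fileXtxt'

unescaped=$(printf '%s\n' "$names" | sed 's/file.txt/MATCH/')
escaped=$(printf '%s\n' "$names" | sed 's/file\.txt/MATCH/')

printf 'unescaped:\n%s\nescaped:\n%s\n' "$unescaped" "$escaped"
# unescaped: both lines become MATCH
# escaped:   only file.txt becomes MATCH; fileXtxt is left alone
```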

Performance Considerations

sed is very fast, but here are tips for maximum performance:

sed Performance Tips
  1. Limit to needed lines: Use /pattern/!d to only process matching lines
  2. Quit early: Use q to quit after finding what you need
  3. Combine operations: One sed with multiple commands is faster than multiple sed calls
  4. Use simple patterns: Complex regex is slower than simple string matching

Input
sed -n '1,1000p;1000q' huge_file.txt

Print the first 1000 lines, then quit. This is much faster than sed -n '1,1000p' on its own, which prints the same lines but keeps reading to the end of the file. The output matches head -n 1000, which also stops early.
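
A small sketch of the quit-early idea on a generated file (the file contents and variable names are made up). All three commands stop after the third line and produce the same output:

```shell
#!/bin/sh
big=$(mktemp)
awk 'BEGIN { for (i = 1; i <= 100000; i++) print "line" i }' > "$big"

with_quit=$(sed -n '1,3p;3q' "$big")   # print lines 1-3, quit at line 3
short_form=$(sed '3q' "$big")          # default printing, quit at line 3
with_head=$(head -n 3 "$big")          # the usual tool for this job

printf '%s\n' "$with_quit"
rm -f "$big"
```

The q command becomes more useful when combined with other sed commands, for example applying a substitution to the first few lines only.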

Quick Reference

sed Commands Cheat Sheet

# Substitution
sed 's/old/new/' file                  # Replace first occurrence per line
sed 's/old/new/g' file                 # Replace all occurrences (global)
sed 's/old/new/Ig' file                # Case-insensitive replacement (GNU sed)
sed 's/old/new/2' file                 # Replace second occurrence only

# In-place editing
sed -i 's/old/new/g' file              # Edit file directly (DANGER!)
sed -i.bak 's/old/new/g' file          # Edit with backup

# Deleting
sed '/pattern/d' file                  # Delete matching lines
sed '1d' file                          # Delete line 1
sed '1,10d' file                       # Delete lines 1-10
sed '/^$/d' file                       # Delete blank lines

# Printing
sed -n '1,10p' file                    # Print lines 1-10
sed -n '/pattern/p' file               # Print matching lines

# Multiple commands
sed -e 's/a/A/g' -e 's/b/B/g' file
sed 's/a/A/g; s/b/B/g' file            # Same with semicolon

# Advanced
sed 's/\(pattern\)/\1_suffix/' file    # Capture groups
sed -n '1~4s/^@/>/p;2~4p' file.fastq   # FASTQ to FASTA (GNU sed)

The cheat sheet is grouped by task:

  • Replace: Find and replace text
  • In-place: Modify files directly
  • Delete: Remove lines
  • Print: Extract specific lines
  • Multiple: Chain commands together
  • Advanced: Regex and complex patterns

Best Practices

sed Best Practices
  1. Always test without -i first - Verify output before modifying files
  2. Use -i.bak for backups - Create backup files when editing in place
  3. Quote your patterns - Prevents shell interpretation issues
  4. Use /g for complete replacement - Remember global flag for all matches
  5. Escape special characters - Use \ before . * [ ] $ ^ and other regex characters
  6. One sed is better than many - Combine operations for better performance
  7. Document complex commands - Add comments explaining what sed does
  8. Keep it simple - If sed gets too complex, consider awk or Python

When Not to Use sed

sed is perfect for line-by-line transformations, but consider alternatives for:

  • Complex parsing - Use awk for field-based processing
  • Multi-line patterns - sed struggles with patterns spanning lines
  • Programming logic - Use Python/Perl for if/else, loops, variables
  • Binary files - sed is for text only

Practice Exercises

Try these exercises on evomics-learn:

  1. Clean FASTA headers for compatibility
  2. Convert file formats (FASTQ to FASTA)
  3. Standardize chromosome names across files
  4. Remove headers and comments from data files
  5. Batch process multiple samples with consistent edits

Next Steps

Now that you can transform text with sed, the next section covers awk - the most powerful text processing tool in UNIX. awk excels at column-based data processing, calculations, and conditional operations.

You'll learn:

  • Process tabular data (GFF, VCF, BED)
  • Calculate statistics on columns
  • Filter based on numeric thresholds
  • Reformat and extract specific fields
  • Combine awk with other tools for complex workflows

Further Reading