Stream Editing with sed
The sed command (stream editor) transforms text as it flows through. Unlike text editors that load files into memory, sed processes files line by line, making it perfect for editing multi-gigabyte genomics files.
sed reads input line by line, applies transformations, and writes output. It never loads the entire file into memory, so it can process files of any size.
Why sed for Bioinformatics?
- Transform file formats - Convert between naming conventions
- Clean headers - Simplify FASTA/FASTQ identifiers
- Batch editing - Modify thousands of files consistently
- Fix formatting - Correct spacing, line endings, delimiters
- Memory efficient - Process 100 GB files on a laptop
If you need to change text in a file, sed is probably the right tool. Learn the basics, and you'll use it constantly.
Basic Substitution
The most common sed operation: find and replace.
Syntax: s/old/new/
echo 'sample_01_R1.fastq' | sed 's/fastq/fq/'

sample_01_R1.fq

Basic substitution: s/pattern/replacement/. Reads from echo, replaces 'fastq' with 'fq', and prints the result.
sed 's/Chr/chr/' input.vcf

chr1 12345 rs123 A G 99 PASS DP=52
chr1 23456 . C T 87 PASS DP=48
chr2 34567 rs456 G A 95 PASS DP=65

Replace 'Chr' with 'chr' in every line. By default, sed replaces only the first occurrence per line.
Replace All Occurrences: /g Flag
echo 'ATGATGATG' | sed 's/ATG/---/'

---ATGATG

Without /g, only the first ATG is replaced.

echo 'ATGATGATG' | sed 's/ATG/---/g'

---------

The /g flag means 'global': it replaces ALL occurrences on each line.
Practical Example: Rename Samples in Metadata
sed 's/Sample_/S/g' sample_metadata.txt

S01 Control Replicate1
S02 Control Replicate2
S03 Treatment Replicate1
S04 Treatment Replicate2

Replace the Sample_ prefix with just S in all sample names, producing shorter identifiers for downstream analysis.
In-Place Editing
By default, sed prints to stdout. Use -i to edit files directly:
sed -i 's/Chr/chr/g' variants.vcf

The -i flag edits the file in place. The original file is modified. DANGER: no undo!
sed -i overwrites the original file. There is no undo. Always test your sed command first without -i, or make a backup.
Safe In-Place Editing with Backup
sed -i.bak 's/Chr/chr/g' variants.vcf

The -i.bak option creates a backup as variants.vcf.bak before editing. The original content is preserved with the .bak extension.
Verify the changes:
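For example, assuming the chromosome renaming above, a quick spot check is:

# Count remaining old-style 'Chr' prefixes; 0 means the substitution took effect everywhere
grep -c '^Chr' variants.vcf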
Safe sed Workflow
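The original interactive steps are not reproduced here; a minimal sketch of one such four-step workflow, reusing the chromosome-renaming example, might look like this:

# 1. Test the command on stdout only (no -i); nothing on disk changes
sed 's/Chr/chr/g' variants.vcf | head

# 2. If the preview looks right, edit in place with a backup
sed -i.bak 's/Chr/chr/g' variants.vcf

# 3. Verify the result against the backup
diff variants.vcf.bak variants.vcf | head

# 4. Remove the backup only once you are satisfied
rm variants.vcf.bak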
Deleting Lines
Remove lines matching a pattern:
sed '/^#/d' variants.vcf | head -n 5

Chr1 12345 rs123 A G 99 PASS DP=52 GT:DP 0/1:52
Chr1 23456 . C T 87 PASS DP=48 GT:DP 1/1:48
Chr2 34567 rs456 G A 95 PASS DP=65 GT:DP 0/1:65
Chr2 45678 . T C 92 PASS DP=58 GT:DP 0/1:58
Chr3 56789 rs789 C G 88 PASS DP=45 GT:DP 0/1:45

The /pattern/d syntax deletes lines matching the pattern. Here we delete VCF header lines starting with #.
sed '/^$/d' file_with_blanks.txt

Line 1
Line 2
Line 3

Delete blank lines. ^$ matches lines with nothing between the start (^) and end ($) of the line.
Delete Ranges
sed '1,3d' sequences.fasta

>AT1G01020.1 | ARV1 | ARV1 family protein
ATGAACACGAAGGACCACCAGATCACCCAAGTACCACCGC
>AT1G01030.1 | NGA3 | AP2 domain protein

Delete lines 1-3. Useful for removing headers or unwanted entries at the start of a file.
Multiple sed Commands
Apply several transformations at once:
Method 1: Multiple -e Flags
sed -e 's/Chr/chr/g' -e 's/\.fastq/.fq/g' filenames.txt

sample01_chr1.fq
sample02_chr2.fq
sample03_chr3.fq

Multiple -e flags apply transformations in order: first Chr→chr, then .fastq→.fq.
Method 2: Semicolon Separator
sed 's/Chr/chr/g; s/\.fastq/.fq/g' filenames.txt

sample01_chr1.fq
sample02_chr2.fq
sample03_chr3.fq

Semicolons separate commands. Same result as multiple -e flags, with more compact syntax.
Addressing Specific Lines
Apply commands to specific line numbers:
sed '1d' sequences.fasta

ATGGAGGATCAAGTTGGGTTTGGGTTCCGTCCGAACGAC
>AT1G01020.1 | ARV1 | ARV1 family protein
ATGAACACGAAGGACCACCAGATCACCCAAGTACCACCGC

Delete line 1, removing the first header from a FASTA file.
sed -n '1,4p' sample.fastq

@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=72
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC

Print lines 1-4. The -n flag suppresses default output, and p prints the specified lines. This extracts exactly one FASTQ record.
Regular Expressions with sed
sed uses regular expressions just like grep:
sed "s/[Ss]ample_0*/S/g" metadata.txtS1 Control
S2 Control
S3 Treatment
S4 TreatmentPattern [Ss] matches S or s. Pattern 0* matches zero or more zeros. Handles Sample_01, sample_001, SAMPLE_1, etc.
sed 's/^chr/Chr/' genomic_coords.bed

Chr1 1000 2000 gene1
Chr2 3000 4000 gene2
Chr3 5000 6000 gene3

^ anchors the match to the start of the line. Only 'chr' at the beginning is replaced, not occurrences in the middle of lines.
Practical Bioinformatics Examples
Clean FASTA Headers
sed 's/ .*//' proteins.fasta | head -n 4

>AT1G01010.1
ATGGAGGATCAAGTTGGGTTTGGGTTCCGTCCGAACGAC
>AT1G01020.1
ATGAACACGAAGGACCACCAGATCACCCAAGTACCACCGC

Remove everything after the first space in FASTA headers. The pattern ' .*' matches a space followed by anything. Simplifies headers for downstream tools.
Before:
>AT1G01010.1 | NAC001 | NAC domain protein | chr1:3631-5899 REVERSE
After:
>AT1G01010.1
Convert FASTQ to FASTA
FASTQ to FASTA Conversion
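The original interactive steps are not reproduced here; a minimal sketch using GNU sed's first~step addressing, assuming standard 4-line FASTQ records and an illustrative reads.fastq file, is:

# Line 1 of every 4-line record is the header: turn the leading @ into > and print it.
# Line 2 of every record is the sequence: print it unchanged.
# The + and quality lines are never printed.
sed -n '1~4s/^@/>/p; 2~4p' reads.fastq > reads.fasta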
Fix Windows Line Endings
sed 's/\r$//' windows_file.txt > unix_file.txt

Remove the carriage return (\r) at the end of each line. Converts Windows (CRLF) to UNIX (LF) line endings. Essential when mixing Windows and UNIX files.
Windows uses CRLF (\r\n) for line endings, UNIX uses LF (\n). Mixing formats causes problems for many bioinformatics tools.
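To check which line endings a file uses before converting, two common options (file names here are illustrative) are:

# The 'file' utility reports "with CRLF line terminators" for Windows-formatted text
file suspect_file.txt

# GNU 'cat -A' shows carriage returns as ^M before the $ end-of-line marker
cat -A suspect_file.txt | head -n 3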
Rename Chromosomes
sed 's/^chromosome_/Chr/g' annotations.gff | head -n 3

Chr1 RefSeq gene 1000 2000 . + . ID=gene1
Chr2 RefSeq gene 3000 4000 . + . ID=gene2
Chr3 RefSeq gene 5000 6000 . + . ID=gene3

Standardize chromosome naming: change chromosome_1 to Chr1, chromosome_2 to Chr2, and so on.
Extract Sample Names and Reformat
sed -n "s/.*sample_\([0-9]*\).*/\1/p" filenames.txt01
02
03Extract just the numbers from sample_01, sample_02, etc. Parentheses \\( \\) capture, \\1 refers to captured group.
Add Prefix to IDs
sed 's/^>/>PROJ_/' sequences.fasta | head -n 2

>PROJ_AT1G01010.1 | NAC001 | NAC domain protein
ATGGAGGATCAAGTTGGGTTTGGGTTCCGTCCGAACGAC

Add a PROJ_ prefix to all FASTA headers. Useful for tracking sequences from different projects.
Case-Insensitive Substitution
sed 's/chr/Chr/Ig' mixed_case.bed

Chr1 1000 2000 gene1
Chr2 3000 4000 gene2
Chr3 5000 6000 gene3

The I flag (a GNU sed extension) makes matching case-insensitive, replacing chr, Chr, CHR, cHr, etc. with Chr.
Advanced: Capture Groups
Capture parts of patterns and reuse them:
sed "s/\(chr[0-9]*\):\([0-9]*\)-\([0-9]*\)/\1\t\2\t\3/" coords.txtchr1 3631 5899
chr2 8456 12345
chr3 15234 18976Convert chr1:3631-5899 format to tab-separated columns. \\(pattern\\) captures, \\1 \\2 \\3 are the captured groups.
This converts genomic coordinates from:
chr1:3631-5899
To BED format:
chr1 3631 5899
Swap Columns
sed "s/\([^\t]*\)\t\([^\t]*\)/\2\t\1/" sample_info.txtControl Sample_01
Control Sample_02
Treatment Sample_03Swap first two tab-separated columns. [^\\t]* matches anything except tab. Useful for reformatting data.
Multiple Files
Process many files at once:
sed -i.bak 's/old_reference/new_reference/g' *.vcf

Edit all VCF files in place, creating .bak backups. Change the reference genome identifier in all files simultaneously.
Batch Process Multiple Samples
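The original interactive steps are not reproduced here; one common shell pattern (file names are illustrative) is a loop that writes cleaned copies rather than overwriting the originals:

# Clean FASTA headers for every sample file, writing new *_clean.fasta files
for f in sample_*.fasta; do
    sed 's/ .*//' "$f" > "${f%.fasta}_clean.fasta"
done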
Common Mistakes
Forgetting the /g Flag
echo 'Chr1 Chr2 Chr3' | sed 's/Chr/chr/'

chr1 Chr2 Chr3

Without /g, only the first match per line is replaced; Chr2 and Chr3 remain unchanged.

echo 'Chr1 Chr2 Chr3' | sed 's/Chr/chr/g'

chr1 chr2 chr3

With /g, all matches on the line are replaced.
Using -i Without Testing
sed -i 's/AT/XX/g' important_sequences.fasta

DANGER: This replaces every AT in the file, including within sequences! Always test without -i first.
Better workflow:
Safe sed Testing
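A minimal sketch of that test-first habit, reusing the example above:

# 1. Preview the effect on stdout only; nothing on disk changes
sed 's/AT/XX/g' important_sequences.fasta | head

# 2. Only after inspecting the preview, apply in place with a backup
sed -i.bak 's/AT/XX/g' important_sequences.fasta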
Not Escaping Special Characters
sed 's/file.txt/file.bak/' filenames.txt

file.bak

The unescaped . in the pattern matches ANY character, not a literal period, so this accidentally matches 'fileXtxt', 'file-txt', and so on.
sed 's/file\.txt/file.bak/' filenames.txt

file.bak

Escape the period with \. to match a literal period character.
Performance Considerations
sed is very fast, but here are tips for maximum performance:
- Limit to needed lines: Use /pattern/!d to keep only matching lines (see the example below)
- Quit early: Use q to quit after finding what you need
- Combine operations: One sed call with multiple commands is faster than multiple sed calls
- Use simple patterns: Complex regex is slower than simple string matching
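For example, to keep only FASTA header lines (a small sketch reusing the sequences.fasta file from earlier):

# ! negates the address: delete every line that does NOT start with '>'
sed '/^>/!d' sequences.fasta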
sed -n '1,1000p; 1000q' huge_file.txt

Print the first 1000 lines, then quit. Without the q command, sed would keep reading (and discarding) the rest of the file; with it, processing stops immediately, which matters enormously on huge files.
Quick Reference
sed Commands Cheat Sheet
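The commands covered in this section, at a glance:

s/old/new/          Replace the first match on each line
s/old/new/g         Replace all matches on each line
s/old/new/Ig        Case-insensitive replacement (GNU sed)
/pattern/d          Delete lines matching pattern
/^$/d               Delete blank lines
1,3d                Delete lines 1-3
-n '1,4p'           Print only lines 1-4
-e cmd1 -e cmd2     Apply multiple commands in order
-i                  Edit the file in place (no undo)
-i.bak              Edit in place, keeping a .bak backup
\(...\) and \1      Capture group and backreference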
Best Practices
- Always test without -i first - Verify output before modifying files
- Use -i.bak for backups - Create backup files when editing in place
- Quote your patterns - Prevents shell interpretation issues
- Use /g for complete replacement - Remember global flag for all matches
- Escape special characters - Use \ before . * [ ] $ ^ and other regex characters
- One sed is better than many - Combine operations for better performance
- Document complex commands - Add comments explaining what sed does
- Keep it simple - If sed gets too complex, consider awk or Python
When Not to Use sed
sed is perfect for line-by-line transformations, but consider alternatives for:
- Complex parsing - Use awk for field-based processing
- Multi-line patterns - sed struggles with patterns spanning lines
- Programming logic - Use Python/Perl for if/else, loops, variables
- Binary files - sed is for text only
Practice Exercises
Practice sed commands with genomics files
Try these exercises on evomics-learn:
- Clean FASTA headers for compatibility
- Convert file formats (FASTQ to FASTA)
- Standardize chromosome names across files
- Remove headers and comments from data files
- Batch process multiple samples with consistent edits
Next Steps
Now that you can transform text with sed, the next section covers awk - the most powerful text processing tool in UNIX. awk excels at column-based data processing, calculations, and conditional operations.
You'll learn:
- Process tabular data (GFF, VCF, BED)
- Calculate statistics on columns
- Filter based on numeric thresholds
- Reformat and extract specific fields
- Combine awk with other tools for complex workflows