Data Manipulation Tools
Beyond grep, sed, and awk, UNIX provides specialized tools for extracting columns, sorting data, finding duplicates, and combining files. These tools are essential for organizing genomics data and preparing it for analysis.
cut - Extract Columns
The cut command extracts specific columns from tab-delimited or other delimiter-separated files.
Extract Single Column
cut -f1 genes.bed
Chr1
Chr1
Chr2
Chr2
Chr3
The -f flag specifies fields (columns). -f1 extracts the first column. By default, cut uses tab as the delimiter.
Extract Multiple Columns
cut -f1,4 genes.bed
Chr1 gene1
Chr1 gene2
Chr2 gene3
Chr2 gene4
Chr3 gene5
Extract columns 1 and 4. Use a comma-separated list of field numbers.
Extract Column Range
cut -f2-4 genes.bed
1000 2000 gene1
5000 6000 gene2
3000 4000 gene3
Extract columns 2 through 4. The dash specifies a range.
Custom Delimiter
cut -d',' -f2,3 data.csv
Control,Rep1
Control,Rep2
Treatment,Rep1
The -d flag sets the delimiter. Use -d',' for CSV files.
Practical Example: Extract Gene Names
cut -f9 annotations.gff | cut -d';' -f1 | cut -d'=' -f2 | head -n 5
AT1G01010
AT1G01020
AT1G01030
AT1G01040
AT1G01050
Chain multiple cut commands to extract gene IDs from GFF attributes: first take column 9, then split on the semicolon, then split on the equals sign.
sort - Order Data
The sort command arranges lines alphabetically or numerically.
Alphabetical Sort
sort gene_names.txt
ARV1
DCL1
NAC001
NGA3
PPA1
WRKY45
Default sort is alphabetical and case-sensitive.
Numeric Sort
sort -n read_counts.txt
145
234
892
1203
2847
4521
The -n flag sorts numerically. Without -n, '1203' would come before '145' because comparison is character by character (alphabetical ordering).
Reverse Sort
sort -nr read_counts.txt | head -n 5
52345678
48234567
45123456
42567890
38456789
The -r flag reverses the sort order. Combine -n and -r to get the highest numbers first.
Sort by Specific Column
sort -k2,2n genes_with_counts.txt
gene5 145
gene3 234
gene8 892
gene1 1203
gene2 2847
-k2,2n sorts by column 2 numerically. The format is -k<start>,<end><type>.
Multi-Column Sort
sort -k1,1 -k2,2n coordinates.bed | head -n 5
Chr1 1000 2000 gene1
Chr1 5000 6000 gene2
Chr1 10000 11000 gene5
Chr2 3000 4000 gene3
Chr2 8000 9000 gene7
Sort by chromosome (column 1) alphabetically, then by start position (column 2) numerically. Use multiple -k flags for tie-breaking.
Practical Example: Find Top Expressed Genes
Identify Highest Expression
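A minimal two-step sketch, assuming a tab-delimited expression_counts.txt (hypothetical name) with gene IDs in column 1 and counts in column 2:
# Step 1: sort by the count column, highest first (assumed layout: gene<TAB>count)
sort -k2,2nr expression_counts.txt > counts_ranked.txt
# Step 2: preview the top 10 expressed genes
head -n 10 counts_ranked.txt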
Sort by Multiple Criteria
sort -k1,1 -k2,2n -k3,3nr variants.tsv | head -n 5
Chr1 12345 99 rs123 A G PASS
Chr1 12345 87 rs124 A C PASS
Chr1 23456 95 . C T PASS
Chr2 34567 92 rs456 G A PASS
Sort by chromosome, then position, then quality score (descending). Complex multi-level sorting for VCF-like data.
uniq - Find Unique Lines
The uniq command removes duplicate consecutive lines or counts occurrences.
uniq only removes consecutive duplicates. Always sort first: sort file | uniq
Remove Duplicates
sort chromosomes.txt | uniqChr1
Chr2
Chr3
Chr4
Chr5Sort first, then uniq removes duplicates. Without sort, only consecutive duplicates are removed.
Count Occurrences
cut -f1 genes.bed | sort | uniq -c
   8234 Chr1
   6543 Chr2
   5432 Chr3
   4321 Chr4
   3210 Chr5
The -c flag counts occurrences, showing how many genes are on each chromosome.
Show Only Duplicates
sort read_ids.txt | uniq -d
SRR001666.1234
SRR001666.5678
SRR001666.9012
The -d flag shows only duplicate lines (those appearing more than once). Use it to find duplicated read IDs.
Show Only Unique Lines
sort samples.txt | uniq -u
Sample_07
Sample_15
Sample_23
The -u flag shows only lines that appear exactly once. Use it to find singletons.
Practical Example: Find Duplicate Sequences
Identify Duplicate Sequences
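A minimal two-step sketch, assuming unwrapped (one line per sequence) FASTA input in sequences.fasta (hypothetical name):
# Step 1: drop the header lines, keeping only the sequences
grep -v '^>' sequences.fasta > seqs_only.txt
# Step 2: sort, then report sequences that occur more than once
sort seqs_only.txt | uniq -d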
paste - Combine Files Column-Wise
The paste command joins files side-by-side, adding columns.
Combine Two Files
paste gene_names.txt expression_values.txt
AT1G01010 145.5
AT1G01020 892.3
AT1G01030 234.7
paste joins files column-wise: the first file becomes column 1, the second file becomes column 2.
Custom Delimiter
paste -d',' file1.txt file2.txt file3.txt
Sample_01,Control,145.5
Sample_02,Treatment,234.8
Sample_03,Control,189.2
The -d flag sets the output delimiter. Use it to create a CSV from multiple files.
Serial Paste (Transpose)
paste -s gene_list.txt
AT1G01010 AT1G01020 AT1G01030 AT1G01040 AT1G01050
The -s flag pastes serially (all lines from one file onto a single line), converting a column to a row.
Practical Example: Create Sample Metadata
Build Metadata Table
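A minimal two-step sketch, assuming one-column files sample_ids.txt and conditions.txt (hypothetical names) with matching line order:
# Step 1: write a CSV header line
echo "sample,condition" > metadata.csv
# Step 2: paste the columns together and append them as CSV rows
paste -d',' sample_ids.txt conditions.txt >> metadata.csv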
join - Merge Files by Key
The join command merges files based on a common field, like a database join.
Both files must be sorted by the join field before using join.
Basic Join
join -1 1 -2 1 file1.txt file2.txt
AT1G01010 145 protein_kinase
AT1G01020 892 zinc_finger
AT1G01030 234 transcription_factor
-1 1 means use column 1 from file 1 as the key; -2 1 means use column 1 from file 2. Matching lines are combined.
Tab-Delimited Join
join -t $'\t' -1 1 -2 1 counts.txt annotations.txt | head -n 3
AT1G01010 145 NAC domain protein
AT1G01020 892 ARV1 family protein
AT1G01030 234 AP2 domain protein
-t sets the field separator to tab. Combine expression counts with gene descriptions.
Outer Join
join -a 1 -a 2 -t $'\t' file1.txt file2.txt
AT1G01010 145 NAC001
AT1G01020 892 ARV1
AT1G01030 234
AT1G01040 NGA3
-a 1 includes unpaired lines from file 1; -a 2 includes unpaired lines from file 2. Like a full outer join in SQL.
Practical Example: Add Gene Annotations to Counts
Annotate Expression Data
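A minimal four-step sketch, assuming tab-delimited counts.txt and annotations.txt (hypothetical names) keyed on the gene ID in column 1:
# Step 1: sort the counts file by gene ID
sort -k1,1 counts.txt > counts_sorted.txt
# Step 2: sort the annotation file by gene ID
sort -k1,1 annotations.txt > annotations_sorted.txt
# Step 3: join the two sorted files on the shared gene ID column
join -t $'\t' counts_sorted.txt annotations_sorted.txt > annotated_counts.txt
# Step 4: preview the result
head -n 5 annotated_counts.txt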
Combining Tools
The real power comes from chaining these tools together:
Count Feature Types in GFF
cut -f3 annotations.gff | sort | uniq -c | sort -nr
 456789 exon
 123456 CDS
  87654 gene
  45678 mRNA
  12345 five_prime_UTR
Extract the feature type column, sort, count occurrences, then sort by count. Shows the most common features first.
Find Genes with Highest Variant Density
Calculate Variants per Gene
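A minimal two-step sketch, assuming a tab-delimited annotated_variants.tsv (hypothetical name) with the gene ID in column 4:
# Step 1: count variants per gene ID
cut -f4 annotated_variants.tsv | sort | uniq -c > variants_per_gene.txt
# Step 2: rank genes by variant count, highest first
sort -nr variants_per_gene.txt | head -n 10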
Create Summary Statistics per Chromosome
Per-Chromosome Statistics
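A minimal two-step sketch using the genes.bed layout from earlier (chromosome, start, and end in columns 1-3):
# Step 1: count genes per chromosome
cut -f1 genes.bed | sort | uniq -c
# Step 2: use awk to report the mean gene length per chromosome
awk '{len[$1] += $3 - $2; n[$1]++} END {for (c in len) print c, len[c] / n[c]}' genes.bed | sort -k1,1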
Quick Reference
Data Manipulation Cheat Sheet
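A plain-text summary of the flags covered in this section:
cut -f1,4 file          # extract fields 1 and 4 (tab-delimited by default)
cut -d',' -f2 file      # extract field 2 from a CSV
sort -k2,2n file        # sort numerically by column 2
sort -nr file           # numeric sort, highest first
sort file | uniq -c     # count occurrences of each line
sort file | uniq -d     # show only duplicated lines
sort file | uniq -u     # show only lines that appear once
paste -d',' f1 f2       # combine files column-wise with a comma
join -t $'\t' f1 f2     # merge sorted files on a shared key column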
Best Practices
- Sort before uniq - uniq only works on consecutive lines
- Sort before join - Both files must be sorted by join field
- Use -k for complex sorts - Specify exact columns and types
- Test on small data - Verify logic before processing huge files
- Use head to check - Preview results before writing to files
- Combine with awk for calculations - cut extracts, awk calculates (see the sketch after this list)
- Check delimiter - Use -d for non-tab separators
- Preserve original data - Redirect to new files, don't overwrite
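A minimal illustration of the cut-plus-awk pattern, assuming a tab-delimited gene_counts.txt (hypothetical name) with read counts in column 2:
# extract the count column, then let awk compute the total and mean
cut -f2 gene_counts.txt | awk '{sum += $1} END {print "total:", sum, "mean:", sum / NR}'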
Common Pitfalls
Forgetting to Sort for uniq
uniq chromosomes.txt
Chr1
Chr2
Chr1
Chr3
Chr2
Wrong! uniq only removes consecutive duplicates, so Chr1 still appears twice because its occurrences weren't consecutive.
sort chromosomes.txt | uniq
Chr1
Chr2
Chr3
Correct! Sort first, then uniq removes all duplicates.
Wrong Sort Type
sort counts.txt
100
1234
145
234
892
Alphabetical sort treats numbers as strings: '1234' comes before '145' because comparison is character by character ('2' < '4' in the second position).
sort -n counts.txt
100
145
234
892
1234
Numeric sort (-n) handles numbers correctly.
Files Not Sorted for join
join file1.txt file2.txt
join: file1.txt:3: is not sorted
join requires both files to be sorted by the join key. Sort both files first.
Performance Tips
These tools are very efficient, but here are optimization tips:
- Use sort -S - Specify a larger buffer for huge files, e.g. sort -S 4G
- Sort in parallel - Use --parallel on multi-core systems
- Temporary directory - Set TMPDIR to fast storage for sort's temporary files
- Cut early - Extract only the needed columns before sorting to reduce data
- Use -u with sort - sort -u is faster than sort | uniq
cut -f1,2,3 huge_file.txt | sort -S 8G --parallel=8 -k1,1 -k2,2n > sorted.txt
Extract only the needed columns, then sort with an 8 GB buffer using 8 cores. Much faster than sorting all columns.
Practice Exercises
Practice data manipulation with genomics files
Try these exercises on evomics-learn:
- Extract and count unique chromosomes
- Sort genes by expression level
- Find duplicate sequences in FASTA files
- Combine annotation files with expression data
- Calculate per-chromosome statistics
Next Steps
You now have a complete toolkit for text processing: grep (search), sed (transform), awk (analyze), and these data manipulation tools (extract, sort, combine). Together, these tools let you process any tabular biological data directly from the command line.
The next major section covers working with specific biological file formats:
- FASTA and FASTQ processing
- VCF variant manipulation
- GFF/GTF annotation files
- SAM/BAM alignment files