Data Manipulation Tools

Beyond grep, sed, and awk, UNIX provides specialized tools for extracting columns, sorting data, finding duplicates, and combining files. These tools are essential for organizing genomics data and preparing it for analysis.

cut - Extract Columns

The cut command extracts specific columns from tab or delimiter-separated files.

Extract Single Column

Input
cut -f1 genes.bed
Output
Chr1
Chr1
Chr2
Chr2
Chr3

The -f flag specifies fields (columns). -f1 extracts the first column. By default, cut uses a tab as the delimiter.

Extract Multiple Columns

Input
cut -f1,4 genes.bed
Output
Chr1	gene1
Chr1	gene2
Chr2	gene3
Chr2	gene4
Chr3	gene5

Extract columns 1 and 4. Comma-separated list of field numbers.

Extract Column Range

Input
cut -f2-4 genes.bed
Output
1000	2000	gene1
5000	6000	gene2
3000	4000	gene3

Extract columns 2 through 4. The dash specifies a range.

Custom Delimiter

Input
cut -d',' -f2,3 data.csv
Output
Control,Rep1
Control,Rep2
Treatment,Rep1

The -d flag sets the delimiter. -d',' for CSV files.

Practical Example: Extract Gene Names

Input
cut -f9 annotations.gff | cut -d';' -f1 | cut -d'=' -f2 | head -n 5
Output
AT1G01010
AT1G01020
AT1G01030
AT1G01040
AT1G01050

Chain multiple cut commands to extract gene IDs from GFF attributes. First get column 9, then split on semicolon, then split on equals sign.

sort - Order Data

The sort command arranges lines alphabetically or numerically.

Alphabetical Sort

Input
sort gene_names.txt
Output
ARV1
DCL1
NAC001
NGA3
PPA1
WRKY45

The default sort is alphabetical and case-sensitive.

Numeric Sort

Input
sort -n read_counts.txt
Output
145
234
892
1203
2847
4521

The -n flag sorts numerically. Without -n, sort compares text character by character, so '1203' would come before '145'.

Reverse Sort

Input
sort -nr read_counts.txt | head -n 5
Output
52345678
48234567
45123456
42567890
38456789

The -r flag reverses sort order. Combine -n and -r to get highest numbers first.

Sort by Specific Column

Input
sort -k2,2n genes_with_counts.txt
Output
gene5	145
gene3	234
gene8	892
gene1	1203
gene2	2847

-k2,2n sorts by column 2 numerically. The format is -k<start>,<end><type>.

Multi-Column Sort

Input
sort -k1,1 -k2,2n coordinates.bed | head -n 5
Output
Chr1	1000	2000	gene1
Chr1	5000	6000	gene2
Chr1	10000	11000	gene5
Chr2	3000	4000	gene3
Chr2	8000	9000	gene7

Sort by chromosome (column 1) alphabetically, then by start position (column 2) numerically. Multiple -k flags for tie-breaking.

Practical Example: Find Top Expressed Genes

Identify Highest Expression

2 steps
sort -k2,2nr expression.txt | head -n 10
Output
AT3G18780	45678.5
AT5G44420	42345.2
AT1G29930	38567.8
AT2G21330	35678.9
AT4G34200	32456.7
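
The command ranks genes by the value in column 2, highest first. A possible second step (the output file names here are illustrative, not part of the original exercise) saves the full ranking and keeps just the top gene IDs:

sort -k2,2nr expression.txt > expression_ranked.txt              # full table, highest expression first
head -n 10 expression_ranked.txt | cut -f1 > top10_gene_ids.txt  # just the top 10 gene IDs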

Sort by Multiple Criteria

Input
sort -k1,1 -k2,2n -k3,3nr variants.tsv | head -n 5
Output
Chr1	12345	99	rs123	A	G	PASS
Chr1	12345	87	rs124	A	C	PASS
Chr1	23456	95	.	C	T	PASS
Chr2	34567	92	rs456	G	A	PASS

Sort by chromosome, then position, then quality score (descending). Complex multi-level sorting for VCF-like data.

uniq - Find Unique Lines

The uniq command removes duplicate consecutive lines or counts occurrences.

uniq only removes consecutive duplicates. Always sort first: sort file | uniq

Remove Duplicates

Input
sort chromosomes.txt | uniq
Output
Chr1
Chr2
Chr3
Chr4
Chr5

Sort first, then uniq removes duplicates. Without sort, only consecutive duplicates are removed.

Count Occurrences

Input
cut -f1 genes.bed | sort | uniq -c
Output
   8234 Chr1
   6543 Chr2
   5432 Chr3
   4321 Chr4
   3210 Chr5

The -c flag counts occurrences. Shows how many genes per chromosome.

Show Only Duplicates

Input
sort read_ids.txt | uniq -d
Output
SRR001666.1234
SRR001666.5678
SRR001666.9012

The -d flag shows only duplicate lines (appearing more than once). Find duplicated read IDs.

Show Only Unique Lines

Input
sort samples.txt | uniq -u
Output
Sample_07
Sample_15
Sample_23

The -u flag shows only lines that appear exactly once. Find singletons.

Practical Example: Find Duplicate Sequences

Identify Duplicate Sequences

2 steps
grep -v '^>' sequences.fasta | sort | uniq -d | head -n 3
Output
ATGGAGGATCAAGTTGGGTTTGGGTTCCGTCCGAACGAC
GCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTA
TTAATTAATTAATTAATTAATTAATTAATTAATTAATTAA
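
uniq -d reports each duplicated sequence once. To see how many times each one occurs, a possible second step swaps in uniq -c with a numeric sort (a sketch, assuming single-line sequence records as above):

grep -v '^>' sequences.fasta | sort | uniq -c | sort -nr | head -n 3   # most duplicated sequences with their counts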

paste - Combine Files Column-Wise

The paste command joins files side-by-side, adding columns.

Combine Two Files

Input
paste gene_names.txt expression_values.txt
Output
AT1G01010	145.5
AT1G01020	892.3
AT1G01030	234.7

paste joins files column-wise. First file becomes column 1, second file becomes column 2.

Custom Delimiter

Input
paste -d',' file1.txt file2.txt file3.txt
Output
Sample_01,Control,145.5
Sample_02,Treatment,234.8
Sample_03,Control,189.2

The -d flag sets the output delimiter. Create CSV from multiple files.

Serial Paste (Transpose)

Input
paste -s gene_list.txt
Output
AT1G01010	AT1G01020	AT1G01030	AT1G01040	AT1G01050

The -s flag pastes serially (all lines from one file onto one line). Converts column to row.
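
Combining -s with -d produces a single delimited line, which is handy for building comma-separated gene lists. A small sketch using the same file:

paste -s -d',' gene_list.txt   # AT1G01010,AT1G01020,AT1G01030,...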

Practical Example: Create Sample Metadata

Build Metadata Table

2 steps
paste sample_ids.txt conditions.txt replicates.txt > metadata.tsv
Output
Created metadata.tsv with 48 samples
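
A possible second step checks the result and adds a header row; the column names below are assumptions, not part of the original files:

wc -l metadata.tsv   # should report 48 lines, one per sample
printf 'sample_id\tcondition\treplicate\n' | cat - metadata.tsv > metadata_with_header.tsv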

join - Merge Files by Key

The join command merges files based on a common field, like a database join.

Both files must be sorted by the join field before using join.

Basic Join

Input
join -1 1 -2 1 file1.txt file2.txt
Output
AT1G01010 145 protein_kinase
AT1G01020 892 zinc_finger
AT1G01030 234 transcription_factor

-1 1 means use column 1 from file 1 as key. -2 1 means use column 1 from file 2 as key. Matches are combined.

Tab-Delimited Join

Input
join -t $'\t' -1 1 -2 1 counts.txt annotations.txt | head -n 3
Output
AT1G01010	145	NAC domain protein
AT1G01020	892	ARV1 family protein
AT1G01030	234	AP2 domain protein

-t sets field separator to tab. Combine expression counts with gene descriptions.

Outer Join

Input
join -a 1 -a 2 -e '' -o auto -t $'\t' file1.txt file2.txt
Output
AT1G01010	145	NAC001
AT1G01020	892	ARV1
AT1G01030	234
AT1G01040		NGA3

-a 1 includes unpaired lines from file 1; -a 2 includes unpaired lines from file 2, like a full outer join in SQL. The -o auto option (a GNU extension) together with -e '' pads missing fields with empty values so the columns stay aligned.

Practical Example: Add Gene Annotations to Counts

Annotate Expression Data

4 steps
sort -k1,1 expression_counts.txt > counts_sorted.txt
Output
Sorted 20,345 genes
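
Only the first of the four steps is shown above. A sketch of the remaining steps, assuming annotations.txt has the gene ID in its first column, as in the earlier tab-delimited join example (file names are illustrative):

sort -k1,1 annotations.txt > annotations_sorted.txt                                       # step 2: sort the annotations by the same key
join -t $'\t' -1 1 -2 1 counts_sorted.txt annotations_sorted.txt > counts_annotated.txt   # step 3: merge on gene ID
head -n 5 counts_annotated.txt                                                            # step 4: spot-check the merged table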

Combining Tools

The real power comes from chaining these tools together:

Count Feature Types in GFF

Input
cut -f3 annotations.gff | sort | uniq -c | sort -nr
Output
  456789 exon
  123456 CDS
   87654 gene
   45678 mRNA
   12345 five_prime_UTR

Extract feature type column, sort, count occurrences, sort by count. Shows most common features first.

Find Genes with Highest Variant Density

Calculate Variants per Gene

2 steps
cut -f9 variants_in_genes.gff | cut -d';' -f1 | cut -d'=' -f2 | sort | uniq -c | sort -nr | head -n 10
Output
    234 AT3G18780
    189 AT5G44420
    156 AT1G29930
    145 AT2G21330
    134 AT4G34200
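
These are raw variant counts per gene. To turn them into a true density, a possible second step divides each count by the gene length; this sketch assumes a two-column, tab-separated gene_lengths.txt (gene ID, length in bp) sorted by gene ID:

cut -f9 variants_in_genes.gff | cut -d';' -f1 | cut -d'=' -f2 | sort | uniq -c | awk '{print $2 "\t" $1}' | sort -k1,1 > variant_counts.txt
join -t $'\t' variant_counts.txt gene_lengths.txt | awk '{printf "%s\t%.2f\n", $1, $2/($3/1000)}' | sort -k2,2nr | head -n 10   # variants per kb, highest first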

Create Summary Statistics per Chromosome

Per-Chromosome Statistics

2 steps
cut -f1,4,5 genes.gff | awk '{print $1, $3-$2+1}' | sort -k1,1 -k2,2n > lengths_by_chr.txt
Output
Created lengths_by_chr.txt
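
The first step extracts chromosome, start, and end (GFF columns 1, 4, and 5) and writes one chromosome/length pair per gene. A possible second step summarises those lengths per chromosome with awk (the output format here is an assumption):

awk '{n[$1]++; sum[$1]+=$2; if ($2 > max[$1]) max[$1]=$2} END {for (c in n) printf "%s\t%d genes\tmean %.0f bp\tmax %d bp\n", c, n[c], sum[c]/n[c], max[c]}' lengths_by_chr.txt | sort -k1,1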

Quick Reference

Data Manipulation Cheat Sheet

# cut - Extract columns
cut -f1 file.txt                      # Column 1
cut -f1,3 file.txt                    # Columns 1 and 3
cut -f2-5 file.txt                    # Columns 2 through 5
cut -d',' -f2 file.csv                # Column 2 of CSV

# sort - Order data
sort file.txt                         # Alphabetical
sort -n file.txt                      # Numeric
sort -r file.txt                      # Reverse
sort -k2,2n file.txt                  # By column 2 numerically
sort -k1,1 -k2,2n file.txt            # By column 1, then 2

# uniq - Find unique lines
sort file.txt | uniq                  # Remove duplicates
sort file.txt | uniq -c               # Count occurrences
sort file.txt | uniq -d               # Show only duplicates
sort file.txt | uniq -u               # Show only unique

# paste - Combine column-wise
paste file1 file2                     # Join side-by-side
paste -d',' file1 file2               # CSV output
paste -s file.txt                     # Transpose to single line

# join - Merge by key field
join -t $'\t' -1 1 -2 1 file1 file2   # Join on column 1
join -a 1 file1 file2                 # Left outer join

# Common combinations
cut -f1 file | sort | uniq -c         # Count unique values
sort -k2,2nr file | head -n 10        # Top 10 by column 2
cut -f1,3 file | paste - file2        # Extract and combine
Format Details

  • cut: Extract specific columns
  • sort: Order lines alphabetically or numerically
  • uniq: Remove or count duplicates (must sort first)
  • paste: Join files column-wise
  • join: Database-style join on key field
  • Combos: Powerful pipelines combining tools

Best Practices

Data Manipulation Best Practices
  1. Sort before uniq - uniq only works on consecutive lines
  2. Sort before join - Both files must be sorted by join field
  3. Use -k for complex sorts - Specify exact columns and types
  4. Test on small data - Verify logic before processing huge files
  5. Use head to check - Preview results before writing to files
  6. Combine with awk for calculations - cut extracts, awk calculates (see the sketch after this list)
  7. Check delimiter - Use -d for non-tab separators
  8. Preserve original data - Redirect to new files, don't overwrite
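
As an example of point 6, a minimal sketch that combines cut and awk to turn BED coordinates into gene lengths (column positions assume the genes.bed layout used earlier):

cut -f2,3 genes.bed | awk '{print $2 - $1}' | sort -n | head -n 5   # shortest gene lengths first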

Common Pitfalls

Forgetting to Sort for uniq

Input
uniq chromosomes.txt
Output
Chr1
Chr2
Chr1
Chr3
Chr2

Wrong! uniq only removes consecutive duplicates. Chr1 appears twice because occurrences weren't consecutive.

Input
sort chromosomes.txt | uniq
Output
Chr1
Chr2
Chr3

Correct! Sort first, then uniq removes all duplicates.

Wrong Sort Type

Input
sort counts.txt
Output
100
1234
145
234
892

Alphabetical sort treats numbers as strings: '1234' comes before '145' because the comparison is character by character and '2' < '4'.

Input
sort -n counts.txt
Output
100
145
234
892
1234

Numeric sort (-n) handles numbers correctly.

Files Not Sorted for join

Input
join file1.txt file2.txt
Output
join: file1.txt:3: is not sorted

join requires both files to be sorted by the join key. Sort both files first.
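
One way to fix this without creating intermediate files is to sort both inputs inline with process substitution (bash-specific syntax):

join -t $'\t' <(sort -k1,1 file1.txt) <(sort -k1,1 file2.txt) | head -n 5   # sort both files by the join key on the fly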

Performance Tips

These tools are very efficient, but here are optimization tips:

Performance Optimization
  1. Use sort -S - Specify buffer size for huge files: sort -S 4G
  2. Sort in parallel - Use --parallel for multi-core systems
  3. Temporary directory - Set TMPDIR to fast storage for sort
  4. Cut early - Extract needed columns before sorting to reduce data
  5. Use -u with sort - sort -u is faster than sort | uniq
Input
cut -f1,2,3 huge_file.txt | sort -S 8G --parallel=8 -k1,1 -k2,2n > sorted.txt

Extract only needed columns, sort with 8GB buffer using 8 cores. Much faster than sorting all columns.

Practice Exercises

Practice in evomics-learn

Practice data manipulation with genomics files

Try these exercises on evomics-learn:

  1. Extract and count unique chromosomes
  2. Sort genes by expression level
  3. Find duplicate sequences in FASTA files
  4. Combine annotation files with expression data
  5. Calculate per-chromosome statistics

Next Steps

You now have a complete toolkit for text processing: grep (search), sed (transform), awk (analyze), and these data manipulation tools (extract, sort, combine). Together, these tools let you process any tabular biological data directly from the command line.

The next major section covers working with specific biological file formats:

  • FASTA and FASTQ processing
  • VCF variant manipulation
  • GFF/GTF annotation files
  • SAM/BAM alignment files

Further Reading