Data Manipulation Tools
Beyond grep, sed, and awk, UNIX provides specialized tools for extracting columns, sorting data, finding duplicates, and combining files. These tools are essential for organizing genomics data and preparing it for analysis.
cut - Extract Columns
The cut command extracts specific columns from tab-delimited or other delimiter-separated files.
Extract Single Column
cut -f1 genes.bed
Chr1
Chr1
Chr2
Chr2
Chr3
The -f flag specifies fields (columns). -f1 extracts the first column. By default, cut uses tab as the delimiter.
Extract Multiple Columns
cut -f1,4 genes.bed
Chr1 gene1
Chr1 gene2
Chr2 gene3
Chr2 gene4
Chr3 gene5
Extract columns 1 and 4. Use a comma-separated list of field numbers.
Extract Column Range
cut -f2-4 genes.bed
1000 2000 gene1
5000 6000 gene2
3000 4000 gene3
Extract columns 2 through 4. The dash specifies a range.
Custom Delimiter
cut -d',' -f2,3 data.csv
Control,Rep1
Control,Rep2
Treatment,Rep1
The -d flag sets the delimiter. Use -d',' for CSV files.
Practical Example: Extract Gene Names
cut -f9 annotations.gff | cut -d';' -f1 | cut -d'=' -f2 | head -n 5
AT1G01010
AT1G01020
AT1G01030
AT1G01040
AT1G01050
Chain multiple cut commands to extract gene IDs from GFF attributes: first take column 9, then split on the semicolon, then split on the equals sign.
sort - Order Data
The sort command arranges lines alphabetically or numerically.
Alphabetical Sort
sort gene_names.txt
ARV1
DCL1
NAC001
NGA3
PPA1
WRKY45
Default sort is alphabetical and case-sensitive.
Numeric Sort
sort -n read_counts.txt
145
234
892
1203
2847
4521
The -n flag sorts numerically. Without -n, '1203' would come before '145' because comparison is character by character (alphabetical ordering).
Reverse Sort
sort -nr read_counts.txt | head -n 5
52345678
48234567
45123456
42567890
38456789
The -r flag reverses the sort order. Combine -n and -r to get the highest numbers first.
Sort by Specific Column
sort -k2,2n genes_with_counts.txt
gene5 145
gene3 234
gene8 892
gene1 1203
gene2 2847
-k2,2n sorts by column 2 numerically. The format is -k<start>,<end><type>.
Multi-Column Sort
sort -k1,1 -k2,2n coordinates.bed | head -n 5
Chr1 1000 2000 gene1
Chr1 5000 6000 gene2
Chr1 10000 11000 gene5
Chr2 3000 4000 gene3
Chr2 8000 9000 gene7
Sort by chromosome (column 1) alphabetically, then by start position (column 2) numerically. Use multiple -k flags for tie-breaking.
Practical Example: Find Top Expressed Genes
Identify Highest Expression
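A minimal two-step sketch, assuming a tab-delimited expression_counts.txt (hypothetical name) with gene IDs in column 1 and counts in column 2:
# Step 1: sort by the count column, highest first (assumed layout: gene<TAB>count)
sort -k2,2nr expression_counts.txt > counts_ranked.txt
# Step 2: preview the top 10 expressed genes
head -n 10 counts_ranked.txt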
Sort by Multiple Criteria
sort -k1,1 -k2,2n -k3,3nr variants.tsv | head -n 5
Chr1 12345 99 rs123 A G PASS
Chr1 12345 87 rs124 A C PASS
Chr1 23456 95 . C T PASS
Chr2 34567 92 rs456 G A PASS
Sort by chromosome, then position, then quality score (descending). Complex multi-level sorting for VCF-like data.
uniq - Find Unique Lines
The uniq command removes duplicate consecutive lines or counts occurrences.
uniq only removes consecutive duplicates. Always sort first: sort file | uniq
Remove Duplicates
sort chromosomes.txt | uniqChr1
Chr2
Chr3
Chr4
Chr5Sort first, then uniq removes duplicates. Without sort, only consecutive duplicates are removed.
Count Occurrences
cut -f1 genes.bed | sort | uniq -c
   8234 Chr1
   6543 Chr2
   5432 Chr3
   4321 Chr4
   3210 Chr5
The -c flag counts occurrences, showing how many genes are on each chromosome.
Show Only Duplicates
sort read_ids.txt | uniq -d
SRR001666.1234
SRR001666.5678
SRR001666.9012
The -d flag shows only duplicate lines (those appearing more than once). Use it to find duplicated read IDs.
Show Only Unique Lines
sort samples.txt | uniq -u
Sample_07
Sample_15
Sample_23
The -u flag shows only lines that appear exactly once. Use it to find singletons.
Practical Example: Find Duplicate Sequences
Identify Duplicate Sequences
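A minimal two-step sketch, assuming unwrapped (one line per sequence) FASTA input in sequences.fasta (hypothetical name):
# Step 1: drop the header lines, keeping only the sequences
grep -v '^>' sequences.fasta > seqs_only.txt
# Step 2: sort, then report sequences that occur more than once
sort seqs_only.txt | uniq -d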
paste - Combine Files Column-Wise
The paste command joins files side-by-side, adding columns.
Combine Two Files
paste gene_names.txt expression_values.txt
AT1G01010 145.5
AT1G01020 892.3
AT1G01030 234.7
paste joins files column-wise: the first file becomes column 1, the second file becomes column 2.
Custom Delimiter
paste -d',' file1.txt file2.txt file3.txt
Sample_01,Control,145.5
Sample_02,Treatment,234.8
Sample_03,Control,189.2
The -d flag sets the output delimiter. Use it to create a CSV from multiple files.
Serial Paste (Transpose)
paste -s gene_list.txt
AT1G01010 AT1G01020 AT1G01030 AT1G01040 AT1G01050
The -s flag pastes serially (all lines from one file onto a single line), converting a column to a row.
Practical Example: Create Sample Metadata
Build Metadata Table
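A minimal two-step sketch, assuming one-column files sample_ids.txt and conditions.txt (hypothetical names) with matching line order:
# Step 1: write a CSV header line
echo "sample,condition" > metadata.csv
# Step 2: paste the columns together and append them as CSV rows
paste -d',' sample_ids.txt conditions.txt >> metadata.csv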
join - Merge Files by Key
The join command merges files based on a common field, like a database join.
Both files must be sorted by the join field before using join.
Basic Join
join -1 1 -2 1 file1.txt file2.txt
AT1G01010 145 protein_kinase
AT1G01020 892 zinc_finger
AT1G01030 234 transcription_factor
-1 1 means use column 1 from file 1 as the key; -2 1 means use column 1 from file 2. Matching lines are combined.
Tab-Delimited Join
join -t $'\t' -1 1 -2 1 counts.txt annotations.txt | head -n 3
AT1G01010 145 NAC domain protein
AT1G01020 892 ARV1 family protein
AT1G01030 234 AP2 domain protein
-t sets the field separator to tab. Combine expression counts with gene descriptions.
Outer Join
join -a 1 -a 2 -t $'\t' file1.txt file2.txt
AT1G01010 145 NAC001
AT1G01020 892 ARV1
AT1G01030 234
AT1G01040 NGA3
-a 1 includes unpaired lines from file 1; -a 2 includes unpaired lines from file 2. Like a full outer join in SQL.
Practical Example: Add Gene Annotations to Counts
Annotate Expression Data
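A minimal four-step sketch, assuming tab-delimited counts.txt and annotations.txt (hypothetical names) keyed on the gene ID in column 1:
# Step 1: sort the counts file by gene ID
sort -k1,1 counts.txt > counts_sorted.txt
# Step 2: sort the annotation file by gene ID
sort -k1,1 annotations.txt > annotations_sorted.txt
# Step 3: join the two sorted files on the shared gene ID column
join -t $'\t' counts_sorted.txt annotations_sorted.txt > annotated_counts.txt
# Step 4: preview the result
head -n 5 annotated_counts.txt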
Combining Tools
The real power comes from chaining these tools together:
Count Feature Types in GFF
cut -f3 annotations.gff | sort | uniq -c | sort -nr
 456789 exon
 123456 CDS
  87654 gene
  45678 mRNA
  12345 five_prime_UTR
Extract the feature type column, sort, count occurrences, then sort by count. Shows the most common features first.
Find Genes with Highest Variant Density
Calculate Variants per Gene
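A minimal two-step sketch, assuming a tab-delimited annotated_variants.tsv (hypothetical name) with the gene ID in column 4:
# Step 1: count variants per gene ID
cut -f4 annotated_variants.tsv | sort | uniq -c > variants_per_gene.txt
# Step 2: rank genes by variant count, highest first
sort -nr variants_per_gene.txt | head -n 10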
Create Summary Statistics per Chromosome
Per-Chromosome Statistics
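A minimal two-step sketch using the genes.bed layout from earlier (chromosome, start, and end in columns 1-3):
# Step 1: count genes per chromosome
cut -f1 genes.bed | sort | uniq -c
# Step 2: use awk to report the mean gene length per chromosome
awk '{len[$1] += $3 - $2; n[$1]++} END {for (c in len) print c, len[c] / n[c]}' genes.bed | sort -k1,1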
Quick Reference
Data Manipulation Cheat Sheet
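A plain-text summary of the flags covered in this section:
cut -f1,4 file          # extract fields 1 and 4 (tab-delimited by default)
cut -d',' -f2 file      # extract field 2 from a CSV
sort -k2,2n file        # sort numerically by column 2
sort -nr file           # numeric sort, highest first
sort file | uniq -c     # count occurrences of each line
sort file | uniq -d     # show only duplicated lines
sort file | uniq -u     # show only lines that appear once
paste -d',' f1 f2       # combine files column-wise with a comma
join -t $'\t' f1 f2     # merge sorted files on a shared key column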
Best Practices
- Sort before uniq - uniq only works on consecutive lines
- Sort before join - Both files must be sorted by join field
- Use -k for complex sorts - Specify exact columns and types
- Test on small data - Verify logic before processing huge files
- Use head to check - Preview results before writing to files
- Combine with awk for calculations - cut extracts, awk calculates (see the sketch after this list)
- Check delimiter - Use -d for non-tab separators
- Preserve original data - Redirect to new files, don't overwrite
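A minimal illustration of the cut-plus-awk pattern, assuming a tab-delimited gene_counts.txt (hypothetical name) with read counts in column 2:
# extract the count column, then let awk compute the total and mean
cut -f2 gene_counts.txt | awk '{sum += $1} END {print "total:", sum, "mean:", sum / NR}'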
Common Pitfalls
Forgetting to Sort for uniq
uniq chromosomes.txt
Chr1
Chr2
Chr1
Chr3
Chr2
Wrong! uniq only removes consecutive duplicates, so Chr1 still appears twice because its occurrences weren't consecutive.
sort chromosomes.txt | uniq
Chr1
Chr2
Chr3
Correct! Sort first, then uniq removes all duplicates.
Wrong Sort Type
sort counts.txt
100
1234
145
234
892
Alphabetical sort treats numbers as strings: '1234' comes before '145' because comparison is character by character ('2' < '4' in the second position).
sort -n counts.txt
100
145
234
892
1234
Numeric sort (-n) handles numbers correctly.
Files Not Sorted for join
join file1.txt file2.txt
join: file1.txt:3: is not sorted
join requires both files to be sorted by the join key. Sort both files first.
Performance Tips
These tools are very efficient, but here are optimization tips:
- Use sort -S - Specify a larger buffer for huge files, e.g. sort -S 4G
- Sort in parallel - Use --parallel on multi-core systems
- Temporary directory - Set TMPDIR to fast storage for sort's temporary files
- Cut early - Extract only the needed columns before sorting to reduce data
- Use -u with sort - sort -u is faster than sort | uniq
cut -f1,2,3 huge_file.txt | sort -S 8G --parallel=8 -k1,1 -k2,2n > sorted.txt
Extract only the needed columns, then sort with an 8 GB buffer using 8 cores. Much faster than sorting all columns.
Practice Exercises
Practice data manipulation with genomics files
Try these exercises on evomics-learn:
- Extract and count unique chromosomes
- Sort genes by expression level
- Find duplicate sequences in FASTA files
- Combine annotation files with expression data
- Calculate per-chromosome statistics
Next Steps
You now have a complete toolkit for text processing: grep (search), sed (transform), awk (analyze), and these data manipulation tools (extract, sort, combine). Together, these tools let you process any tabular biological data directly from the command line.
The next major section covers working with specific biological file formats:
- FASTA and FASTQ processing
- VCF variant manipulation
- GFF/GTF annotation files
- SAM/BAM alignment files