Evomics Docs

Viewing File Contents

Bioinformatics files are often enormous: a single FASTQ file can contain hundreds of millions of lines. Most text editors cannot open files this large, or become unusably slow when they try. Instead, you use command-line tools to peek at contents, search through them, and extract specific sections.

These tools read files without loading the entire file into memory. This lets you examine 100 GB files on a laptop with 8 GB of RAM.

head - View the Beginning

The head command shows the first few lines of a file:

Input
head sequences.fasta
Output
>AT1G01010.1 | NAC001 | NAC domain protein | chr1:3631-5899 REVERSE
ATGGAGGATCAAGTTGGGTTTGGGTTCCGTCCGAACGACGAGGAGCTCGTTGGTCACTAT
CTCCGTAACAAAATCGAAGGAAACACTAGCCGCGACGTTGAAGTAGCCATCAGCGAGGTA
GCTCACGGCTTTGTCGGGCAGATCATTGAGCTAGTAGGAGGTTTCACGGGCATCAACCAA
>AT1G01020.1 | ARV1 | ARV1 family protein | chr1:6788-9130 FORWARD
ATGAACACGAAGGACCACCAGATCACCCAAGTACCACCGCCCCACCTCTCTTCCCACCAA
AATCCAATCAAAACGATAGTTTCTCCAACCAACCCATCTCCAACAACTTTAACTTCTTCT
>AT1G01030.1 | NGA3 | AP2 domain protein | chr1:11649-13714 FORWARD
ATGGCGGATGCTTCACCTTCTTCTCCCCTCGCCGCCTTCCTCTCCAACTCCTTCAGAGAG
GTCGGCGGTGGCGGTGGTGGCTTTCTCTCCGAAGGTGCCGGTGCCGGAGCTCCTCCTCCA

By default, head shows the first 10 lines. Perfect for checking file format before processing.

Specify Number of Lines

Input
head -n 4 sequences.fasta
Output
>AT1G01010.1 | NAC001 | NAC domain protein | chr1:3631-5899 REVERSE
ATGGAGGATCAAGTTGGGTTTGGGTTCCGTCCGAACGACGAGGAGCTCGTTGGTCACTAT
CTCCGTAACAAAATCGAAGGAAACACTAGCCGCGACGTTGAAGTAGCCATCAGCGAGGTA
GCTCACGGCTTTGTCGGGCAGATCATTGAGCTAGTAGGAGGTTTCACGGGCATCAACCAA

The -n flag sets exactly how many lines to show. Here, four lines cover just the first sequence record in the FASTA file: one header line plus three lines of sequence.

Use head -n 4 on FASTQ files to see exactly one complete read (4 lines per read).
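
Because each read takes four lines, the same idea extends to any number of reads: ask head for 4 × N lines. A minimal sketch, assuming a tiny made-up file called demo.fastq (hypothetical, created here only so the example runs):

```shell
# Create a tiny three-read FASTQ file (made-up data for illustration).
workdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n@read3\nCCGA\n+\nIIII\n' > "$workdir/demo.fastq"

# Show the first N reads: 4 lines per read, so 4 * N lines in total.
n_reads=2
first_reads=$(head -n $(( 4 * n_reads )) "$workdir/demo.fastq")
echo "$first_reads"
```

Change n_reads to preview more or fewer records; the output always ends on a complete read as long as the file itself is well formed.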

Practical Example: Check FASTQ Quality

Input
head -n 4 sample.fastq
Output
@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC

This is one complete FASTQ record for a 36 bp read. The quality line is mostly 'I', which corresponds to Phred 40 in the standard Phred+33 encoding and indicates high-quality base calls.
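
To see where the Phred score for a quality character comes from: in Phred+33 encoding, the score is the character's ASCII code minus 33. A minimal sketch using the shell's printf; the helper name phred is made up for this example, and the offset of 33 is an assumption that holds for current Illumina output but not for some older encodings:

```shell
# Convert a FASTQ quality character to a Phred score (Phred+33 assumed).
phred() {
    # printf '%d' "'X" prints the ASCII code of the character X.
    ascii=$(printf '%d' "'$1")
    echo $(( ascii - 33 ))
}

q_I=$(phred I)   # the most common character in the record above
q_9=$(phred 9)   # one of the lower-quality positions near the end of the read
echo "I -> Q$q_I, 9 -> Q$q_9"
```

Under that assumption, 'I' (ASCII 73) decodes to 40 and '9' (ASCII 57) decodes to 24.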

Check Multiple Files

Input
head -n 1 *.fasta
Output
==> genome_chr1.fasta <==
>Chr1 CHROMOSOME dumped from ADB: Jun/20/09 14:53:46

==> genome_chr2.fasta <==
>Chr2 CHROMOSOME dumped from ADB: Jun/20/09 14:53:52

==> genome_chr3.fasta <==
>Chr3 CHROMOSOME dumped from ADB: Jun/20/09 14:53:57

Check the headers of multiple FASTA files at once. head automatically shows which file each output comes from.

tail - View the End

The tail command shows the last few lines:

Input
tail analysis.log
Output
[2025-11-20 14:23:15] Alignment phase complete
[2025-11-20 14:23:18] Starting post-alignment QC
[2025-11-20 14:28:42] QC checks passed
[2025-11-20 14:28:45] Writing output files
[2025-11-20 14:32:10] Analysis complete
[2025-11-20 14:32:10] Total runtime: 2 hours 15 minutes
[2025-11-20 14:32:10] Output: aligned_sorted.bam
[2025-11-20 14:32:10] Exit status: SUCCESS

Check the end of a log file to see whether an analysis completed successfully. By default, tail shows the last 10 lines (or the whole file, if it has fewer).

Follow Growing Files

Input
tail -f alignment.log
Output
[2025-11-20 15:30:12] Aligning sample_01_R1.fastq.gz
[2025-11-20 15:30:45] 10% complete (4.2M reads aligned)
[2025-11-20 15:31:18] 20% complete (8.5M reads aligned)
[2025-11-20 15:31:52] 30% complete (12.7M reads aligned)
...

The -f flag follows the file, showing new lines as they are written. Perfect for monitoring long-running jobs in real time. Press Ctrl+C to stop.

Monitor Running Jobs

Use tail -f to watch log files from running analyses. You can see progress, catch errors early, and know when jobs complete without repeatedly checking.

Show Specific Number of Lines

Input
tail -n 3 gene_counts.txt
Output
AT5G67590	2847
AT5G67600	4521
AT5G67610	1203

Show just the last 3 lines. Useful to see the end of data files without scrolling through entire output.
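
tail can also count from the top of the file: tail -n +N prints everything from line N onward, which is handy for skipping a header row. A small sketch with a made-up table (the file name counts_with_header.txt is hypothetical):

```shell
# Build a small tab-separated table with one header row (made-up data).
workdir=$(mktemp -d)
printf 'gene\tcount\nAT5G67590\t2847\nAT5G67600\t4521\nAT5G67610\t1203\n' > "$workdir/counts_with_header.txt"

# Print from line 2 onward, i.e. everything except the header row.
body=$(tail -n +2 "$workdir/counts_with_header.txt")
echo "$body"
```

Note the plus sign: tail -n 2 would print the last two lines, while tail -n +2 starts at line 2.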

Practical Example: Check Pipeline Progress

Monitor Multi-Sample Pipeline

Input
tail -n 1 logs/sample_*.log
Output
==> logs/sample_01.log <==
[2025-11-20 15:45:23] Alignment complete (45.2M reads, 92.3% mapped)

==> logs/sample_02.log <==
[2025-11-20 15:50:12] Alignment complete (48.7M reads, 91.8% mapped)

==> logs/sample_03.log <==
[2025-11-20 15:32:18] ERROR: Out of memory during alignment
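
The last line of each log shows at a glance that sample_03 failed. With many samples, one way to list only the failures is to let grep report which log files contain an error message. A sketch, assuming failures are marked with the word ERROR as in the output above; the log files are recreated here with made-up content so the example runs on its own:

```shell
# Recreate three small log files in a temporary directory.
workdir=$(mktemp -d)
mkdir "$workdir/logs"
echo '[2025-11-20 15:45:23] Alignment complete (45.2M reads, 92.3% mapped)' > "$workdir/logs/sample_01.log"
echo '[2025-11-20 15:50:12] Alignment complete (48.7M reads, 91.8% mapped)' > "$workdir/logs/sample_02.log"
echo '[2025-11-20 15:32:18] ERROR: Out of memory during alignment' > "$workdir/logs/sample_03.log"

# grep -l prints only the names of files that contain a match.
failed=$(cd "$workdir" && grep -l 'ERROR' logs/sample_*.log)
echo "$failed"
```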

cat - Concatenate and Display

The cat command displays entire file contents:

Input
cat small_file.txt
Output
Sample	Condition	Replicate
Sample_01	Control	1
Sample_02	Control	2
Sample_03	Treatment	1
Sample_04	Treatment	2

cat prints the entire file to your terminal. Good for small files like metadata tables.

Never use cat on large files. A 100 GB BAM file will flood your terminal with binary garbage. Use head, tail, or less for large files.

Concatenate Multiple Files

Input
cat file1.txt file2.txt file3.txt > combined.txt

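This is cat's original purpose: concatenating files. The command joins the three files, in the order given, into one. The > redirects the output to combined.txt, creating the file or overwriting it if it already exists.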
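
The difference between > and >> matters when you build a file in several steps. A small sketch with made-up files, run in a temporary directory:

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'a\nb\n' > file1.txt
printf 'c\n' > file2.txt

cat file1.txt file2.txt > combined.txt   # > creates (or overwrites) combined.txt
cat file2.txt >> combined.txt            # >> appends to the existing file

combined_lines=$(wc -l < combined.txt | tr -d ' ')
echo "$combined_lines"
```

After both commands, combined.txt holds the two lines of file1.txt followed by the single line of file2.txt twice.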

Display with Line Numbers

Input
cat -n gene_list.txt
Output
     1	AT1G01010
     2	AT1G01020
     3	AT1G01030
     4	AT1G01040
     5	AT1G01050
     6	AT1G01060

The -n flag adds line numbers. Useful for referencing specific lines in data files.

Practical Example: Combine Sample Files

Input
cat sample_01_counts.txt sample_02_counts.txt sample_03_counts.txt > all_samples_counts.txt

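Combine count files from multiple samples into one master file for downstream analysis in R or Python. Note that cat simply stacks the files end to end: it does not merge rows by gene, and if each file begins with a header line, that header is repeated in the combined file.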
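
If each count file starts with a header row, one way to keep a single header is to take it from the first file with head and then append only the data rows of every file with tail. A sketch under that assumption; the two count files and their contents are made up for illustration:

```shell
# Two small count files, each with its own header row (made-up data).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'gene\tcount\nAT1G01010\t12\n' > sample_01_counts.txt
printf 'gene\tcount\nAT1G01010\t30\n' > sample_02_counts.txt

# Write the header once, taken from the first file.
head -n 1 sample_01_counts.txt > all_samples_counts.txt

# Append the data rows of each file; tail -n +2 starts printing at line 2.
for f in sample_*_counts.txt; do
    tail -n +2 "$f" >> all_samples_counts.txt
done
cat all_samples_counts.txt
```

The rows are still stacked, not joined by gene, so downstream code needs some other way to tell the samples apart.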

less - Interactive File Viewer

The less command is the best way to explore large files. It loads only what you're viewing, not the entire file.

Input
less large_alignment.sam
Output
@HD	VN:1.6	SO:coordinate
@SQ	SN:Chr1	LN:30427671
@SQ	SN:Chr2	LN:19698289
@SQ	SN:Chr3	LN:23459830
@PG	ID:STAR	PN:STAR	VN:2.7.10a
SRR001666.1	0	Chr1	3631	255	36M	*	0	0	GGGTGATGGCCG...
SRR001666.2	16	Chr1	3845	255	36M	*	0	0	ATCGATCGATCG...
:                                    <-- less shows : prompt at bottom

less opens the file in an interactive pager. Use arrow keys to scroll, / to search, q to quit. The file is not loaded entirely into memory.

Essential less Navigation

less Keyboard Commands

# Navigation: move through the file
Space      # Next page
b          # Previous page
↓ or j     # Down one line
↑ or k     # Up one line
G          # Jump to end of file
g          # Jump to beginning of file
50G        # Jump to line 50

# Search: find specific content within the file
/pattern   # Search forward for pattern
?pattern   # Search backward for pattern
n          # Next match
N          # Previous match

# Display options: change how content is displayed
-N         # Show line numbers
-S         # Disable line wrapping (useful for wide data)

# Exit: quit the viewer
q          # Exit less

Practical Example: Explore BAM File

Investigate Alignment File

Input
samtools view -h alignments.bam | less -S
Output
Opens interactive viewer showing SAM format with each alignment on one line

Use less -S for files with very long lines (SAM files, VCF files, wide tables). This prevents line wrapping and makes columnar data much easier to read.

wc - Word Count

The wc command counts lines, words, and bytes:

Input
wc genome.fasta
Output
  123456   123456  3456789012 genome.fasta

Output format: lines, words, bytes, filename. This genome file has 123,456 lines and is about 3.5 GB (3,456,789,012 bytes).

Count Lines Only

Input
wc -l gene_list.txt
Output
  27655 gene_list.txt

The -l flag counts only lines. The Arabidopsis genome has 27,655 genes in this annotation.

Count Sequences in FASTA

Input
grep -c '^>' sequences.fasta
Output
5432

Count FASTA sequences by counting header lines (starting with >). This file contains 5,432 sequences.

Count Reads in FASTQ

Input
echo $(( $(wc -l < reads.fastq) / 4 ))
Output
52345678

FASTQ files have 4 lines per read. Divide line count by 4 to get read count. This file has 52 million reads.
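
Integer division silently discards any remainder, so it can be worth checking that the line count really is a multiple of 4; a leftover often points to a truncated file. A minimal sketch on a made-up two-read file (the path and contents are hypothetical):

```shell
# A tiny two-read FASTQ file (made-up data).
workdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > "$workdir/reads.fastq"

lines=$(wc -l < "$workdir/reads.fastq")
reads=$(( lines / 4 ))       # complete 4-line records
leftover=$(( lines % 4 ))    # lines that do not form a complete record
echo "reads: $reads"
if [ "$leftover" -ne 0 ]; then
    echo "warning: $leftover extra line(s); file may be truncated" >&2
fi
```

This only checks the line count; it does not verify that each record is valid FASTQ.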

Count Multiple Files

Input
wc -l *.txt
Output
   1245 sample_01_counts.txt
   1245 sample_02_counts.txt
   1245 sample_03_counts.txt
   3735 total

Count lines in multiple files. wc automatically shows individual counts and a total.

Practical Workflows

Workflow 1: Validate Download

Verify Downloaded Sequencing Data

Input
ls -lh sample.fastq.gz
Output
-rw-r--r-- 1 user group 2.3G Nov 20 14:23 sample.fastq.gz
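
File size alone does not prove a download is intact. Two further checks that are often useful are sketched below: gzip -t tests the integrity of the compressed data, and gunzip -c piped to head previews the first record. This is an illustrative sketch on a made-up file, not necessarily the same steps as the original workflow:

```shell
# Create a small gzip-compressed FASTQ file (made-up data).
workdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' > "$workdir/sample.fastq"
gzip "$workdir/sample.fastq"          # produces sample.fastq.gz

# gzip -t tests the compressed file's integrity without writing any output.
if gzip -t "$workdir/sample.fastq.gz"; then
    gz_status="ok"
else
    gz_status="corrupt"
fi
echo "integrity: $gz_status"

# Peek at the first record without writing a decompressed copy to disk.
first_record=$(gunzip -c "$workdir/sample.fastq.gz" | head -n 4)
echo "$first_record"
```

If the data provider publishes checksums, comparing against those is a stronger check than either of these.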

Workflow 2: Quick QC Check

Fast Quality Assessment

Input
head -n 4 sample.fastq
Output
@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=72
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACCAAGTTATCCAGCCTGGAAGATGGCGACGCAGACCGACGCG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC

Workflow 3: Compare Sample Depths

Check Sequencing Depth Across Samples

Input
for file in *.fastq.gz; do printf '%s: ' "$file"; gunzip -c "$file" | wc -l | awk '{print $1/4}'; done
Output
sample_01.fastq.gz: 52345678
sample_02.fastq.gz: 48234567
sample_03.fastq.gz: 51123456
sample_04.fastq.gz: 49876543

Combining Commands with Pipes

The real power comes from chaining these tools together:

Input
head -n 1000 large_file.txt | tail -n 10
Output
Line 991
Line 992
Line 993
Line 994
Line 995
Line 996
Line 997
Line 998
Line 999
Line 1000

Get lines 991-1000. First take the first 1000 lines, then take the last 10 of those.
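
The same slice can be taken with a single sed command, which some people find easier to read for arbitrary line ranges. A sketch, assuming a made-up 2,000-line file generated with seq so the example runs on its own:

```shell
# Generate a file whose lines read "Line 1" ... "Line 2000".
workdir=$(mktemp -d)
seq 1 2000 | sed 's/^/Line /' > "$workdir/large_file.txt"

# -n suppresses default printing; '991,1000p' prints only that range;
# '1000q' quits after line 1000 so the rest of the file is never read.
slice=$(sed -n '991,1000p;1000q' "$workdir/large_file.txt")
echo "$slice"
```

Both approaches stop reading early, so they stay fast even when the lines you want sit near the top of a very large file.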

Input
cat *.txt | wc -l
Output
  125678

Combine all text files and count total lines across all of them.

Input
grep '^>' sequences.fasta | head -n 5
Output
>AT1G01010.1 | NAC001 | NAC domain protein
>AT1G01020.1 | ARV1 | ARV1 family protein
>AT1G01030.1 | NGA3 | AP2 domain protein
>AT1G01040.1 | DCL1 | Dicer-like protein
>AT1G01050.1 | PPA1 | Protein phosphatase 2A

Extract FASTA headers (lines starting with >) and show the first 5.

Pipes (|) send output from one command as input to the next. This is fundamental to UNIX philosophy: combine simple tools to solve complex problems.
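
As one more illustration of chaining small tools, the pipeline below counts the total number of bases in a FASTA file. It is a sketch on a made-up file; tr, which deletes characters here, is not otherwise covered in this section:

```shell
# A tiny FASTA file with two sequences (made-up data).
workdir=$(mktemp -d)
printf '>seq1\nACGT\nAC\n>seq2\nGGG\n' > "$workdir/sequences.fasta"

# grep -v drops header lines, tr -d removes newlines, wc -c counts what is left.
total_bases=$(grep -v '^>' "$workdir/sequences.fasta" | tr -d '\n' | wc -c | tr -d ' ')
echo "$total_bases"
```

Each stage does one simple job, and none of them needs to hold the whole file in memory.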

Binary Files

Some bioinformatics formats are binary (BAM, BCF, compressed files). You cannot view them directly.

Input
head alignments.bam
Output
BAM☻↑☺☺☺À@HD VN:1.6 SO:coordinate@SQ SN:Chr1 LN:30427671@SQ...

Binary files show garbage characters. Never use regular viewing commands on binary formats.

Solution: Use format-specific tools:

Viewing Binary Bioinformatics Files

# BAM files (binary alignment): samtools converts to readable SAM format
samtools view -h file.bam | less

# BCF files (binary VCF): bcftools converts to readable VCF format
bcftools view file.bcf | less

# Compressed files (.gz): gunzip -c or zcat decompresses on the fly
gunzip -c file.fastq.gz | head
zcat file.fastq.gz | head    # alternative
zless file.fastq.gz          # interactive viewing

# HDF5 files: h5dump converts to a text representation
h5dump file.h5 | less

Always pipe the output of binary-to-text converters through less or head. If you send it straight to your terminal, millions of lines of text will scroll past before you can read any of them.

Quick Reference

File Viewing Commands Cheat Sheet

# head: quick peek at start of file
head file.txt                    # First 10 lines
head -n 20 file.txt              # First 20 lines
head -n 4 file.fastq             # First FASTQ record

# tail: check end of file or monitor logs
tail file.txt                    # Last 10 lines
tail -n 20 file.txt              # Last 20 lines
tail -f running.log              # Follow growing file (Ctrl+C to stop)

# cat: display or concatenate small files
cat file.txt                     # Print entire file (small files only!)
cat -n file.txt                  # Print with line numbers
cat file1 file2 > merged         # Concatenate files

# less: interactive viewing of large files
less file.txt                    # Interactive pager (recommended for large files)
less -S file.txt                 # No line wrapping
less -N file.txt                 # Show line numbers

# wc: count lines, words, or bytes
wc file.txt                      # Lines, words, bytes
wc -l file.txt                   # Lines only
wc -w file.txt                   # Words only
wc -c file.txt                   # Bytes only

# Compressed files: view without extracting to disk
gunzip -c file.gz | head         # Decompress on the fly; no decompressed copy is written
zcat file.gz | less              # Interactive viewing of compressed file
zless file.gz                    # Alternative compressed viewer

# Binary bioinformatics files: use format-specific tools
samtools view file.bam | less    # View BAM file
bcftools view file.bcf | less    # View BCF file

Best Practices

File Viewing Best Practices
  1. Use less for large files: Never cat a multi-GB file
  2. Check compressed files without extracting: Use gunzip -c or zcat with pipes
  3. Monitor long jobs: Use tail -f on log files
  4. Verify downloads: head and tail to check file format and completeness
  5. Count before processing: wc -l to know dataset size
  6. Use format-specific tools: samtools, bcftools for binary formats
  7. Preview before full run: head -n 1000 to test pipelines on small data
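
For the last practice in the list, a small test input can be cut from the top of a large file with head. The sketch below uses a made-up five-read file (full.fastq and subset.fastq are hypothetical names) and keeps whole reads by taking a multiple of 4 lines:

```shell
# Made-up input: 5 reads, 4 lines each.
workdir=$(mktemp -d)
for i in 1 2 3 4 5; do printf '@read%s\nACGT\n+\nIIII\n' "$i"; done > "$workdir/full.fastq"

# Keep the first 3 reads (3 * 4 = 12 lines) as a small test input.
head -n 12 "$workdir/full.fastq" > "$workdir/subset.fastq"
subset_reads=$(( $(wc -l < "$workdir/subset.fastq") / 4 ))
echo "subset contains $subset_reads reads"
```

A subset taken from the top of a file is fine for checking that a pipeline runs end to end, but it is not a random sample, so it should not be used for statistics about the whole dataset.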

Practice Exercises

Practice in evomics-learn

Practice file viewing commands interactively

Try these exercises on evomics-learn:

  1. Explore FASTA files with head and tail
  2. Count sequences in genomics files
  3. Monitor a simulated analysis log with tail -f
  4. Use less to search through annotation files
  5. Combine commands with pipes

Next Steps

You now have the fundamental skills for terminal navigation and file manipulation. The next major section covers text processing - the real power of the command line for bioinformatics.

You'll learn:

  • grep - Search for patterns in files
  • sed - Transform and edit text streams
  • awk - Process structured data
  • cut, sort, uniq - Extract and organize data

These tools let you process genomics files without writing programs, directly from the command line.

Further Reading