SAMtools

Original author	Heng Li
Developers	John Marshall and Petr Danecek et al
Initial release	2009
Stable release	1.21 / September 12, 2024
Repository	github.com/samtools/samtools
Written in	C
Operating system	Unix-like
Type	Bioinformatics
License	BSD, MIT
Website	www.htslib.org

SAMtools is a set of utilities for interacting with and post-processing short DNA sequence read alignments in the SAM (Sequence Alignment/Map), BAM (Binary Alignment/Map) and CRAM formats, written by Heng Li. These files are generated as output by short read aligners like BWA. Both simple and advanced tools are provided, supporting complex tasks like variant calling and alignment viewing as well as sorting, indexing, data extraction and format conversion.^[3]

SAM files are human-readable text files and can be very large (tens of gigabytes is common), so compression is used to save space. BAM files are their compressed binary equivalent and are more efficient for software to work with than SAM, whilst CRAM files use a restructured, column-oriented binary container format. SAMtools makes it possible to work directly with a compressed BAM file, without having to uncompress the whole file. Additionally, since the format for a SAM/BAM file is somewhat complex (containing reads, references, alignments, quality information, and user-specified annotations), SAMtools reduces the effort needed to use SAM/BAM files by hiding low-level details.
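The difference between the three formats is visible in the first few bytes of a file. The following Python sketch is an illustration only and is not part of SAMtools: it assumes that SAM text begins with an "@" header line or a tab-separated record, that BAM is gzip-compatible compressed data whose uncompressed content starts with the bytes "BAM\1", and that CRAM starts with the literal bytes "CRAM". The function name sniff_format and the in-memory test data are made up for the example.

```python
import gzip
import zlib


def sniff_format(data: bytes) -> str:
    """Guess whether the leading bytes of a file look like SAM, BAM, or CRAM."""
    if data[:4] == b"CRAM":
        return "CRAM"
    if data[:2] == b"\x1f\x8b":  # gzip member header, also used by BAM blocks
        # Inflate only the first four bytes; wbits=31 makes zlib expect gzip framing.
        head = zlib.decompressobj(wbits=31).decompress(data, 4)
        return "BAM" if head == b"BAM\x01" else "gzip (not BAM)"
    first_line = data.split(b"\n", 1)[0]
    if data[:1] == b"@" or b"\t" in first_line:
        return "SAM"
    return "unknown"


# Hypothetical in-memory stand-ins for real files.
fake_bam = gzip.compress(b"BAM\x01" + b"\x00" * 16)
fake_sam = b"@HD\tVN:1.6\tSO:coordinate\n"
fake_cram = b"CRAM\x03\x00"

print(sniff_format(fake_bam), sniff_format(fake_sam), sniff_format(fake_cram))
```

Only the start of the compressed data is inflated, which mirrors the point made above: a reader does not need to uncompress the whole file to begin working with it.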

As third-party projects were trying to use code from SAMtools despite it not being designed to be embedded in that way, the decision was taken in August 2014 to split the SAMtools package into a stand-alone software library with a well-defined API (HTSlib),^[4] a project for variant calling and manipulation of variant data (BCFtools), and the stand-alone SAMtools package for working with sequence alignment data.^[5]

Usage and commands

Like many Unix commands, SAMtools commands follow a stream model, where data runs through each command as if carried on a conveyor belt. This allows multiple commands to be combined into a data processing pipeline. Although the final output can be very complex, only a limited number of simple commands are needed to produce it. If no input or output file is specified, the standard streams (stdin, stdout, and stderr) are assumed. Data sent to stdout are printed to the screen by default but are easily redirected to a file using the normal Unix redirectors (> and >>), or to another command via a pipe (|).
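The stream model can be imitated in a few lines of Python, with each generator acting as one stage on the conveyor belt. This is a minimal sketch, not how SAMtools is implemented: the stage names drop_unmapped and min_mapq and the records are invented, and it assumes tab-separated SAM text in which header lines start with "@", the FLAG is the second column, and the mapping quality is the fifth.

```python
from typing import Iterable, Iterator


def drop_unmapped(lines: Iterable[str]) -> Iterator[str]:
    """Pass header lines through; drop records whose FLAG has bit 0x4 (unmapped) set."""
    for line in lines:
        if line.startswith("@") or not int(line.split("\t")[1]) & 0x4:
            yield line


def min_mapq(lines: Iterable[str], threshold: int) -> Iterator[str]:
    """Pass header lines through; keep records with MAPQ of at least threshold."""
    for line in lines:
        if line.startswith("@") or int(line.split("\t")[4]) >= threshold:
            yield line


# Made-up records; only the first five columns matter for these stages.
sam = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r3\t0\tchr1\t20\t5\t4M\t*\t0\t0\tACGT\tIIII",
]

# Stages are chained like commands joined by pipes; each line flows through in turn.
kept = list(min_mapq(drop_unmapped(sam), threshold=30))
print([line.split("\t")[0] for line in kept])
```

Because each stage consumes and yields one line at a time, no stage needs to hold the whole data set in memory, which is the property that makes pipes practical for very large alignment files.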

SAMtools commands

SAMtools provides the following commands, each invoked as samtools <subcommand>:

view: The view command filters SAM or BAM formatted data. Its options and arguments determine which data to select (possibly all of it), and only that data is passed through. Input is usually a SAM or BAM file specified as an argument, but can also be SAM or BAM data piped from another command. Possible uses include extracting a subset of data into a new file, converting between BAM and SAM formats, and simply looking at the raw file contents. The order of extracted reads is preserved.
sort: The sort command sorts the alignments in a BAM file by their position in the reference. The sort key is the reference element together with the leftmost coordinate at which the read aligns; the -n option sorts by read name instead. In current versions the sorted output is written to stdout by default and can be directed to a file with the -o option. As sorting is memory intensive and BAM files can be large, the -m option limits the memory used per thread; when the input does not fit within that limit, sorted chunks are written to temporary files, which are then merged to produce the complete sorted output.
index: The index command creates a new index file that allows fast look-up of data in a coordinate-sorted BAM or CRAM file. Like an index on a database, the generated *.bam.bai (or *.cram.crai) file allows programs that can read it to work more efficiently with the data in the associated file.
tview: The tview command starts an interactive ASCII-based viewer that can be used to visualize how reads are aligned to specified small regions of the reference genome. Compared to a graphics-based viewer like IGV,^[6] it has few features. Within the view, it is possible to jump to different positions along reference elements (using 'g') and display help information ('?').
mpileup: The mpileup command produces a pileup format file giving, for each genomic coordinate, the overlapping read bases and indels at that position in the input BAM file(s). This can be used for SNP calling, for example. Older versions could also write BCF; that function has since moved to bcftools mpileup.
flagstat: The flagstat command counts the number of alignments for each FLAG type and prints a summary.
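Several of these commands select or count alignments by the bits of the FLAG field. The sketch below decodes a FLAG value and tallies a handful of records in the spirit of flagstat. It is a simplified illustration that does not reproduce flagstat's categories or output format; the bit values follow the SAM specification, while the short labels and the function names decode_flag and tally_flags are informal choices made for this example.

```python
from collections import Counter

# Bit values as defined in the SAM specification; the labels are informal.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "read1",
    0x80: "read2",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag: int) -> list:
    """Return the labels of all bits set in a FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def tally_flags(sam_lines) -> Counter:
    """Count records per set FLAG bit, plus totals, skipping '@' header lines."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])
        counts["total"] += 1
        if not flag & 0x4:
            counts["mapped"] += 1
        counts.update(decode_flag(flag))
    return counts


# Made-up records containing only the columns needed here (QNAME and FLAG).
records = ["@HD\tVN:1.6", "r1\t99", "r1\t147", "r2\t4", "r3\t1024"]
print(decode_flag(99))
print(tally_flags(records)["mapped"], "of", tally_flags(records)["total"], "mapped")
```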

Examples

view: samtools view sample.bam > sample.sam

Convert a BAM file into a SAM file. The header is omitted from the output unless the -h option is given.

samtools view -bS sample.sam > sample.bam

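Convert a SAM file into a BAM file. The -b option makes the output BAM, which is compressed. The -S option once declared that the input was SAM; current versions detect the input format automatically and ignore it.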

samtools view sample_sorted.bam "chr1:10-13"

Extract all the reads aligned to the specified range, which are those aligned to the reference element named chr1 that overlap its 10th, 11th, 12th or 13th base. The results are printed to stdout in SAM format, without the header. An index of the input file, as created by samtools index, is required for extracting reads according to their mapping position in the reference genome.
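What it means for a read to fall within a region such as "chr1:10-13" can be made concrete with a short sketch. It is an illustration under stated assumptions rather than the SAMtools implementation: region coordinates are taken as 1-based and inclusive, only the name:start-end form is handled, and a read's extent on the reference is computed from its POS and the CIGAR operations that consume reference bases (M, D, N, = and X). The function names are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that advance along the reference


def parse_region(region: str):
    """Split 'chr1:10-13' into (name, start, end); 1-based, inclusive."""
    name, _, span = region.rpartition(":")
    start, _, end = span.replace(",", "").partition("-")
    return name, int(start), int(end)


def reference_end(pos: int, cigar: str) -> int:
    """Last reference base (1-based) covered by an alignment starting at pos."""
    span = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)
    return pos + span - 1


def overlaps(record_line: str, region: str) -> bool:
    """True if the SAM record covers at least one base of the region."""
    name, start, end = parse_region(region)
    fields = record_line.split("\t")
    rname, pos, cigar = fields[2], int(fields[3]), fields[5]
    return rname == name and pos <= end and reference_end(pos, cigar) >= start


# Made-up records: the first ends at base 10, the second starts at base 14.
touching = "r1\t0\tchr1\t5\t60\t6M\t*\t0\t0\tACGTAC\tIIIIII"
outside = "r2\t0\tchr1\t14\t60\t4M\t*\t0\t0\tACGT\tIIII"
print(overlaps(touching, "chr1:10-13"), overlaps(outside, "chr1:10-13"))
```

A read only needs to share a single base with the range to be reported, so it does not have to lie entirely inside it.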

samtools view -h -b sample_sorted.bam "chr1:10-13" > tiny_sorted.bam

Extract the same reads as above, but instead of displaying them, write them to a new BAM file, tiny_sorted.bam. The -b option makes the output compressed and the -h option causes the SAM headers to be output also. These headers include a description of the reference that the reads in sample_sorted.bam were aligned to and will be needed if the tiny_sorted.bam file is to be used with some of the more advanced SAMtools commands. The order of extracted reads is preserved.

tview: samtools tview sample_sorted.bam

Start an interactive viewer to visualize a small region of the reference, the reads aligned to it, and mismatches. Within the view, it is possible to jump to a new location by typing g: and a location, like g:chr1:10,000,000. If the reference element name and following colon are replaced with =, the current reference element is used; for example, if g:=10,000,200 is typed after the previous "goto" command, the viewer jumps to the region 200 base pairs down on chr1. Typing ? brings up help information for scroll movement, colors, and views.

samtools tview -p chrM:1 sample_chrM.bam UCSC_hg38.fa

Set the start position with -p and compare the reads against the given reference sequence.

samtools tview -d T -p chrY:10,000,000 sample_chrY.bam UCSC_hg38.fa >> save.txt

samtools tview -d H -p chrY:10,000,000 sample_chrY.bam UCSC_hg38.fa >> save.html

Save the screen as text (-d T) or as HTML (-d H).

sort: samtools sort -o sorted_out unsorted_in.bam

Read the specified unsorted_in.bam as input, sort it by aligned read position, and write it out to sorted_out. The output type can be SAM, BAM, or CRAM, and is determined automatically from the file extension of sorted_out.
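Sorting by aligned read position amounts to ordering records by a two-part key: the reference element first, then the position on it. The sketch below shows such a key for SAM text. It is a simplification and not the SAMtools code: it assumes that reference elements are ranked in the order of the @SQ lines in the header, that reads without a reference ("*") go last, and it ignores any further tie-breaking. The names reference_order and coordinate_key are made up.

```python
def reference_order(header_lines):
    """Map each @SQ sequence name to its rank in the header."""
    names = []
    for line in header_lines:
        if line.startswith("@SQ"):
            tags = dict(f.split(":", 1) for f in line.rstrip("\n").split("\t")[1:])
            names.append(tags["SN"])
    return {name: rank for rank, name in enumerate(names)}


def coordinate_key(record: str, order: dict):
    """Sort key: (rank of reference element, leftmost position)."""
    fields = record.split("\t")
    rname, pos = fields[2], int(fields[3])
    # Reads with no reference element ('*') sort after all placed reads.
    return (order.get(rname, len(order)), pos)


# Made-up header and records; note that chr2 precedes chr1 in this header.
header = ["@SQ\tSN:chr2\tLN:500", "@SQ\tSN:chr1\tLN:1000"]
records = [
    "a\t0\tchr1\t50\t60\t4M",
    "b\t4\t*\t0\t0\t*",
    "c\t0\tchr2\t70\t60\t4M",
    "d\t0\tchr1\t7\t60\t4M",
]

order = reference_order(header)
sorted_records = sorted(records, key=lambda r: coordinate_key(r, order))
print([r.split("\t")[0] for r in sorted_records])
```

The ordering of reference elements comes from the header rather than from their names, which is why chr2 sorts ahead of chr1 in this made-up example.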

samtools sort -m 5000000 unsorted_in.bam sorted_out

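Read the specified unsorted_in.bam as input and sort it using at most about 5 million bytes (5 MB) of memory per thread; the -m value is in bytes unless it carries a K, M, or G suffix. When the input does not fit within that limit, sorted chunks are written to temporary BAM files named sorted_out.0000.bam, sorted_out.0001.bam, etc., which are then merged into the final output. This example uses the older syntax in which the last argument is an output prefix; current versions take the output file through -o, as in the previous example.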
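Sorting under a memory limit is usually done as an external merge sort: fill a buffer, sort it, write it out as a chunk, and finally merge the sorted chunks. The following sketch shows the idea on plain integers standing in for alignment positions. It is a conceptual illustration, not the SAMtools implementation; the function name external_sort and the limit expressed as a number of items (rather than bytes) are assumptions made for brevity.

```python
import heapq
import tempfile


def external_sort(numbers, max_in_memory):
    """Yield the ints in sorted order while buffering at most max_in_memory of them."""
    chunks = []
    buffer = []

    def spill():
        # Write the sorted buffer to a temporary file and start a new chunk.
        f = tempfile.TemporaryFile("w+")
        f.writelines(f"{n}\n" for n in sorted(buffer))
        f.seek(0)
        chunks.append(f)
        buffer.clear()

    for n in numbers:
        buffer.append(n)
        if len(buffer) >= max_in_memory:
            spill()
    if buffer:
        spill()

    # heapq.merge lazily interleaves the chunks, each of which is already sorted.
    yield from heapq.merge(*[(int(line) for line in f) for f in chunks])
    for f in chunks:
        f.close()


positions = [5, 3, 9, 1, 7, 2, 8]
print(list(external_sort(positions, max_in_memory=3)))
```

Each chunk is sorted on its own, so values in one chunk are not all smaller than values in the next; it is the final merge step that produces a single fully ordered output.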

index: samtools index sorted.bam

Create an index file, sorted.bam.bai, for the sorted.bam file.
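One reason an index makes region look-ups fast is that alignments are grouped into bins by the interval they cover, so a query only has to visit the bins that can overlap it. The function below is a Python transcription of the reg2bin calculation given in the SAM specification for the BAI binning scheme. It takes a 0-based, half-open interval and is shown as background only; the index file contains more than bin numbers, and other index types use different parameters.

```python
def reg2bin(beg: int, end: int) -> int:
    """Bin number for the 0-based, half-open interval [beg, end)."""
    end -= 1
    if beg >> 14 == end >> 14:  # fits in one 16 kbp bin
        return ((1 << 15) - 1) // 7 + (beg >> 14)
    if beg >> 17 == end >> 17:  # fits in one 128 kbp bin
        return ((1 << 12) - 1) // 7 + (beg >> 17)
    if beg >> 20 == end >> 20:  # fits in one 1 Mbp bin
        return ((1 << 9) - 1) // 7 + (beg >> 20)
    if beg >> 23 == end >> 23:  # fits in one 8 Mbp bin
        return ((1 << 6) - 1) // 7 + (beg >> 23)
    if beg >> 26 == end >> 26:  # fits in one 64 Mbp bin
        return ((1 << 3) - 1) // 7 + (beg >> 26)
    return 0  # spans more than one 64 Mbp bin


# A 50-base alignment near the start of a reference falls in the first 16 kbp bin.
print(reg2bin(999, 1049))
# An interval straddling a 16 kbp boundary is assigned to the enclosing 128 kbp bin.
print(reg2bin(16000, 17000))
```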

References

[1] "SAM tools". SourceForge.
[2] "Releases · samtools/samtools". github.com. Retrieved 2024-09-12.
[3] Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. (August 2009). "The Sequence Alignment/Map format and SAMtools" (PDF). Bioinformatics. 25 (16): 2078–9. doi:10.1093/bioinformatics/btp352. PMC 2723002. PMID 19505943.
[4] Bonfield JK, Marshall J, Danecek P, Li H, Ohan V, Whitwham A, et al. (February 2021). "HTSlib: C library for reading/writing high-throughput sequencing data". GigaScience. 10 (2). doi:10.1093/gigascience/giab007. PMC 7931820. PMID 33594436.
[5] Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, et al. (February 2021). "Twelve years of SAMtools and BCFtools". GigaScience. 10 (2). doi:10.1093/gigascience/giab008. PMC 7931819. PMID 33590861.
[6] IGV


Component	Description
view	Converts between SAM, BAM, and CRAM formats; filters alignments by region, flags, or quality. Essential for initial inspection and subsetting of data.^[14]
sort	Sorts alignments by coordinate or query name, producing coordinate-sorted BAM files required for most downstream analyses; supports temporary file management for large inputs.^[14]^[10]
index	Generates .bai or .csi index files for BAM, or .crai index files for CRAM, enabling fast random access to genomic regions without full file loading; FASTA files are indexed with the separate `faidx` command.^[14]
mpileup	Generates a textual pileup of aligned bases at each genomic position; for BCF or VCF output suitable for variant calling, use bcftools mpileup. Includes options for indel handling and base quality adjustments.^[14]^[9]
merge	Combines multiple sorted alignment files into one, preserving metadata; useful for aggregating results from parallel processing or multi-sample experiments.^[14]
markdup	Identifies and flags PCR duplicates in sorted alignments based on mapping position and orientation; outputs updated BAM with duplicate metrics.^[14]^[10]
stats	Generates comprehensive statistics on alignments, including total reads, mapping rates, insert size distributions, and per-chromosome coverage; aids in quality assessment.^[14]

Field	Description	Example
QNAME	Query name (read identifier), up to 254 characters	read_001
FLAG	Bitwise flag indicating read properties (e.g., paired, unmapped)	0 (unpaired, mapped)
RNAME	Reference sequence name, or '*' if unmapped	chr1
POS	1-based leftmost mapping position, or 0 if unmapped	1000
MAPQ	Mapping quality (Phred-scaled probability that the mapping position is wrong), 255 if unavailable	60
CIGAR	Concise Idiosyncratic Gapped Alignment Report string describing matches, insertions, deletions, etc.	50M (50 aligned bases, matching or mismatching)
RNEXT	Reference name of the mate or next read (formerly MRNM), '=' if the same as RNAME, or '*' if unavailable	=
PNEXT	1-based position of the mate or next read (formerly MPOS), or 0 if unavailable	2000
TLEN	Observed template length (insert size), signed	1000
SEQ	Query sequence as a string of ACGTN, or '*' if unavailable	AGCT...
QUAL	ASCII-encoded Phred quality scores for SEQ bases (+33 offset), or '*' if unavailable	!''*+...
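The eleven mandatory fields map directly onto a tab-separated line of SAM text. The sketch below parses one made-up record into named fields and decodes two of the encodings mentioned in the table: base qualities (ASCII code minus 33) and mapping quality (a Phred-scaled value Q corresponding to an error probability of 10^(-Q/10)). The class and function names are hypothetical, the mate fields use the current names RNEXT and PNEXT, and optional tags are kept as raw strings.

```python
from typing import List, NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: List[str]  # optional TAG:TYPE:VALUE fields, left unparsed


def parse_record(line: str) -> SamRecord:
    """Split one SAM alignment line into its mandatory fields and optional tags."""
    f = line.rstrip("\n").split("\t")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10], f[11:])


def base_qualities(qual: str) -> List[int]:
    """Decode a QUAL string into Phred scores (ASCII code minus 33)."""
    return [] if qual == "*" else [ord(c) - 33 for c in qual]


def mapq_error_probability(mapq: int) -> float:
    """Probability that the mapping position is wrong, for a Phred-scaled MAPQ."""
    return 10 ** (-mapq / 10)


# A made-up record for illustration.
line = "read_001\t99\tchr1\t1000\t60\t4M\t=\t2000\t1004\tAGCT\t!''*\tNM:i:0"
rec = parse_record(line)
print(rec.rname, rec.pos, rec.cigar, rec.tags)
print(base_qualities(rec.qual))
```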
