SAMtools
from Wikipedia
SAMtools
Original author: Heng Li
Developers: John Marshall, Petr Danecek, et al.[1]
Initial release: 2009
Stable release: 1.21 / September 12, 2024[2]
Written in: C
Operating system: Unix-like
Type: Bioinformatics
License: BSD, MIT
Website: www.htslib.org

SAMtools is a set of utilities for interacting with and post-processing short DNA sequence read alignments in the SAM (Sequence Alignment/Map), BAM (Binary Alignment/Map) and CRAM formats, written by Heng Li. These files are generated as output by short read aligners like BWA. Both simple and advanced tools are provided, supporting complex tasks like variant calling and alignment viewing as well as sorting, indexing, data extraction and format conversion.[3] SAM files can be very large (tens of Gigabytes is common), so compression is used to save space. SAM files are human-readable text files, and BAM files are simply their binary equivalent, whilst CRAM files are a restructured column-oriented binary container format. BAM files are typically compressed and more efficient for software to work with than SAM. SAMtools makes it possible to work directly with a compressed BAM file, without having to uncompress the whole file. Additionally, since the format for a SAM/BAM file is somewhat complex - containing reads, references, alignments, quality information, and user-specified annotations - SAMtools reduces the effort needed to use SAM/BAM files by hiding low-level details.

As third-party projects were trying to use code from SAMtools despite it not being designed to be embedded in that way, the decision was taken in August 2014 to split the SAMtools package into a stand-alone software library with a well-defined API (HTSlib),[4] a project for variant calling and manipulation of variant data (BCFtools), and the stand-alone SAMtools package for working with sequence alignment data.[5]

Usage and commands


Like many Unix commands, SAMtools commands follow a stream model, where data runs through each command as if carried on a conveyor belt. This allows combining multiple commands into a data processing pipeline. Although the final output can be very complex, only a limited number of simple commands are needed to produce it. If not specified, the standard streams (stdin, stdout, and stderr) are assumed. Data sent to stdout are printed to the screen by default but are easily redirected to another file using the normal Unix redirectors (> and >>), or to another command via a pipe (|).
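The stream model described above can also be driven from a script. The following is a minimal sketch, not part of SAMtools itself: run_pipeline is a hypothetical helper that chains arbitrary commands through pipes, and the file names in SAMTOOLS_PIPELINE are placeholders. It assumes the commands read their initial input from files rather than from the terminal.

```python
import subprocess


def run_pipeline(commands):
    """Run commands so that each one's stdout feeds the next one's stdin.

    Returns the final command's stdout (bytes) and every exit status.
    """
    procs = []
    upstream = None  # stdout pipe of the previous command, if any
    for argv in commands:
        proc = subprocess.Popen(
            argv,
            stdin=subprocess.DEVNULL if upstream is None else upstream,
            stdout=subprocess.PIPE,
        )
        if upstream is not None:
            # The child now owns this end; close the parent's copy so the
            # upstream command sees a broken pipe if the downstream one exits.
            upstream.close()
        upstream = proc.stdout
        procs.append(proc)
    output, _ = procs[-1].communicate()
    for proc in procs[:-1]:
        proc.wait()
    return output, [proc.returncode for proc in procs]


# Placeholder file names; intended to mirror the shell pipeline
#   samtools view -b sample.sam | samtools sort -o sample_sorted.bam -
SAMTOOLS_PIPELINE = [
    ["samtools", "view", "-b", "sample.sam"],
    ["samtools", "sort", "-o", "sample_sorted.bam", "-"],
]
```

Calling run_pipeline(SAMTOOLS_PIPELINE) would require samtools on the PATH and an existing sample.sam; the list is only defined here, not executed.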

SAMtools commands


SAMtools provides the following commands, each invoked as samtools <subcommand>:

view
The view command filters SAM or BAM formatted data. Using options and arguments it understands what data to select (possibly all of it) and passes only that data through. Input is usually a sam or bam file specified as an argument, but could be sam or bam data piped from any other command. Possible uses include extracting a subset of data into a new file, converting between BAM and SAM formats, and just looking at the raw file contents. The order of extracted reads is preserved.
sort
The sort command sorts a BAM file by the position of each read in the reference, as determined by its alignment. The sort key is the reference element together with the coordinate at which the first matched base of the read aligns. In current releases the sorted output is written to stdout by default and can be directed to a file with the -o option. As sorting is memory intensive and BAM files can be large, the -m option limits the approximate amount of memory used per thread; when the input does not fit in memory, sorted chunks are written to temporary files, which are then merged to produce a complete sorted BAM file.
index
The index command creates a new index file that allows fast look-up of data in a (sorted) BAM or CRAM file. Like an index on a database, the generated *.bam.bai file (or *.csi, or *.cram.crai for CRAM) allows programs that can read it to work more efficiently with the data in the associated files.
tview
The tview command starts an interactive ASCII-based viewer that can be used to visualize how reads are aligned to specified small regions of the reference genome. Compared to a graphics-based viewer like IGV,[6] it has few features. Within the view, it is possible to jump to different positions along reference elements (using 'g') and display help information ('?').
mpileup
The mpileup command produces a pileup format (or BCF) file giving, for each genomic coordinate, the overlapping read bases and indels at that position in the input BAM file(s). This can be used for SNP calling, for example.
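flagstat
The flagstat command counts the alignments in each category defined by the FLAG field (for example mapped, paired, or duplicate reads) and prints the totals.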
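To picture what the mpileup entry in the list above means by "for each genomic coordinate, the overlapping read bases", here is a toy sketch. It is a deliberate simplification and not how SAMtools is implemented: toy_pileup is a hypothetical function, reads are given as (1-based start, sequence) pairs, and every read is assumed to align without indels or clipping.

```python
from collections import defaultdict


def toy_pileup(reads):
    """Group read bases by the reference position they cover.

    reads: iterable of (start, sequence) with 1-based start positions,
    assuming gap-free alignments (an assumption made for illustration).
    """
    columns = defaultdict(list)
    for start, seq in reads:
        for offset, base in enumerate(seq):
            columns[start + offset].append(base)
    # Return plain dict ordered by coordinate for easy inspection.
    return dict(sorted(columns.items()))


example = toy_pileup([(10, "ACGT"), (12, "GTTA")])
# Position 12 and 13 are covered by both reads.
```

A real pileup additionally accounts for CIGAR operations, base qualities and strand, which this sketch ignores.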

Examples

view
samtools view sample.bam > sample.sam

Convert a bam file into a sam file.

samtools view -bS sample.sam > sample.bam

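Convert a sam file into a bam file. The -b option requests BAM output. The -S option, which told older versions that the input is SAM, is ignored by current releases because the input format is detected automatically.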

samtools view sample_sorted.bam "chr1:10-13"

Extract all the reads aligned to the range specified, which are those that are aligned to the reference element named chr1 and cover its 10th, 11th, 12th or 13th base. The results are printed to stdout in SAM format, without the header. An index of the input file, as created by samtools index, is required for extracting reads according to their mapping position in the reference genome.
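The selection rule above (a read is reported if it covers any base in the requested range) can be sketched as follows. This is an illustration only: parse_region and overlaps are hypothetical helpers, only the full name:start-end form is handled, and coordinates are treated as 1-based and inclusive.

```python
def parse_region(region):
    """Split 'name:start-end' into (name, start, end), 1-based inclusive.

    Only the full form is handled here; thousands separators such as
    'chr1:10,000-20,000' are tolerated.
    """
    name, _, span = region.rpartition(":")
    start_text, _, end_text = span.partition("-")
    start = int(start_text.replace(",", ""))
    end = int(end_text.replace(",", ""))
    return name, start, end


def overlaps(read_start, read_end, region_start, region_end):
    """True if the read covers at least one base of the region."""
    return read_start <= region_end and read_end >= region_start


name, start, end = parse_region("chr1:10-13")
```

A read aligned to positions 5 through 10 would be reported for chr1:10-13, while a read starting at position 14 would not.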

samtools view -h -b sample_sorted.bam "chr1:10-13" > tiny_sorted.bam

Extract the same reads as above, but instead of displaying them, write them to a new bam file, tiny_sorted.bam. The -b option makes the output compressed and the -h option causes the SAM headers to be output also. These headers include a description of the reference that the reads in sample_sorted.bam were aligned to and will be needed if the tiny_sorted.bam file is to be used with some of the more advanced SAMtools commands. The order of extracted reads is preserved.

tview
samtools tview sample_sorted.bam

Start an interactive viewer to visualize a small region of the reference, the reads aligned, and mismatches. Within the view, you can jump to a new location by typing g: and a location, like g:chr1:10,000,000. If the reference element name and following colon are replaced with =, the current reference element is used, i.e. if g:=10,000,200 is typed after the previous "goto" command, the viewer jumps to the region 200 base pairs down on chr1. Typing ? brings up help information for scroll movement, colors, views, and so on.

samtools tview -p chrM:1 sample_chrM.bam UCSC_hg38.fa

Set start position and compare.

samtools tview -d T -p chrY:10,000,000 sample_chrY.bam UCSC_hg38.fa >> save.txt
samtools tview -d H -p chrY:10,000,000 sample_chrY.bam UCSC_hg38.fa >> save.html

Save screen in .txt or .html.

sort
samtools sort -o sorted_out unsorted_in.bam

Read the specified unsorted_in.bam as input, sort it by aligned read position, and write it out to sorted_out. Type of output can be either sam, bam, or cram, and will be determined automatically by sorted_out's file-extension.

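samtools sort -m 5000000 -o sorted_out.bam unsorted_in.bam

Read the specified unsorted_in.bam as input and sort it using at most approximately 5,000,000 bytes (about 5 MB) of memory per thread; the -m value is given in bytes unless a K, M or G suffix is used. When the data does not fit in memory, sorted chunks are written to temporary files and merged automatically into sorted_out.bam. Older releases used the form samtools sort -m 5000000 unsorted_in.bam sorted_out, where sorted_out was an output prefix.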

index
samtools index sorted.bam

Creates an index file, sorted.bam.bai, for the sorted.bam file.

from Grokipedia
SAMtools is a suite of command-line utilities written in C for manipulating and analyzing high-throughput sequencing data, primarily focused on processing alignments stored in the Sequence Alignment/Map (SAM), Binary Alignment/Map (BAM), and CRAM formats. It enables essential operations such as viewing, sorting, indexing, merging, and generating per-position summaries from read alignments, supporting efficient handling of large datasets from next-generation sequencing platforms. Originally developed to facilitate post-processing of alignments for projects like the 1000 Genomes Project, SAMtools has become a foundational tool in bioinformatics pipelines worldwide.

Introduced in 2009 by Heng Li and colleagues from the Broad Institute, Wellcome Trust Sanger Institute, and other institutions as part of the 1000 Genomes Project Data Processing Subgroup, SAMtools was released alongside the specification of the SAM format, a flexible, tab-delimited text format for representing sequence alignments against reference genomes. The software addressed the need for standardized, efficient tools to manage the growing volume of sequencing data, with its binary BAM format providing compact storage and rapid random-access capabilities. Over time, the project evolved: in 2010, variant calling components were separated into BCFtools, and by 2014, the codebase was restructured into three coordinated repositories (HTSlib, a C library for high-throughput sequencing data I/O; SAMtools, for alignment manipulation; and BCFtools, for variant data) to improve modularity and maintenance. This restructuring marked the 1.0 release of SAMtools, which doubled the codebase size and introduced support for the CRAM format for further compression.
Key features of SAMtools include threading for parallel processing (added in version 0.1.19), on-the-fly indexing during file writing (version 1.10), and utilities like samtools view for format conversion and subsetting, samtools sort for coordinate or name ordering, samtools index for enabling fast queries, and samtools stats for generating alignment summaries. It supports alignments from short reads (e.g., Illumina) to long reads (up to 128 Mbp from PacBio or Nanopore), making it versatile across sequencing technologies and species, including vertebrates, plants, and microbes.

SAMtools has had a profound impact on research, with its original 2009 publication cited over 56,000 times and the software installed more than seven million times via package managers like Bioconda. It is actively maintained, with extensive testing (more than 700 unit tests) across platforms and ongoing enhancements for handling massive datasets, including 64-bit integer support for large genomes; the latest stable release series (version 1.22) appeared in 2025. As of 2025, SAMtools and its sister projects continue to underpin major genomic analyses, ensuring compatibility with evolving standards in the field.

Introduction

Overview

SAMtools is a widely used software suite in bioinformatics for processing and analyzing high-throughput sequencing data, particularly alignments between sequencing reads and reference genomes. It provides a collection of command-line utilities that enable efficient manipulation of files in the Sequence Alignment/Map (SAM), Binary Alignment/Map (BAM), and CRAM formats, supporting tasks such as viewing alignments, sorting, merging, indexing, and generating consensus sequences. Developed to address the challenges of handling massive datasets from next-generation sequencing technologies, SAMtools facilitates downstream analyses like variant calling and structural variant detection by ensuring fast and reliable data access.

The core of SAMtools relies on the HTSlib library, which implements low-level input/output operations for compressed and indexed genomic files, allowing for random access to specific regions without loading entire datasets into memory. This design makes it suitable for the terabyte-scale alignment files common in modern projects. Originally introduced as part of the 1000 Genomes Project, SAMtools has become a foundational tool in sequencing pipelines, integrated with aligners like BWA and variant callers like GATK.

Over the years, the project has expanded to include complementary tools like BCFtools for handling variant data in binary call format (BCF), reflecting its evolution into a broader toolkit for genomic data. Its open-source nature and active maintenance by a global developer community ensure compatibility with emerging sequencing technologies, including long-read platforms. SAMtools remains essential for reproducible research, with its utilities cited in thousands of studies for enabling scalable bioinformatics workflows.

Key Components

SAMtools is a modular software suite centered on the HTSlib library and a collection of command-line tools for manipulating high-throughput sequencing alignments in SAM, BAM, and CRAM formats. HTSlib, a C library developed as part of the project, provides low-level input/output operations, including parsing, compression, and indexing support, enabling efficient handling of large datasets across local and remote files. This library forms the backbone, allowing tools to perform operations like format conversion and region-based querying without redundant code. The suite's primary tools are subcommands under the samtools executable, categorized into file manipulation, processing, and analysis functions. These utilities support Unix-style piping for seamless integration into bioinformatics workflows and leverage multi-threading for performance on modern hardware. Originally introduced to address the need for standardized alignment handling, the components emphasize speed and compatibility with evolving sequencing technologies.
Component | Description
view | Converts between SAM, BAM, and CRAM formats; filters alignments by region, flags, or quality. Essential for initial inspection and subsetting of data. FASTA/FASTQ sequences are extracted from alignments with the related fasta and fastq subcommands.
sort | Sorts alignments by coordinate or query name, producing coordinate-sorted BAM files required for most downstream analyses; supports temporary file management for large inputs.
index | Generates .bai or .csi index files for BAM (.crai for CRAM), enabling fast random access to genomic regions without full file loading; FASTA files are indexed separately via faidx.
mpileup | Generates a textual pileup of aligned bases at each genomic position; for BCF or VCF output suitable for variant calling, use bcftools mpileup. Includes options for base quality adjustments.
merge | Combines multiple sorted alignment files into one, preserving metadata; useful for aggregating results from parallel processing or multi-sample experiments.
markdup | Identifies and flags PCR duplicates in sorted alignments based on mapping position and orientation; outputs updated BAM with duplicate metrics.
stats | Generates comprehensive statistics on alignments, including total reads, mapping rates, insert size distributions, and per-chromosome coverage; aids in quality assessment.
Additional specialized tools, such as fixmate for correcting mate-pair information and calmd for base alignment quality adjustment, extend functionality for targeted tasks like error correction and amplicon analysis. Together, these components facilitate the full spectrum of alignment processing, from import to preparation for tertiary analysis tools.

History

Development Origins

SAMtools originated in late 2008 as part of efforts to standardize the representation and processing of high-throughput sequencing alignments for the 1000 Genomes Project, a large-scale initiative to sequence human genomes and identify genetic variants. The project required a flexible format to accommodate diverse sequencing technologies, such as Illumina/Solexa, AB/SOLiD, and Roche/454, and various alignment tools, enabling efficient downstream analyses like variant detection and genotype calling. Prior to this, alignments were often stored in proprietary or tool-specific formats, hindering interoperability and scalability for projects handling up to 10^11 base pairs of data.

The Sequence Alignment/Map (SAM) format was conceptualized and named on October 21, 2008, by Heng Li, a key developer from the Sanger Institute, in collaboration with members of the 1000 Genomes Project Data Processing Subgroup, including Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Gonçalo Abecasis, and Richard Durbin. Development emphasized streamability for processing large files without full loading into memory, leading to the adoption, on October 22, 2008, of fixed columns, optional tags, extended CIGAR strings for representing insertions and deletions, and a binning index for quick lookups. By November 3, 2008, a dual text/binary format was established, with the binary BAM format introduced on November 7 to support compression via BGZF (Blocked GNU Zip Format) for reduced storage and faster access. The final draft was sent out on December 8, 2008.

SAMtools, the accompanying software suite for manipulating SAM and BAM files, was first publicly released on December 22, 2008, initially supporting file viewing, sorting, indexing, and basic variant calling. This release addressed the immediate needs of the 1000 Genomes Project for modular tools that could interface with alignment software and facilitate genomic analyses.
The toolkit's design prioritized efficiency and extensibility, allowing integration with emerging sequencing pipelines while maintaining compatibility with the evolving SAM specification. Early development was driven by practical challenges in handling terabyte-scale datasets, with contributions from the broader bioinformatics community shaping its core utilities.

Major Releases

SAMtools' initial release, version 0.1.1, occurred on December 22, 2008, introducing core utilities for converting between the SAM text format and the binary BAM format, sorting alignments, indexing files, and generating pileup data for variant detection. This laid the groundwork for efficient processing of high-throughput sequencing alignments, with early versions emphasizing compact storage and basic manipulation to handle the growing volume of genomic data from projects like the 1000 Genomes Project.

The 0.x series evolved through over 20 releases up to version 0.1.20 in August 2014, incorporating incremental enhancements such as improved error handling and performance tweaks. A key advancement came in version 0.1.9 (October 2010), which restructured the integrated variant caller into the standalone BCFtools package for better modularity. Version 0.1.19 (March 2013) marked a performance milestone by adding multi-threading support for sorting and for BAM writing in the 'view' command, enabling faster processing on multi-core systems.

Version 1.0, released on August 15, 2014, represented a major architectural overhaul, splitting the monolithic package into three coordinated projects: HTSlib as the foundational C library for file I/O, BCFtools for variant calling and manipulation, and SAMtools dedicated to alignment-specific operations. This restructuring improved maintainability and interoperability, while introducing native support for the CRAM format, a reference-based compression scheme that reduces file sizes by up to 30% compared to BAM without sacrificing random access efficiency. Automatic detection of input/output formats (SAM, BAM, CRAM) was also added, simplifying user workflows.

Post-1.0 releases in the 1.x series prioritized threading and integration with modern sequencing paradigms. Version 1.10 (December 2019) introduced on-the-fly indexing during BAM and CRAM writing, eliminating separate indexing steps and accelerating throughput. Enhanced multi-threading across read/write operations followed in version 1.11 (2020), alongside new commands like 'ampliconclip' and 'ampliconstats' tailored for amplicon-based sequencing, which clip primers and compute coverage metrics to support targeted resequencing applications.

Later major releases have refined compression and sorting capabilities. Version 1.18 (July 2023) added minimizer-based sorting for faster coordinate queries and a '--duplicate-count' option in 'markdup' to track PCR duplicates more precisely. Version 1.19 (December 2023) extended the 'coverage' command with '--plot-depth' for visualizing read depth distributions and introduced lexicographical name sorting in 'merge' and 'sort' for consistent handling of multi-sample data. Version 1.20 (April 2024) introduced '--max-depth' for 'bedcov' and support for multiple '-d' options in 'fastq'. Version 1.21 (September 2024) added region filtering for 'cat' and a 'reset' command to remove auxiliary tags. Version 1.22 (May 2025) debuted the 'checksum' command for integrity verification of alignment files and shifted the default CRAM output to version 3.1, leveraging advanced codecs for better compression ratios while maintaining compatibility with tools supporting CRAM 3.0. The most recent update, version 1.22.1 (July 2025), primarily addresses bugs, including a use-after-free issue in 'mpileup -a' and buffer overflows in CRAM parsing, ensuring robustness for large-scale genomic analyses.

Supported Formats

SAM Format

The Sequence Alignment/Map (SAM) format is a TAB-delimited text-based standard for representing alignments of sequencing reads against reference sequences, designed to accommodate both short and long reads from diverse high-throughput sequencing platforms. Introduced in 2009 as part of the SAMtools suite, it addresses the need for a unified format amid varying alignment tools and sequencing technologies, such as Illumina and Roche/454, enabling efficient downstream analyses like variant calling. The format supports up to 128 megabase pairs per read and scales to datasets exceeding 100 gigabase pairs, as demonstrated in its adoption by the 1000 Genomes Project for processing large-scale alignments.

A SAM file consists of two main sections: an optional header and an alignment section. The header begins with lines prefixed by '@', providing metadata such as reference sequence entries (e.g., @SQ lines specifying sequence names and lengths), read group information (@RG for sample metadata like platform and library), program details (@PG for analysis steps), and comments (@CO). These header lines ensure reproducibility and facilitate coordinate sorting, with the specification recommending a specific order for tags to maintain compatibility. The alignment section follows, where each line represents a single read alignment or unmapped read, using 1-based coordinates for positions to align with biological conventions.

Each alignment line in the SAM format includes exactly 11 mandatory fields, separated by tabs, followed by zero or more optional fields. These mandatory fields are:
Field | Description | Example
QNAME | Query name (read identifier), up to 254 characters | read_001
FLAG | Bitwise flag indicating read properties (e.g., paired, unmapped) | 0 (unpaired, mapped, forward strand)
RNAME | Reference sequence name, or '*' if unmapped | chr1
POS | 1-based leftmost mapping position, or 0 if unmapped | 1000
MAPQ | Mapping quality (Phred-scaled probability that the mapping position is wrong), 255 if unavailable | 60
CIGAR | Concise Idiosyncratic Gapped Alignment Report string describing matches, insertions, deletions, etc. | 50M (50 aligned bases)
RNEXT (formerly MRNM) | Mate reference name for paired reads, '=' if the same as RNAME, or '*' if unavailable | =
PNEXT (formerly MPOS) | 1-based position of mate read, or 0 if unavailable | 2000
TLEN | Observed template length (insert size), signed | 1000
SEQ | Query sequence, or '*' if unavailable | AGCT...
QUAL | ASCII-encoded Phred quality scores for SEQ bases (+33 offset), or '*' if unavailable | !''*+...
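As a rough illustration of the field layout in the table above, the following sketch splits one alignment line into its 11 mandatory fields plus optional TAG:TYPE:VALUE fields. SamRecord, parse_sam_line and phred_scores are hypothetical names chosen for this example; the mate columns are called rnext and pnext here. Only the 'i' and 'f' tag types are converted, and everything else is kept as text.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: dict


def parse_sam_line(line):
    """Parse one tab-separated alignment line (not a header line)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("expected at least 11 tab-separated fields")
    tags = {}
    for item in fields[11:]:
        tag, value_type, value = item.split(":", 2)
        if value_type == "i":
            value = int(value)
        elif value_type == "f":
            value = float(value)
        tags[tag] = value
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tags,
    )


def phred_scores(qual):
    """Decode a QUAL string using the +33 offset; '*' means unavailable."""
    return [] if qual == "*" else [ord(ch) - 33 for ch in qual]


record = parse_sam_line(
    "read_001\t0\tchr1\t1000\t60\t4M\t=\t2000\t1000\tAGCT\t!'*+\tNM:i:1\tRG:Z:grp1"
)
```

The example line is made up for illustration; real files should be read with an established library rather than hand-rolled parsing.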
The FLAG field is a 12-bit bitmask encoding read attributes, such as 0x0001 for paired-end reads, 0x0004 for unmapped reads, and 0x0400 for duplicate-marked reads, allowing tools to filter alignments based on these properties. The CIGAR string uses extended operators like 'M' for alignment match/mismatch, 'I' for insertion to the reference, 'D' for deletion from the reference, 'N' for skipped reference regions (e.g., introns), and 'S' for soft-clipped bases outside the aligned region.

Optional fields appear after the mandatory ones in the format TAG:TYPE:VALUE, where TAG is a two-character code (e.g., 'NM' for edit distance), TYPE specifies the value format (e.g., 'i' for integer, 'Z' for string, 'f' for float), and VALUE holds the data. Common tags include AS for alignment score, MD for mismatch details, and RG for read group assignment, with over 100 standardized tags defined in the SAMtags specification to support advanced features like structural variant annotations. These optional fields enhance flexibility without mandating their presence, ensuring the format remains lightweight for basic use while extensible for complex analyses.

The SAM format's text-based design promotes human readability and interoperability across tools, but for efficiency, it pairs with the binary BAM equivalent, which compresses files (e.g., reducing a 112 Gbp dataset from 116 GB to under 30 GB) while preserving all information for random access via indexing. The specification, initially version 1.0 released in 2013 and maintained by the Global Alliance for Genomics and Health (GA4GH), has been updated to version 1.6 as of November 2024, with ongoing refinements.
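The bitwise flag and CIGAR operators described above lend themselves to a short sketch. The bit values follow the SAM specification; the short names in FLAG_BITS, and the helpers decode_flag and cigar_lengths, are labels invented for this example.

```python
import re

# Bit values are from the SAM specification; the names are informal labels.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "read1",
    0x80: "read2",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def decode_flag(flag):
    """List the labels of all bits set in a FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def cigar_lengths(cigar):
    """Return (bases consumed on the reference, bases consumed on the read)."""
    ref_len = query_len = 0
    for count, op in CIGAR_OP.findall(cigar):
        n = int(count)
        if op in "MDN=X":   # operators that advance along the reference
            ref_len += n
        if op in "MIS=X":   # operators that advance along the read
            query_len += n
    return ref_len, query_len
```

For a read at POS with CIGAR 5S45M2I3D10N40M, the alignment spans 98 reference bases, so it ends at POS + 97.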

BAM Format

The BAM (Binary Alignment/Map) format is a compressed binary representation of the SAM (Sequence Alignment/Map) format, designed for efficient storage and random access to high-throughput sequencing alignment data. Developed as part of the SAMtools suite, BAM retains all information from SAM, including mandatory fields such as query name (QNAME), alignment flag (FLAG), and reference sequence position (POS), while encoding them in a compact binary form to reduce file sizes significantly; for instance, 112 Gbp of uncompressed SAM data (approximately 116 GB) compresses to under 30 GB. This format uses little-endian byte order and supports both aligned and unaligned reads, making it suitable for diverse sequencing platforms and read types.

A BAM file begins with a 4-byte magic string "BAM\1", followed by a uint32_t indicating the length of the header text, which contains the textual SAM header (e.g., lines starting with @HD or @SQ) in raw form. This is succeeded by a uint32_t specifying the number of reference sequences, each described by a binary entry with the sequence name length (uint32_t), name (null-terminated string), and length (uint32_t). Alignment records follow, each prefixed by a uint32_t block size (excluding the size field itself) and comprising core fields such as reference ID (int32_t, -1 for unmapped), position (0-based int32_t), read name length (uint8_t), mapping quality (uint8_t), bin (uint16_t), number of CIGAR operations (uint16_t), FLAG (uint16_t), sequence length (uint32_t), next reference ID and position (int32_t each), template length (int32_t), and variable-length arrays for the read name, CIGAR operations (uint32_t array), sequence (packed uint8_t array using 4 bits per base, e.g., A=1, C=2), and quality scores (uint8_t array). Optional tags are encoded as key-value pairs (2-byte tag + 1-byte type + variable value), allowing flexible extension without altering the core structure.
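The header layout described above (magic string, header text length, header text, reference count, then one entry per reference) can be sketched with Python's struct module. This is an illustration on the uncompressed byte stream only; a real BAM file wraps these bytes in BGZF blocks, and build_bam_header and parse_bam_header are hypothetical helper names.

```python
import struct


def build_bam_header(sam_text, references):
    """Serialize header text and (name, length) pairs, little-endian."""
    text = sam_text.encode("ascii")
    out = b"BAM\x01" + struct.pack("<I", len(text)) + text
    out += struct.pack("<I", len(references))
    for name, length in references:
        raw_name = name.encode("ascii") + b"\x00"  # name is NUL-terminated
        out += struct.pack("<I", len(raw_name)) + raw_name
        out += struct.pack("<I", length)
    return out


def parse_bam_header(data):
    """Inverse of build_bam_header; returns (header text, references)."""
    if data[:4] != b"BAM\x01":
        raise ValueError("missing BAM magic string")
    (l_text,) = struct.unpack_from("<I", data, 4)
    offset = 8
    text = data[offset:offset + l_text].decode("ascii")
    offset += l_text
    (n_ref,) = struct.unpack_from("<I", data, offset)
    offset += 4
    references = []
    for _ in range(n_ref):
        (l_name,) = struct.unpack_from("<I", data, offset)
        offset += 4
        name = data[offset:offset + l_name - 1].decode("ascii")  # drop NUL
        offset += l_name
        (l_ref,) = struct.unpack_from("<I", data, offset)
        offset += 4
        references.append((name, l_ref))
    return text, references


blob = build_bam_header("@HD\tVN:1.6\n", [("chr1", 1000), ("chrM", 16569)])
```

The reference names and lengths in the example are placeholders; the alignment records that would follow the header are not handled here.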
BAM files employ BGZF (Blocked GNU Zip Format) compression, a gzip-compatible method that divides data into independent blocks of up to 64 KB, enabling parallel decompression and random access without full file loading. This compression is handled via the HTSlib library integrated with SAMtools, which also facilitates conversion between SAM and BAM (e.g., processing 112 Gbp of data in about 10 hours on standard hardware). For efficient querying, BAM files must be sorted by coordinate and indexed using the BAI (BAM Alignment Index) format, which employs a hierarchical binning system based on genomic regions (e.g., bin 0 covers an entire 512 Mbp range, with finer bins down to 16 kbp) combined with a linear index of file offsets. This indexing allows retrieval of alignments overlapping a specific interval with typically one disk seek, supporting operations like samtools view on regions with low memory overhead (under 30 MB for large datasets).

Key differences from SAM include the shift to 0-based positioning (versus SAM's 1-based), binary encoding of sequences and qualities (e.g., bases packed into 4-bit values using the order =ACMGRSVTWYHKDBN), and support for long CIGAR strings via the CG:B,I optional tag to avoid overflowing the 16-bit CIGAR operation count. These features make BAM the preferred format for SAMtools workflows, such as sorting (samtools sort), merging, and statistical analysis, where it provides substantial speed and space savings over text-based alternatives. The format's specification, version 1.6 as of November 2024, ensures backward compatibility while accommodating evolving sequencing technologies.
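The hierarchical binning mentioned above can be made concrete with a Python transcription of the reg2bin routine given in C in the SAM specification. It assumes a 0-based, half-open interval [beg, end) and the standard BAI scheme, in which the shifts of 14 to 26 bits correspond to windows of 2^14 = 16,384 up to 2^26 bases, with bin 0 covering the whole 2^29-base range.

```python
def reg2bin(beg, end):
    """Smallest BAI bin fully containing the 0-based interval [beg, end)."""
    end -= 1  # switch to the last covered position
    # (shift, index of the first bin at that level), finest level first
    for shift, first_bin in ((14, 4681), (17, 585), (20, 73), (23, 9), (26, 1)):
        if beg >> shift == end >> shift:
            return first_bin + (beg >> shift)
    return 0  # interval spans more than one 64 Mbp window


small = reg2bin(0, 100)        # fits in the first 16 kbp window
spanning = reg2bin(0, 16385)   # crosses a 16 kbp boundary
```

An alignment is stored under the bin returned for its own interval, and a region query then only needs to inspect the bins that can overlap the requested range.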

CRAM Format

The CRAM (Compressed Alignment/Map) format is a reference-based columnar storage format designed for high-efficiency compression of biological alignments, offering significant space savings over the BAM format while maintaining full compatibility with the SAM specification. Developed as an extension of the SAM/BAM ecosystem, CRAM encodes alignment data in a way that leverages the reference sequence to store only differences from the reference, such as substitutions, insertions, and deletions, rather than full read sequences. This approach enables compression ratios typically 3 to 4 times better than BAM for short-read data, with file sizes reduced by 50-70% in practice for Illumina sequencing outputs.

CRAM files are structured as a sequence of containers, beginning with a fixed 26-byte file definition header that identifies the format version and carries a file identifier, followed by a CRAM header container storing the SAM header and reference sequences (via checksums for validation). Subsequent containers group alignments into slices (logical units of up to 100,000 records), each preceded by a compression header that defines per-field encoding parameters. Slices consist of a core data block (a bit-packed stream of encoded alignment fields) and optional external blocks (byte streams for less compressible data like read names or quality scores), with all blocks compressed using algorithms such as LZMA or rANS entropy coding. The file concludes with an EOF container for integrity verification. This modular design supports random access via external indexing (e.g., .crai files) and selective decoding, allowing tools to load only relevant slices without decompressing the entire file.

Encoding in CRAM is field-specific and adaptive: core alignment attributes (e.g., flags, mapping quality, positions) use variable-length integer encodings such as ITF-8, while read features (e.g., base substitutions as delta offsets from the reference) are represented as arrays of operations to reconstruct the sequence on demand. An external reference is normally required for decoding, but CRAM optionally embeds reference slices for portability, with MD5-based validation to ensure consistency. Version 3.1, released in 2021, introduced advanced codecs including rANS4x16 for faster encoding, adaptive arithmetic coding, fqzcomp for quality scores, and a name tokenizer for read identifiers, yielding 7-15% additional compression gains over version 3.0 for high-coverage short reads, and enabling processing speeds up to 3 times faster than equivalent BAM operations in benchmarks like samtools flagstat on large datasets. With the 1.22 release (May 2025), CRAM version 3.1 became the default CRAM output in SAMtools and HTSlib. These enhancements are implemented in the HTSlib library, ensuring seamless integration with SAMtools commands such as view, sort, and index for CRAM I/O.

Compared to BAM's record-oriented BGZF compression, CRAM's columnar structure and reference dependency reduce redundancy in aligned data, making it particularly advantageous for archival storage in large-scale genomics projects, though it requires reference availability during decoding, a trade-off mitigated by widespread reference standardization. Support for controlled lossy compression (e.g., via quality score binning) further optimizes space for variant calling pipelines without impacting accuracy in downstream analyses. CRAM was first integrated into SAMtools with version 1.0 in 2014, evolving from early prototypes, with version 3.1 now the default CRAM version in recent releases.
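The ITF-8 variable-length integer encoding mentioned above can be sketched as follows. This reflects one reading of the CRAM specification, in which the number of leading 1 bits in the first byte gives the count of extra bytes and the 5-byte form keeps only the low 4 bits of its last byte; treat those layout details as assumptions to check against the specification. The function names are hypothetical, and values are handled as unsigned 32-bit integers.

```python
def itf8_encode(value):
    """Encode a 32-bit integer as 1 to 5 bytes (small values are shorter)."""
    v = value & 0xFFFFFFFF
    if v < 0x80:            # 7 bits
        return bytes([v])
    if v < 0x4000:          # 14 bits
        return bytes([0x80 | (v >> 8), v & 0xFF])
    if v < 0x200000:        # 21 bits
        return bytes([0xC0 | (v >> 16), (v >> 8) & 0xFF, v & 0xFF])
    if v < 0x10000000:      # 28 bits
        return bytes([0xE0 | (v >> 24), (v >> 16) & 0xFF,
                      (v >> 8) & 0xFF, v & 0xFF])
    # 32 bits: the final byte carries only the lowest 4 bits
    return bytes([0xF0 | (v >> 28), (v >> 20) & 0xFF,
                  (v >> 12) & 0xFF, (v >> 4) & 0xFF, v & 0x0F])


def itf8_decode(data, offset=0):
    """Return (unsigned value, number of bytes consumed)."""
    first = data[offset]
    if first < 0x80:
        return first, 1
    if first < 0xC0:
        return ((first & 0x3F) << 8) | data[offset + 1], 2
    if first < 0xE0:
        return (((first & 0x1F) << 16) | (data[offset + 1] << 8)
                | data[offset + 2]), 3
    if first < 0xF0:
        return (((first & 0x0F) << 24) | (data[offset + 1] << 16)
                | (data[offset + 2] << 8) | data[offset + 3]), 4
    return (((first & 0x0F) << 28) | (data[offset + 1] << 20)
            | (data[offset + 2] << 12) | (data[offset + 3] << 4)
            | (data[offset + 4] & 0x0F)), 5
```

Because most positions, flags and lengths in a slice are small numbers, an encoding like this stores them in one or two bytes instead of a fixed four.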

Core Commands

File Manipulation

SAMtools offers a suite of core commands dedicated to file manipulation, enabling users to view, convert, sort, index, merge, and otherwise process alignment files in SAM, BAM, and CRAM formats. These utilities are fundamental for handling high-throughput sequencing data, as they support efficient data preparation, subset extraction, and integration into bioinformatics pipelines without loading entire files into memory. Designed for speed and low memory usage, they work directly on the compressed BAM and CRAM formats, which keeps very large datasets manageable.

The samtools view command is the primary tool for inspecting and converting alignment files. It prints alignments to standard output in SAM format by default, can write BAM (-b) or CRAM (-C) instead, and sends output to a file given with -o. Region-specific extraction requires an indexed input file, allowing queries like samtools view input.bam chr1:10000-20000 to retrieve alignments in a genomic interval. The command is often piped into other tools for downstream analysis.

Sorting is handled by samtools sort, which orders records by coordinate (default) or read name (-n) and writes BAM by default. Coordinate sorting is a prerequisite for indexing and efficient querying. The command writes temporary files using the prefix given with -T and supports multi-threading with -@ for large inputs. For instance, samtools sort -o sorted.bam input.bam generates a coordinate-sorted file suitable for subsequent operations.

Once sorted, files can be indexed using samtools index to enable rapid region-based access. This creates a .bai (BAI) or .csi (CSI) index file alongside the input, with the -b or -c option selecting the index type; BAI is the standard choice for BAM files whose reference sequences are each shorter than 2^29 bases.
The command requires coordinate-sorted input and does not modify it: samtools index sorted.bam produces sorted.bam.bai for use with tools like samtools view or genome browsers. Indexing greatly improves performance on multi-gigabyte datasets by avoiding sequential scans.

Merging multiple sorted files is done with samtools merge, which combines inputs while maintaining sort order and merging headers. It accepts a list of BAM/CRAM files as arguments and writes to a specified output file, with options such as -n to indicate that the inputs are sorted by read name and -f to force overwriting an existing output. An example is samtools merge -o merged.bam sample1.bam sample2.bam, which suits consolidating lane-level alignments from sequencing runs. For simple concatenation without re-sorting, samtools cat joins files of compatible formats, with -h naming a file that supplies the output header.

Additional manipulation includes samtools split, which partitions a file by read group (RG) tag into separate output files named after the input, one per read group. Header replacement is handled by samtools reheader, which applies a new SAM header to a BAM/CRAM file: samtools reheader newheader.sam input.bam > output.bam. These commands support targeted restructuring, such as separating samples or correcting metadata in multi-sample studies.

For grouping reads, samtools collate places reads with the same name next to each other without performing a full sort, as in samtools collate -o output.bam input.bam, which prepares data for mate fixing and duplicate marking. Complementing this, samtools fixmate fills in mate information (flags and coordinates) in name-grouped files via samtools fixmate -O bam input.bam output.bam, and optionally removes secondary and unmapped reads (-r).
Finally, samtools markdup identifies and marks duplicate reads in coordinate-sorted files, with options to remove them (-r) or print statistics (-s), as in samtools markdup input.bam output.bam. Together, these utilities help preserve data integrity during manipulation, which is critical for accurate downstream analyses such as variant detection.
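
The collate, fixmate, sort, and markdup steps described above are usually run in that order. The sketch below only builds the command lines so the ordering is explicit; it does not execute anything. The helper name markdup_commands and the intermediate file names are assumptions made for illustration.

```python
import shlex
from typing import List


def markdup_commands(input_bam: str, prefix: str, threads: int = 1) -> List[List[str]]:
    """Build (but do not run) the commands for a duplicate-marking chain."""
    collated = f"{prefix}.collate.bam"     # hypothetical intermediate names
    fixed = f"{prefix}.fixmate.bam"
    positioned = f"{prefix}.sort.bam"
    marked = f"{prefix}.markdup.bam"
    return [
        # 1. Group reads by name so mates are adjacent.
        ["samtools", "collate", "-o", collated, input_bam],
        # 2. Fill in mate fields; -m adds the mate score tags markdup uses.
        ["samtools", "fixmate", "-m", collated, fixed],
        # 3. Sort by coordinate, as markdup requires.
        ["samtools", "sort", "-@", str(threads), "-o", positioned, fixed],
        # 4. Mark duplicates.
        ["samtools", "markdup", positioned, marked],
        # 5. Index the result for random access.
        ["samtools", "index", marked],
    ]


if __name__ == "__main__":
    for cmd in markdup_commands("input.bam", "sample", threads=4):
        print(shlex.join(cmd))
```

To run the chain, each list can be passed to subprocess.run(cmd, check=True) on a system where SAMtools is installed.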

Alignment Processing

SAMtools provides a suite of commands for processing alignment files in SAM, BAM, and CRAM formats, covering tasks such as viewing, filtering, sorting, indexing, and merging alignments in preparation for downstream genomic analysis. These operations are essential for managing high-throughput sequencing data and for efficient access to read alignments against reference genomes.

The samtools view command is the foundational tool for extracting and filtering alignments. It converts between formats (e.g., BAM to SAM) and restricts output to specific genomic regions given as positional arguments or, with the -L option, in a BED file, so users can focus on subsets of data without loading entire files into memory. For instance, samtools view -b input.bam chr1:1000-2000 outputs BAM-format alignments for the specified region.

Sorting and indexing are critical for optimizing alignment files for random access. The samtools sort command orders alignments by coordinate (default) or read name (with -n), producing the sorted output that many analyses require; it uses temporary files with the prefix given by -T to handle large inputs. Following sorting, samtools index generates binary indexes (BAI with -b, or CSI with -c for longer reference sequences), enabling fast region-based retrieval via samtools view without rescanning the entire file. This combination significantly reduces computational overhead in workflows involving repeated region queries.

Merging and concatenation support the integration of multiple alignment files, often from parallel processing or multi-sample experiments. The samtools merge command combines sorted files while preserving order and merging headers (or taking the header from a separate file with -h); it includes -f to force overwriting existing outputs.
In contrast, samtools cat simply concatenates files that share identical reference dictionaries, providing a lightweight way to append alignments without re-sorting.

Duplicate handling and mate-pair fixing address common artifacts in sequencing data. The samtools markdup command identifies and marks duplicate reads based on mapping coordinates and orientation, with options such as -r to remove them and -s to report statistics, improving accuracy in variant calling pipelines. Similarly, samtools fixmate updates mate information in name-sorted files, correcting flags and positions for paired-end reads; its -m option adds the mate score tags that markdup uses to choose which read of a duplicate set to keep.

Additional processing tools include samtools split, which divides files by read group into separate outputs for sample-specific workflows, and samtools reheader, which replaces headers without altering alignment records and can modify CRAM files in place with -i. Together these commands enable robust preprocessing of alignments, preserving data integrity and compatibility with tools such as variant callers.
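
The BAI index mentioned above relies on a hierarchical binning scheme, and the 2^29-base limit follows directly from it. The sketch below is a Python transcription of the reg2bin calculation as given in the SAM specification, to the best of my reading; treat it as an illustration rather than a reference implementation.

```python
def reg2bin(beg: int, end: int) -> int:
    """Return the BAI bin for a 0-based, half-open interval [beg, end).

    Bins come in six levels covering 2^29, 2^26, 2^23, 2^20, 2^17 and 2^14
    bases; the smallest bin that fully contains the interval is chosen.
    """
    end -= 1
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)
    return 0


if __name__ == "__main__":
    print(reg2bin(0, 100))        # fits in the first 16 kb bin
    print(reg2bin(16000, 17000))  # straddles a 16 kb boundary
    print(reg2bin(0, 1 << 29))    # spans the whole addressable range
```

Because the top-level bin covers 2^29 bases, longer reference sequences need the CSI format, which generalizes the same scheme with configurable depth.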

Statistics and Inspection

SAMtools provides several commands for generating statistics and inspecting alignment files in SAM, BAM, or CRAM formats, enabling users to assess data quality, alignment coverage, and read properties without extensive processing. These tools are essential for quality control in high-throughput sequencing workflows, offering both summary metrics and detailed views of the data. The primary commands are samtools stats, samtools flagstat, samtools idxstats, and samtools view, each targeting a specific aspect of inspection.

The samtools stats command computes comprehensive statistics from alignment files, producing a text report that can be visualized with the accompanying plot-bamstats script. It groups metrics into sections such as summary numbers (e.g., total reads, mapped percentage), insert size distributions, coverage depths, and biases, distinguishing between paired and unpaired reads based on SAM flags like PAIRED (0x1), READ1 (0x40), and READ2 (0x80). For instance, it reports averages such as insert size and coverage, along with histograms of insert sizes and quality scores, helping users identify issues such as library preparation artifacts or sequencing biases. Options like -c for custom coverage ranges (default: 1-1000) or -d to exclude duplicates enable targeted analysis, and region-specific queries are supported when the input is indexed.

For flag-based inspection, samtools flagstat analyzes the FLAG field of each alignment as defined in the SAM specification, counting reads in 13 categories such as total, mapped, paired, duplicates, and QC failures. It outputs counts and percentages for primary, secondary, and supplementary alignments, split by QC pass/fail status (FLAG 0x200), providing a quick overview of mapping quality and potential problems such as unmapped or improperly paired reads.
The default output is human-readable (e.g., "122 + 28 in total", meaning 122 QC-passed and 28 QC-failed reads), but JSON and TSV output are available for programmatic use, along with multi-threading via -@ for large files. The tool is lightweight and runs in a single pass, making it well suited to rapid quality checks during alignment pipelines.

The samtools idxstats command reports per-reference-sequence statistics: reference name, sequence length, number of mapped read segments, and number of unmapped read segments, in TAB-delimited format. It is fastest on BAM files indexed with samtools index; without an index it still works but must read the entire file, which is slower for large datasets. This is valuable for inspecting how alignments are distributed across chromosomes or contigs, highlighting uneven coverage or unmapped portions. Because it counts read segments, reads that map more than once or in multiple fragments may be counted several times. An example output line might read "* 0 0 100" for unmapped reads, aiding decisions about downstream filtering.

Inspection at the alignment level is provided by samtools view, which extracts and displays records, with filtering by region, mapping quality (via -q), flags (include/exclude with -f/-F), or tags. For coordinate-sorted and indexed inputs, it enables fast random access to regions (e.g., samtools view input.bam chr1:1000-2000), with output in SAM, BAM, or CRAM format. Options like -h to include the header, -c to count matching records without printing them, and -L to restrict output to regions in a BED file make it a versatile tool for examining specific alignments or converting formats during workflows.
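
The flag-based counting described above comes down to bit tests on the FLAG integer. The following sketch decodes flags using the bit values from the SAM specification and tallies a few categories. It is a simplified illustration and does not reproduce the exact category definitions used by samtools flagstat; the function names are hypothetical.

```python
from typing import Dict, List

# Bit values and names as defined in the SAM specification.
FLAG_NAMES = [
    (0x1, "PAIRED"), (0x2, "PROPER_PAIR"), (0x4, "UNMAP"), (0x8, "MUNMAP"),
    (0x10, "REVERSE"), (0x20, "MREVERSE"), (0x40, "READ1"), (0x80, "READ2"),
    (0x100, "SECONDARY"), (0x200, "QCFAIL"), (0x400, "DUP"),
    (0x800, "SUPPLEMENTARY"),
]


def decode_flag(flag: int) -> List[str]:
    """List the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_NAMES if flag & bit]


def summarize(flags: List[int]) -> Dict[str, int]:
    """Tally a few flagstat-like categories (simplified definitions)."""
    return {
        "total": len(flags),
        "mapped": sum(1 for f in flags if not f & 0x4),
        "paired": sum(1 for f in flags if f & 0x1),
        "duplicates": sum(1 for f in flags if f & 0x400),
        "qc_failed": sum(1 for f in flags if f & 0x200),
        "secondary": sum(1 for f in flags if f & 0x100),
        "supplementary": sum(1 for f in flags if f & 0x800),
    }


if __name__ == "__main__":
    print(decode_flag(99))
    print(summarize([99, 147, 4, 1123, 2064]))
```

The same bit tests underlie the -f and -F options of samtools view: -f keeps records with all given bits set, and -F drops records with any of them set.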

Usage and Examples

Basic Usage

SAMtools provides a straightforward command-line interface for essential operations on alignment files in SAM, BAM, and CRAM formats, such as conversion, sorting, indexing, and viewing. These core functions let users process high-throughput sequencing data efficiently, typically in a Unix shell using standard streams. For instance, input can be piped from other tools, and remote files can be accessed via FTP or HTTP URLs. Basic operations require SAMtools to be installed and typically involve specifying input files, output options, and optional flags for filtering or formatting.

The view command is fundamental for inspecting, converting, and filtering alignments. It reads SAM, BAM, or CRAM files and outputs SAM by default, but can produce BAM or CRAM when the corresponding flag is given. To convert a SAM file to BAM, use:

samtools view -b input.sam > output.bam


This compresses the text-based SAM into the binary BAM format for compact storage. For viewing the first few alignments without conversion, pipe to head:

samtools view input.bam | head -5


Region-specific extraction requires a coordinate-sorted and indexed input file. For example, to retrieve reads overlapping positions 10000-20000 on the reference sequence named 1:

samtools view -b input.bam "1:10000-20000" > region.bam
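
Region strings such as the one in this command use 1-based, inclusive coordinates. The sketch below shows one way to split such a string and convert it to the 0-based, half-open convention used by BED files and many programming interfaces. It is a simplified parser written for illustration (the names parse_region and to_half_open are hypothetical) and does not handle every form that SAMtools accepts.

```python
from typing import Optional, Tuple

Region = Tuple[str, Optional[int], Optional[int]]


def parse_region(region: str) -> Region:
    """Split 'name', 'name:start' or 'name:start-end' (1-based, inclusive)."""
    name, sep, span = region.rpartition(":")
    if not sep:
        return region, None, None          # whole reference sequence
    start_text, dash, end_text = span.partition("-")
    start = int(start_text.replace(",", ""))
    end = int(end_text.replace(",", "")) if dash and end_text else None
    if start < 1 or (end is not None and end < start):
        raise ValueError(f"invalid region: {region!r}")
    return name, start, end


def to_half_open(start: int, end: int) -> Tuple[int, int]:
    """Convert 1-based inclusive coordinates to 0-based half-open."""
    return start - 1, end


if __name__ == "__main__":
    name, start, end = parse_region("1:10000-20000")
    print(name, start, end)
    print(to_half_open(start, end))
```

Splitting on the last colon is a deliberate simplification so that reference names containing colons still parse in the common case.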


Filtering by read group, mapping quality, or flags is achieved with options like -r, -q, and -f/-F; for example, -F 4 keeps only mapped reads.

Sorting alignments by genomic coordinates is a prerequisite for many downstream analyses, including indexing and visualization. The sort command reorders the records of a BAM or CRAM file and writes a new sorted file. A basic example sorts by leftmost coordinate:

samtools sort -o output.sorted.bam input.bam


This uses up to 768 MiB of memory per thread by default and supports multi-threading with -@ for large files; for example, with 4 threads:

samtools sort -@ 4 -o output.sorted.bam input.bam


Sorting by read name instead (useful for paired-end processing) adds the -n flag. Because sort spills to temporary files when its memory allowance is exhausted, large datasets can be sorted with modest resources.

Indexing a sorted BAM or CRAM file enables rapid random access to specific genomic regions without scanning the entire file. The index command generates a binary index file (BAI for BAM, CRAI for CRAM). For a sorted BAM:

samtools index output.sorted.bam


This creates output.sorted.bam.bai alongside the input. For BGZF-compressed SAM (SAM.gz), the same command applies, producing a .bai index. CSI indices, which support longer reference sequences, can be requested with -c. Once a file is indexed, commands like view use the index for efficient querying.

Basic statistics on alignment files can be generated with flagstat, which reports counts of mapped, unmapped, and duplicate reads. Run:

samtools flagstat input.bam


This outputs a summary report, including total reads and mapping rates, that is essential for quality assessment. These operations form the foundation of SAMtools workflows and are often chained together for data preparation in pipelines.
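
When flagstat output needs to be checked in a script, its default text form can be parsed with a small helper. The sketch below assumes lines of the form "PASSED + FAILED label", as in the default human-readable output; the sample text uses made-up numbers, and the function name parse_flagstat is hypothetical. For production use, the JSON or TSV output is likely a more robust choice.

```python
import re
from typing import Dict, Tuple

LINE = re.compile(r"^(\d+) \+ (\d+) (.+)$")
PERCENT_SUFFIX = re.compile(r" \([^()]*%[^()]*\)$")

# Made-up numbers, for illustration only.
SAMPLE = """\
122 + 28 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
120 + 25 mapped (98.36% : 89.29%)
"""


def parse_flagstat(text: str) -> Dict[str, Tuple[int, int]]:
    """Map each category label to (QC-passed, QC-failed) counts."""
    result = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue
        label = PERCENT_SUFFIX.sub("", match.group(3))
        if label.startswith("in total"):
            label = "total"
        result[label] = (int(match.group(1)), int(match.group(2)))
    return result


if __name__ == "__main__":
    counts = parse_flagstat(SAMPLE)
    passed_total = counts["total"][0]
    passed_mapped = counts["mapped"][0]
    print(f"mapped fraction (QC-passed): {passed_mapped / passed_total:.4f}")
```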

Common Workflows

One of the most prevalent workflows in high-throughput sequencing analysis converts raw FASTQ files into compressed BAM or CRAM files for efficient storage and downstream analysis. It typically begins with alignment of reads to a reference genome using tools like BWA-MEM or minimap2, producing SAM output that is then processed with SAMtools commands to fix mate-pair information, sort by genomic position, mark duplicates, and convert to the desired format. The fixmate command fills in mate information for paired-end data, sort produces the coordinate ordering needed for indexing and variant calling, and markdup flags PCR duplicates so they do not bias coverage estimates. The workflow reduces file sizes significantly: CRAM files can be 23-55% of the size of equivalent BAM files (45-77% smaller), depending on the dataset and compression settings. It is foundational for whole-genome sequencing pipelines.

A representative pipeline pipes the aligner output directly into SAMtools for streaming processing: minimap2 -a -x sr reference.fa reads.fastq | samtools fixmate -m - - | samtools sort -@ 8 -T /tmp/temp - | samtools markdup -r - final.bam. This avoids intermediate files and uses multi-threading for speed on large datasets. Conversion to CRAM requires specifying the reference with -T reference.fa in the view command, enabling reference-based compression that stores differences rather than full sequences. This approach is particularly useful in storage-constrained environments, and CRAM files can still be decoded on the fly, region by region, without decompressing the whole file.

For whole-genome sequencing (WGS) or whole-exome sequencing (WES), a standard pipeline extends the alignment process into variant calling and refinement.
After initial mapping with BWA-MEM and processing with SAMtools fixmate and sort to produce sorted BAM files, base quality score recalibration (BQSR) and duplicate marking are applied using external tools like GATK, followed by merging multiple lanes with SAMtools merge. Variant calling then uses bcftools mpileup to generate genotype likelihoods from the BAM, piped into bcftools call for variant calling: bcftools mpileup -Ou -f ref.fa input.bam | bcftools call -mv -Ob -o calls.bcf. This produces a binary BCF file of called SNPs and indels, which can be indexed with bcftools index for quick querying. The pipeline relies on filtering options during calling, such as capping the per-file read depth with -d in mpileup, to balance sensitivity and specificity when detecting variants at the ~30x coverage typical for WGS.

Post-calling, VCF filtering refines variants using BCFtools to remove artifacts based on quality metrics like QUAL score, read depth (DP), and strand bias (SP). A common post-call filter keeps only well-supported sites with bcftools filter -i 'QUAL>20 && DP>10' -Ob -o filtered.bcf calls.bcf. SNPs and indels can be separated via TYPE annotations and given tailored rules; for example, indels may be required to have a minimum number of supporting reads (IDV > 2) to mitigate alignment errors. Pre-call options in mpileup, such as -L 250 to cap the per-file depth considered for indel calling, limit spurious calls in high-coverage regions. This step is crucial for reducing false positives, and thresholds are often tuned empirically against truth sets, using bcftools isec to count true and false positives. Integration with annotation tools follows for functional interpretation.

CRAM-specific workflows optimize for reference-dependent storage in collaborative or archival settings. Alignments should be position-sorted before encoding to maintain compression ratios of around 1:4 to 1:6 versus uncompressed SAM. As of version 1.22 (2025), SAMtools defaults to CRAM 3.1, which provides further compression improvements over previous versions.
Key commands include samtools view -T ref.fa cram.cram for on-demand decoding, and mpileup can read CRAM directly to produce pileups without conversion. Best practices include recording reference MD5 checksums (M5 tags) in the @SQ header lines and setting environment variables such as REF_PATH so that references can be located or fetched remotely, supporting access in distributed settings such as the European Nucleotide Archive. This format suits workflows where storage is a bottleneck, as partial decoding reduces I/O compared to BAM, though it demands consistent reference availability to avoid decoding failures.

Basic inspection and quality control form another routine workflow, often preceding analysis. After indexing a BAM with samtools index aligned.bam, the flagstat command summarizes alignment metrics: samtools flagstat aligned.bam reports total reads, the mapped percentage (typically >95% for good-quality data), and duplicates. Depth statistics via samtools depth -a aligned.bam > coverage.txt quantify per-position coverage, which helps identify coverage biases. These steps are quick to run and provide essential diagnostics without full reprocessing.
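
The per-position output of samtools depth can be reduced to a couple of summary numbers with a short script. The sketch below assumes the three tab-separated columns that samtools depth writes for a single input file (reference name, 1-based position, depth). The function name summarize_depth and the 10x threshold are illustrative choices, not part of SAMtools.

```python
from typing import Dict, Iterable


def summarize_depth(lines: Iterable[str], min_depth: int = 10) -> Dict[str, float]:
    """Compute mean depth and the fraction of positions at or above min_depth."""
    positions = 0
    total = 0
    covered = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        _chrom, _pos, depth_text = line.split("\t")[:3]
        depth = int(depth_text)
        positions += 1
        total += depth
        if depth >= min_depth:
            covered += 1
    if positions == 0:
        return {"positions": 0, "mean_depth": 0.0, "breadth": 0.0}
    return {
        "positions": positions,
        "mean_depth": total / positions,
        "breadth": covered / positions,
    }


if __name__ == "__main__":
    # Made-up rows standing in for the contents of coverage.txt.
    rows = ["chr1\t1\t0", "chr1\t2\t12", "chr1\t3\t30", "chr1\t4\t18"]
    print(summarize_depth(rows))
```

In practice the function can be given an open file handle, for example with open("coverage.txt") as handle: summarize_depth(handle), since it only iterates over lines.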

Integration and Extensions

With HTSlib and BCFtools

SAMtools, BCFtools, and HTSlib form a tightly integrated ecosystem for processing high-throughput sequencing data, with HTSlib serving as the foundational library for reading, writing, and manipulating formats such as SAM/BAM/CRAM for alignments and VCF/BCF for variants. HTSlib provides a unified API for these operations, allowing SAMtools and BCFtools to share core functionality such as binary format handling, indexing, and multi-threading support, which improves performance and ensures compatibility across the tools. This integration originated from the restructuring of the original SAMtools project, in which HTSlib was extracted as a standalone library to facilitate independent development and third-party embedding, while SAMtools focused on alignment processing and BCFtools on variant calling.

In practice, SAMtools depends on HTSlib for all file I/O, including the reading and writing behind sorting, merging, and indexing BAM files, and source distributions include a bundled copy of HTSlib for standalone builds. Similarly, BCFtools uses HTSlib for VCF/BCF manipulation, including conversion between text and binary formats, enabling it to handle large-scale datasets. This shared dependency minimizes code duplication and allows updates such as in-file indexing (introduced in version 1.10) to propagate to both SAMtools and BCFtools.

A primary example of integration occurs in variant calling pipelines, where SAMtools prepares aligned BAM files that serve as input to the bcftools mpileup command, which generates pileups and genotype likelihoods using HTSlib for efficient data access. For instance, after alignment and sorting with samtools sort and samtools index, the pipeline proceeds to bcftools mpileup -f ref.fa input.bam | bcftools call -mv -Ob -o calls.bcf, producing a compressed BCF file of variants; this one-liner combines pileup generation and calling, and can use HTSlib's threading for speed on multi-core systems.
Such pipelines, common in whole-genome sequencing (WGS) and whole-exome sequencing (WES) analysis, demonstrate how the tools interoperate: SAMtools handles preprocessing and indexing of alignments, while BCFtools performs downstream variant detection and filtering, both built on HTSlib's format handling. Historically, BCFtools evolved from the variant-calling components of SAMtools (e.g., the original mpileup in SAMtools 0.1.9), becoming independent in 2014 to better support multi-sample calling and gVCF output. This modular design has enabled the ecosystem's growth, with ongoing development supporting high-impact applications in research while maintaining low memory usage and platform independence. As of mid-2025, the latest releases are SAMtools 1.22.1, BCFtools 1.22, and HTSlib 1.22.1, continuing to support evolving sequencing technologies and pipelines.

With Other Bioinformatics Tools

SAMtools is frequently integrated into next-generation sequencing (NGS) workflows alongside aligners such as BWA: BWA generates initial SAM output from read alignments to a reference genome, and SAMtools subsequently converts it to compressed BAM format, sorts it, and generates indices for efficient random access. This pairing is standard in pipelines because the BAM files produced by SAMtools are directly usable by downstream tools.

In variant calling pipelines, SAMtools pairs with the Genome Analysis Toolkit (GATK) by providing pre-processed BAM files (sorted, indexed, and filtered for mapping quality) that serve as input for GATK's HaplotypeCaller or other callers, enabling accurate identification of single nucleotide variants and indels. For instance, after alignment with BWA and duplicate marking, the SAMtools view and index commands prepare the files that GATK requires for base quality score recalibration and joint genotyping across samples. This combination has become a cornerstone of best practices for high-confidence variant detection in clinical and research sequencing.

SAMtools also complements Picard tools for quality control and duplicate handling. While Picard's MarkDuplicates identifies and tags PCR or optical duplicates in BAM files, SAMtools can preprocess or postprocess these files by sorting (samtools sort) or fixing mate pairs (samtools fixmate), ensuring interoperability in pipelines that remove duplicates to reduce bias in downstream analyses. Although newer SAMtools versions include a markdup command as an alternative, Picard is still commonly used in GATK-centric workflows for its handling of read groups and its metrics reporting.

For visualization, SAMtools generates indices (via samtools index) that enable loading of BAM or CRAM files into the Integrative Genomics Viewer (IGV), allowing interactive inspection of alignments, coverage, and variants without file conversion.
This integration supports exploratory analysis, where users can compare SAMtools-derived statistics (e.g., from samtools stats) with IGV's graphical views to validate pipeline outputs.

Beyond these, SAMtools interfaces with tools like BEDTools for intersection-based operations on alignments and regions: indexed BAM files from SAMtools feed into BEDTools' bamToBed or coverage calculations, supporting tasks such as integrating peak calls in ChIP-seq workflows. Overall, SAMtools' standardized formats and Unix-pipe compatibility make it a foundational component in modular pipelines, often scripted with Nextflow or Snakemake for reproducible integration across diverse tools.
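
Converting alignments to BED intervals, as tools like bamToBed do, requires turning the 1-based SAM position and the CIGAR string into a 0-based, half-open interval. The sketch below shows that calculation under simplifying assumptions. It is not the BEDTools implementation, its output tuple is a reduced form of a BED record, and the function names are hypothetical.

```python
import re
from typing import Tuple

CIGAR_FULL = re.compile(r"(\d+[MIDNSHP=X])+")
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")   # operations that advance along the reference


def reference_span(cigar: str) -> int:
    """Number of reference bases covered by an alignment's CIGAR string."""
    if cigar == "*":
        return 0
    if not CIGAR_FULL.fullmatch(cigar):
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)


def sam_to_bed(rname: str, pos: int, cigar: str,
               qname: str = ".", flag: int = 0) -> Tuple[str, int, int, str, str]:
    """Convert core SAM fields to (chrom, start, end, name, strand)."""
    start = pos - 1                        # SAM POS is 1-based
    end = start + reference_span(cigar)    # BED end is exclusive
    strand = "-" if flag & 0x10 else "+"   # 0x10 marks reverse-strand reads
    return rname, start, end, qname, strand


if __name__ == "__main__":
    print(reference_span("5S90M2D5M3I10M"))
    print(sam_to_bed("chr1", 100, "50M", "read1", 16))
```

Soft clips and insertions consume read bases but not reference bases, which is why they are left out of the span.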

