Sequencing
from Wikipedia

In genetics and biochemistry, sequencing means to determine the primary structure (sometimes incorrectly called the primary sequence) of an unbranched biopolymer. Sequencing results in a symbolic linear depiction known as a sequence which succinctly summarizes much of the atomic-level structure of the sequenced molecule.

DNA sequencing


DNA sequencing is the process of determining the nucleotide order of a given DNA fragment. Most DNA sequencing to date has been performed using the chain termination method developed by Frederick Sanger, which relies on sequence-specific termination of a DNA synthesis reaction using modified nucleotide substrates. However, newer technologies such as pyrosequencing have gained an increasing share of the sequencing market, and more genome data are now produced by pyrosequencing than by Sanger sequencing. Pyrosequencing has enabled rapid genome sequencing: bacterial genomes can be sequenced in a single run at severalfold coverage, and the technique was also used to sequence the genome of James Watson.[1]

The sequence of DNA encodes the necessary information for living things to survive and reproduce. Determining the sequence is therefore useful in fundamental research into why and how organisms live, as well as in applied subjects. Because of the key importance DNA has to living things, knowledge of DNA sequences is useful in practically any area of biological research. For example, in medicine it can be used to identify, diagnose, and potentially develop treatments for genetic diseases. Similarly, research into pathogens may lead to treatments for contagious diseases. Biotechnology is a burgeoning discipline, with the potential for many useful products and services.

The Carlson curve is a term coined by The Economist [2] to describe the biotechnological analog of Moore's law, and is named after author Rob Carlson.[3] Carlson accurately predicted the doubling time of DNA sequencing technologies (measured by cost and performance) would be at least as fast as Moore's law.[4] Carlson curves illustrate the rapid (in some cases hyperexponential) decrease in cost and increase in performance of a variety of technologies, including DNA sequencing, DNA synthesis, and a range of physical and computational tools used in protein expression and in determining protein structures.

Sanger sequencing

Part of a radioactively labelled sequencing gel

In chain terminator sequencing (Sanger sequencing), extension is initiated at a specific site on the template DNA by using a short oligonucleotide 'primer' complementary to the template at that region. The oligonucleotide primer is extended using a DNA polymerase, an enzyme that replicates DNA. Included with the primer and DNA polymerase are the four deoxynucleotide bases (DNA building blocks), along with a low concentration of a chain-terminating nucleotide (most commonly a dideoxynucleotide). The dideoxynucleotides lack the OH group at both the 2' and 3' positions of the ribose, so once incorporated into a growing DNA strand they prevent any further elongation. In the classic protocol, four separate reaction vessels are used, each containing only one of the four dideoxyribonucleotides; random incorporation of the chain-terminating nucleotides by the DNA polymerase produces a series of related DNA fragments of different sizes, each terminating with the given dideoxyribonucleotide. The fragments are then size-separated by electrophoresis in a slab polyacrylamide gel or, more commonly now, in a narrow glass capillary filled with a viscous polymer.
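The logic of chain termination — random incorporation of a ddNTP producing a nested set of fragments whose lengths read out base positions — can be illustrated with a short simulation. The sketch below is a toy model under stated assumptions (an arbitrary template, a 5% termination chance per opportunity, and hypothetical function names), not a description of any real instrument's software.

```python
import random

def sanger_fragments(template, ddntp, p_terminate=0.05, n_molecules=10000):
    """Simulate chain-termination products for one ddNTP reaction vessel.

    Synthesis copies the template base by base; whenever the base being
    incorporated matches the chosen ddNTP, the chain terminates with
    probability p_terminate, leaving a fragment of that length.
    """
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    lengths = set()
    for _ in range(n_molecules):
        for position, base in enumerate(template, start=1):
            incorporated = complement[base]          # base added to the new strand
            if incorporated == ddntp and random.random() < p_terminate:
                lengths.add(position)                # fragment ends at this position
                break
    return sorted(lengths)

# Reading the four ladders together recovers the sequence of the synthesized strand,
# one base position per fragment length.
template = "ATGGCATCCTA"
for dd in "ACGT":
    print(dd, sanger_fragments(template, dd))
```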

View of the start of an example dye-terminator read

An alternative to labelling the primer is to label the terminators instead, commonly called 'dye-terminator sequencing'. The major advantage of this approach is that the complete set of sequencing reactions can be performed in a single reaction rather than the four needed with the labelled-primer approach. This is accomplished by labelling each of the dideoxynucleotide chain terminators with a separate fluorescent dye that fluoresces at a different wavelength. The method is easier and quicker than the dye-primer approach, but may produce more uneven data peaks (different heights) owing to template-dependent differences in the incorporation of the large dye-labelled chain terminators. This problem has been significantly reduced by the introduction of new enzymes and dyes that minimize incorporation variability. Dye-terminator sequencing is now used for the vast majority of sequencing reactions, as it is both simpler and cheaper, largely because the primers do not have to be separately labelled (a significant expense for a single-use custom primer), although this is less of a concern with frequently used 'universal' primers. This is changing rapidly due to the increasing cost-effectiveness of second- and third-generation systems from Illumina, 454, ABI, Helicos, and Dover.

Pyrosequencing


The pyrosequencing method is based on the detection of pyrophosphate release upon nucleotide incorporation. Before pyrosequencing, the DNA strand to be sequenced must be amplified by PCR. The order in which nucleotides will be dispensed into the sequencer is then chosen (e.g. G-A-T-C). When a specific nucleotide is dispensed and the DNA polymerase incorporates it into the growing chain, pyrophosphate is released and converted into ATP by ATP sulfurylase. The ATP then drives the luciferase-mediated conversion of luciferin to oxyluciferin, generating a light signal that is recorded as a pyrogram peak; in this way, nucleotide incorporation is coupled to a signal. The light signal is proportional to the number of nucleotides incorporated during synthesis of the DNA strand (two incorporated nucleotides produce a peak of twice the height). When a dispensed nucleotide is not incorporated, no signal is recorded, and the enzyme apyrase degrades any unincorporated nucleotides remaining in the reaction. This method requires neither fluorescently labelled nucleotides nor gel electrophoresis.

Pyrosequencing, which was developed by Pål Nyrén and Mostafa Ronaghi, has been commercialized by Biotage (for low-throughput sequencing) and 454 Life Sciences (for high-throughput sequencing). The latter platform sequences roughly 100 megabases [now up to 400 megabases] in a seven-hour run on a single machine. In the array-based method (commercialized by 454 Life Sciences), single-stranded DNA is annealed to beads and amplified by emulsion PCR (emPCR). The DNA-bound beads are then placed into wells on a fibre-optic chip along with enzymes that produce light in the presence of ATP. When free nucleotides are washed over the chip, light is produced as ATP is generated when nucleotides are incorporated opposite their complementary bases. Addition of one (or more) nucleotide(s) results in a reaction that generates a light signal recorded by the CCD camera in the instrument. The signal strength is proportional to the number of nucleotides incorporated in a single nucleotide flow, for example across homopolymer stretches.[1]
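Because the light signal scales with the number of bases incorporated per flow, a pyrogram can be sketched as a list of peak heights keyed to the dispensation order. The minimal simulation below assumes a toy template and the G-A-T-C dispensation mentioned above; names and parameters are illustrative only.

```python
def pyrogram(template, dispensation_order="GATC", cycles=4):
    """Simulate pyrosequencing peak heights (a minimal sketch).

    Nucleotides are flushed over the template in a fixed dispensation order;
    each flow incorporates as many bases as the next homopolymer run allows,
    and the light signal is proportional to the number incorporated.
    """
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    synth = "".join(complement[b] for b in template)   # strand actually synthesised
    peaks, pos = [], 0
    for _ in range(cycles):
        for nt in dispensation_order:
            run = 0
            while pos < len(synth) and synth[pos] == nt:
                run += 1
                pos += 1
            peaks.append((nt, run))   # run == relative peak height; 0 means no signal
    return peaks

# A homopolymer (two Gs in a row on the synthesised strand) gives a double-height peak.
print(pyrogram("TACCGGT"))
```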

True single molecule sequencing


Large-scale sequencing


While the sections above describe individual sequencing chemistries, related terms are used when a large portion of a genome is sequenced. Several platforms have been developed to perform exome sequencing (sequencing the subset of DNA across all chromosomes that encodes genes) or whole-genome sequencing (sequencing all of the nuclear DNA of a human).

RNA sequencing


RNA is less stable in the cell than DNA and is also more prone to nuclease attack experimentally. Because RNA is generated by transcription from DNA, the information is already present in the cell's DNA, but it is sometimes desirable to sequence RNA molecules directly. While sequencing DNA gives a genetic profile of an organism, sequencing RNA reflects only the sequences that are actively expressed in the cells. To sequence RNA, the usual approach is first to reverse transcribe the RNA extracted from the sample into cDNA fragments, which can then be sequenced as described above. The bulk of RNA expressed in cells consists of ribosomal RNAs and small RNAs, which are instrumental in cellular translation but are often not the focus of a study. This fraction can be removed in vitro to enrich for the messenger RNA that is usually of interest. Derived from exons, these mRNAs are later translated into the proteins that support particular cellular functions. The expression profile therefore indicates cellular activity, which is of particular interest in studies of disease, cellular behaviour, and responses to reagents or stimuli. Eukaryotic RNA molecules are not necessarily co-linear with their DNA template, because introns are excised; this adds complexity to mapping the read sequences back to the genome and thereby identifying their origin. For more information on the capabilities of next-generation sequencing applied to whole transcriptomes, see RNA-Seq and MicroRNA Sequencing.

Protein sequencing


Methods for performing protein sequencing include mass spectrometry and the Edman degradation reaction.

If the gene encoding the protein is known, it is currently much easier to sequence the DNA and infer the protein sequence. Determining part of a protein's amino-acid sequence (often one end) by one of the above methods may be sufficient to identify a clone carrying this gene.

Polysaccharide sequencing


Though polysaccharides are also biopolymers, it is not so common to talk of 'sequencing' a polysaccharide, for several reasons. Although many polysaccharides are linear, many have branches. Many different units (individual monosaccharides) can be used, and bonded in different ways. However, the main theoretical reason is that whereas the other polymers listed here are primarily generated in a 'template-dependent' manner by one processive enzyme, each individual join in a polysaccharide may be formed by a different enzyme. In many cases the assembly is not uniquely specified; depending on which enzyme acts, one of several different units may be incorporated. This can lead to a family of similar molecules being formed. This is particularly true for plant polysaccharides. Methods for the structure determination of oligosaccharides and polysaccharides include NMR spectroscopy and methylation analysis.[5]

from Grokipedia
Sequencing, in the context of molecular biology, refers to the laboratory process of determining the precise order of monomers in biomolecules, such as nucleotides—adenine (A), cytosine (C), guanine (G), and thymine (T) in DNA or uracil (U) in RNA—or amino acids in proteins, and extending to other molecules like polysaccharides. Nucleic acid sequencing, which decodes genetic information, is fundamental to understanding heredity, gene function, and evolutionary relationships, while protein sequencing aids in studying structure-function relationships. The development of sequencing techniques began in the 1970s, primarily for nucleic acids, with pioneering methods like the chemical degradation approach by Allan Maxam and Walter Gilbert in 1977, and the chain-termination method by Frederick Sanger, Alan Coulson, and colleagues, also in 1977, which became the gold standard for accurate, long-read sequencing of clonal DNA populations.

A major milestone was the Human Genome Project, launched in 1990 and completed in 2003, which sequenced the approximately 3 billion base pairs of the human genome using enhanced Sanger sequencing, paving the way for genomics as a field and reducing sequencing costs from billions to millions of dollars. In the 2000s, second-generation or next-generation sequencing (NGS) technologies emerged, such as those from 454 Life Sciences (2005) and Illumina (2007), enabling massively parallel sequencing of millions of short DNA fragments simultaneously, which dramatically lowered costs to under $1,000 per human genome by 2015 and expanded throughput for large-scale studies. Third-generation methods, including single-molecule real-time sequencing by Pacific Biosciences (PacBio, introduced 2010) and nanopore sequencing by Oxford Nanopore Technologies (commercialized 2014), further advanced the field by providing long-read capabilities—up to millions of base pairs as of 2025—without the need for fragmentation or amplification, improving accuracy in resolving complex genomic regions like repeats and structural variants.

Sequencing has transformative applications across medicine, research, and beyond. In healthcare, it supports diagnosing rare genetic diseases, identifying cancer mutations for targeted therapies, and enabling personalized medicine. In infectious disease, whole-genome sequencing tracks outbreaks, as seen in monitoring SARS-CoV-2 variants during the COVID-19 pandemic. Forensically, it aids in human identification via DNA profiling and kinship testing. In basic science, sequencing facilitates de novo genome assembly for non-model organisms, metagenomics of microbial communities, evolutionary studies by comparing sequences across species, and proteomic analysis for protein identification and modification mapping. Ongoing advancements, such as integration with machine learning and reductions in sequencing costs to approximately $200 for a human genome (or about 0.000000067 USD per base pair) as of 2025, continue to broaden its accessibility and impact.
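As a rough check on the quoted per-base figure, dividing the approximate cost of a whole human genome by its roughly 3 billion base pairs gives the same order of magnitude:

```latex
\frac{\$200}{3 \times 10^{9}\ \text{bp}} \approx 6.7 \times 10^{-8}\ \text{USD per base pair}
```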

Historical Development

Early Methods and Discoveries

The elucidation of the double-helical structure of DNA by James Watson and Francis Crick in 1953 provided the foundational understanding of genetic information storage, underscoring the need for methods to determine sequences directly. Pioneering work in biomolecular sequencing began with proteins, as British biochemist Frederick Sanger achieved the first complete sequence of a protein in 1955 by analyzing insulin, a 51-residue protein comprising two polypeptide chains linked by disulfide bonds. Sanger employed end-group labelling with fluorodinitrobenzene and partial acid hydrolysis to generate peptide fragments, followed by identification via chromatographic analysis and end-group determination, revealing the precise order of residues in insulin's A and B chains. This manual approach, which relied on painstaking fractionation of peptide fragments and two-dimensional chromatography for separation, demonstrated that proteins possess unique linear sequences dictating their function, earning Sanger the 1958 Nobel Prize in Chemistry.

Early efforts to sequence RNA emerged in the 1960s, focusing on smaller molecules like transfer RNAs (tRNAs) through enzymatic digestion and separation techniques. In 1965, Robert Holley and colleagues determined the complete 77-nucleotide sequence of yeast alanine tRNA using ribonuclease T1 for base-specific cleavage at guanosine residues, followed by snake venom phosphodiesterase for stepwise degradation from the 3' end, with fragments resolved via two-dimensional electrophoresis and chromatography. This breakthrough, which identified modified bases and cloverleaf secondary structure elements, marked the first full nucleic acid sequence and contributed to Holley's 1968 Nobel Prize in Physiology or Medicine. Subsequent work by Walter Fiers' group in the 1970s on bacteriophage MS2 RNA built on these methods, applying 32P-labeling and partial enzymatic hydrolysis to map oligonucleotides and culminating in the complete 3,569-nucleotide sequence in 1976, laying groundwork for viral genome sequencing.

The advent of DNA sequencing came in 1977 with the chemical cleavage method developed by Allan Maxam and Walter Gilbert, which enabled direct reading of nucleotide orders in DNA fragments up to about 200 bases long. The technique involves end-labeling DNA with 32P, followed by base-specific chemical modifications—such as dimethyl sulfate methylation of guanine at N7, which creates sites susceptible to piperidine-induced strand breakage—and resolution of the resulting fragments by polyacrylamide gel electrophoresis, allowing sequence reconstruction from band patterns. Hydrazine was used for pyrimidines (with high salt to selectively target cytosine), while dimethyl sulfate modified purines, providing lanes for G, A+G, C+T, and C reactions. This innovation, which earned Gilbert the 1980 Nobel Prize in Chemistry (shared with Sanger for his enzymatic method), transformed molecular biology by facilitating gene cloning and mapping.

Despite their groundbreaking impact, these early sequencing methods were highly labor-intensive, requiring meticulous chemical handling, radioactive labeling, and manual gel interpretation, which introduced errors from band misalignment or faint signals. Read lengths were limited to 100-200 bases due to resolution constraints in polyacrylamide gels, restricting applications to short DNA segments and necessitating overlapping clones for longer sequences. The reliance on hazardous reagents like hydrazine and dimethyl sulfate further complicated scalability, paving the way for enzymatic alternatives.

Key Milestones in Biomolecular Sequencing

The development of Sanger sequencing in 1977 by Frederick Sanger and colleagues marked a pivotal advancement in biomolecular sequencing, introducing chain-termination methods that enabled the determination of DNA nucleotide sequences with greater precision and efficiency than prior techniques. This method relied on dideoxynucleotides to halt DNA synthesis at specific points, producing fragments that could be separated by gel electrophoresis to reveal the sequence. By the 1980s, automation transformed its practicality through the adoption of fluorescently labeled dideoxynucleotides and automated fluorescence detection, allowing for simultaneous detection of all four bases in a single reaction and increasing throughput from manual slab gels to hundreds of bases per sample daily. These enhancements, pioneered by instruments like the Applied Biosystems 370A in 1987, laid the foundation for large-scale genomic projects by reducing labor and error rates.

The Human Genome Project (HGP), spanning 1990 to 2003, represented a monumental leap in scaling sequencing to genome-wide analysis, achieving approximately 99% coverage of the euchromatic human genome at an accuracy of over 99.99%. Costing roughly $3 billion, the international effort coordinated by the U.S. National Institutes of Health and Department of Energy together with international collaborators utilized hierarchical shotgun sequencing, which involved mapping large-insert clones like bacterial artificial chromosomes before fragmenting and assembling sequences to minimize gaps in repetitive regions. This structured approach produced a reference sequence of about 2.85 billion base pairs, catalyzing advancements in genomics by demonstrating the feasibility of whole-genome sequencing and enabling subsequent discoveries in disease-associated variants. The project's success highlighted the need for international collaboration and standardized data sharing, influencing global sequencing initiatives.

The advent of next-generation sequencing (NGS) in 2005 with the commercialization of 454 technology by 454 Life Sciences (later acquired by Roche in 2007) dramatically boosted throughput to around 20 million bases per run, enabling parallel analysis of millions of DNA fragments. This pyrosequencing-based method detected nucleotide incorporation via light emission from pyrophosphate release, facilitating de novo assembly of microbial genomes and metagenomic studies at a fraction of Sanger's cost per base. By shifting from Sanger's low-throughput capillary runs to massively parallel bead-based reactions, 454 sequencing expanded applications to environmental and clinical samples, setting the stage for routine high-volume data generation in the late 2000s.

Third-generation sequencing emerged in 2010 with Pacific Biosciences' (PacBio) Single Molecule Real-Time (SMRT) technology, which debuted in beta form and achieved full commercialization by 2011, offering real-time observation of single DNA polymerase molecules for continuous reads up to 10 kilobases without amplification biases. Using zero-mode waveguides to isolate individual molecules and phospholinked fluorescent nucleotides, SMRT sequencing provided longer contiguous reads ideal for resolving structural variants and repetitive elements that challenged shorter NGS reads. This innovation improved assembly accuracy for complex genomes, such as those with high heterozygosity, and supported direct detection of base modifications like methylation during sequencing.
In the 2020s, Oxford Nanopore Technologies advanced portable sequencing with devices like the MinION, a palm-sized instrument enabling real-time analysis in field settings such as outbreak surveillance and remote diagnostics, producing up to 48 gigabases of data per run through nanopore-based detection of ionic current changes as DNA translocates. By 2023, improvements in pore chemistry and basecalling algorithms yielded ultra-long reads exceeding 2.3 megabases (with records over 4 Mb), facilitating complete assembly of bacterial plasmids and human chromosomes without short-read scaffolding. These developments democratized sequencing by reducing equipment size and cost, supporting applications from infectious disease monitoring to biodiversity assessment in non-laboratory environments. As of 2025, further advancements include Roche's introduction of sequencing by expansion technology for enhanced throughput and Illumina's MiSeq i100 series for faster, lower-cost short-read sequencing, continuing to lower barriers to genomic analysis.

Fundamental Principles

Core Concepts and Techniques

Sequencing refers to the process of determining the precise linear order of monomeric units, such as nucleotides in nucleic acids or amino acids in proteins, within biopolymers to elucidate their primary structure and functional properties. This fundamental approach underpins the analysis of biomolecular sequences across diverse applications in biology and medicine, enabling the reconstruction of genetic and proteomic information from fragmented data.

Core techniques in sequencing vary by method. In amplification-based approaches, such as Sanger and next-generation sequencing, template preparation involves denaturation to separate double-stranded biopolymers into single strands and immobilization on a solid support to facilitate subsequent reactions. Primer annealing follows, where short oligonucleotide primers hybridize to complementary sequences on the template, providing a defined starting point for enzymatic extension. Chain elongation or termination then occurs, often driven by polymerase enzymes that synthesize complementary strands or incorporate chain-terminating analogs to generate fragments of varying lengths for analysis. In contrast, single-molecule methods like nanopore sequencing analyze native DNA or RNA strands without denaturation, primers, or amplification.

Signal detection in sequencing employs multiple modes to capture and interpret the sequence information. Optical methods rely on fluorescence or light emission, where labeled nucleotides emit detectable signals during incorporation, allowing real-time visualization of sequence progression. Electrical detection, such as in nanopore-based systems, measures changes in ionic current as molecules pass through a nanoscale pore, producing distinct electrical signatures for base identification. Mass-based approaches, commonly used in proteomic sequencing, involve fragmentation and measurement of mass-to-charge ratios to infer residue identities from spectral patterns.

Errors in sequencing arise from various sources, including difficulties in resolving homopolymer runs—consecutive identical monomers that can lead to insertion or deletion inaccuracies during base calling—and inherent base-calling inaccuracies due to signal noise or enzymatic inefficiencies. These issues are quantified using Phred quality scores (Q-scores), where a Q-score greater than 30 indicates a 99.9% probability of correct base calling, serving as a standard benchmark for sequencing reliability. Bioinformatics plays a crucial role in processing raw sequencing data, with assembly algorithms like de Bruijn graphs breaking reads into k-mers (short subsequences) and reconstructing the original sequence by finding Eulerian paths in the resulting graph, particularly effective for short-read data. Alignment tools based on principles like those in BLAST facilitate comparing query sequences against reference databases through local alignments, using scoring matrices to identify similarities and gaps.
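As a concrete illustration of the de Bruijn approach described above, the following sketch builds a graph of (k-1)-mers from error-free toy reads and walks unambiguous edges to recover a contig. It is a minimal, assumption-laden example (tiny k, no sequencing errors, no Eulerian-path handling of branches), not a production assembler.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges join consecutive k-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def walk_contig(graph, start):
    """Follow unambiguous (single-successor) edges from `start` to build one contig."""
    contig, node = start, start
    while len(graph.get(node, set())) == 1:
        node = next(iter(graph[node]))
        contig += node[-1]
    return contig

reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]   # error-free, overlapping toy reads
graph = de_bruijn_graph(reads, k=4)
print(walk_contig(graph, "ATG"))            # -> ATGGCGTGCAAT for this toy input
```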

Detection and Analysis Methods

Sequencing processes generate raw data in the form of reads, which are contiguous sequences of nucleotides or amino acids derived from the biomolecule. Short reads, typically ranging from 50 to 300 base pairs (bp), are produced by platforms like Illumina, offering high accuracy of approximately 99.9% due to their reliance on amplification and precise optical detection, but they limit the resolution of repetitive or complex regions. In contrast, long reads exceeding 1 kilobase (kb), such as those from Pacific Biosciences or Oxford Nanopore Technologies, enable spanning of structural variations and repeats, with per-base accuracy exceeding 99% (as of 2025), though computational error correction may still be applied for specific applications. These trade-offs influence downstream analysis, with short reads excelling in uniform coverage for variant detection and long reads facilitating de novo assembly of challenging genomes.

Base calling converts raw signals—such as fluorescence intensities or ionic current changes—into sequence data using algorithmic models tailored to the platform. For Illumina sequencing, neural network-based approaches in the Real-Time Analysis (RTA) software process image data to correct errors from overlapping clusters and phasing issues, achieving improved accuracy over traditional Bustard methods. In Oxford Nanopore sequencing, hidden Markov models (HMMs) were initially employed for segmenting current signals into events and assigning bases, evolving to support adaptive sampling where translocation is dynamically controlled based on preliminary base calls. These algorithms integrate probabilistic modeling to handle noise, with deep learning variants now predominant for both platforms to enhance speed and precision in real-time processing.

Quality control ensures the reliability of sequencing output through standardized metrics applied post-base calling. The Phred quality score quantifies base-call accuracy, defined as Q = -10 log10(P_error), where P_error is the estimated probability of an incorrect base call; scores above 30 indicate error rates below 0.1%, guiding read trimming and filtering. Coverage depth, the average number of reads overlapping a genomic position, is critical for variant detection; for the human genome, 30x depth provides sufficient redundancy to achieve over 99% sensitivity for single-nucleotide variants while minimizing false positives. Tools like FastQC visualize these metrics, flagging biases such as adapter contamination or GC imbalance to refine datasets before downstream analysis.

Variant calling identifies differences from a reference genome, leveraging alignment tools to detect polymorphisms. For single-nucleotide variants (SNVs), callers employ pileup-based methods to compute likelihoods from aligned reads, integrating Bayesian models for high-confidence calls in short-read data. Structural variants, including insertions, deletions, and inversions, benefit from long-read phasing, where haplotype-resolved assemblies disambiguate complex rearrangements that short reads often miss. Hybrid approaches combining short- and long-read data further enhance precision, as seen in tools like Sniffles for Nanopore-derived calls.

Genome assembly reconstructs the original sequence from overlapping reads, facing significant challenges in resolving repeats longer than read lengths. Tandem and interspersed repeats cause ambiguities in overlap-layout-consensus (OLC) or de Bruijn graph algorithms, leading to fragmented contigs or chimeric scaffolds.
Mate-pair libraries, with inserts spanning 2-10 kb, provide long-range linking information to scaffold contigs across repeats, improving contiguity in short-read assemblies. Optical mapping complements this by generating high-resolution restriction maps of long DNA molecules and aligning them to scaffolds for validation and repeat resolution without sequencing errors. These methods collectively address assembly gaps, though long-read technologies increasingly mitigate repeat issues through direct spanning.
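The Phred relationship quoted above is straightforward to compute directly; the small helper below converts between quality scores and error probabilities (a minimal sketch with hypothetical function names).

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred quality score Q into the estimated per-base error probability."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p_error):
    """Inverse transform: Q = -10 * log10(P_error)."""
    return -10 * math.log10(p_error)

for q in (10, 20, 30, 40):
    print(f"Q{q}: P(error) = {phred_to_error_prob(q):g}")
# Q30 -> 0.001, i.e. a 99.9% chance the base call is correct
print(error_prob_to_phred(0.001))  # -> 30.0
```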

Nucleic Acid Sequencing

DNA Sequencing Technologies

DNA sequencing technologies have evolved through distinct generations, each improving on throughput, accuracy, and read length to enable comprehensive genomic analysis. The first-generation method, developed by Frederick Sanger and colleagues in 1977, relies on chain termination using dideoxynucleotides (ddNTPs) that incorporate into growing DNA strands, halting synthesis at random positions corresponding to each base. These terminated fragments are separated by size via gel electrophoresis, allowing determination of the nucleotide sequence based on fragment lengths. This Sanger dideoxy method produces read lengths of 500-1000 base pairs (bp) and remains relevant for targeted validation, with costs around $0.005 per base as of 2025 due to its simplicity and low setup requirements.

Second-generation technologies, collectively known as next-generation sequencing (NGS), shifted to massively parallel sequencing of millions of DNA fragments, dramatically increasing throughput. Platforms like Illumina use bridge amplification on a flow cell to generate clusters of identical DNA molecules, followed by sequencing by synthesis with reversible terminator nucleotides that emit distinct fluorescent signals upon incorporation, enabling base-by-base detection via imaging. Ion Torrent systems, in contrast, detect pH changes from hydrogen ion release during nucleotide incorporation using semiconductor chips, avoiding optical components for faster processing. These methods achieve throughputs exceeding 100 gigabases (Gb) per run and error rates of 0.1-1%, with Illumina offering higher accuracy (around 0.1-0.8%) and Ion Torrent slightly higher variability (up to 1%). Such platforms address DNA-specific challenges like GC bias through optimized library preparation, facilitating high-coverage short-read assemblies.

Third-generation approaches focus on single-molecule sequencing to produce longer reads without amplification biases. Pacific Biosciences' Single-Molecule Real-Time (SMRT) sequencing immobilizes DNA polymerase molecules in zero-mode waveguides (ZMWs)—nanoscale wells that confine light—to observe phospholinked fluorescent nucleotides as they are added in real time, yielding continuous reads up to 20 kilobases (kb). Oxford Nanopore Technologies employs protein nanopores, such as the engineered CsgG variant, where DNA translocation through the pore generates characteristic ionic current blockades unique to each base, enabling portable, ultra-long reads exceeding 100 kb. These technologies capture structural variants and repetitive regions more effectively than short-read NGS, though initial error rates (5-15%) require computational correction.

Hybrid methods integrate short- and long-range information to enhance assembly and variant calling. For instance, 10x Genomics' linked-reads technology partitions long DNA molecules into droplets, barcoding subfragments before short-read sequencing, which preserves haplotype context for accurate phasing across megabases. This approach excels in resolving complex genomic regions like centromeres, combining the precision of second-generation accuracy with third-generation contiguity. As of 2025, trends emphasize error-corrected long reads, such as PacBio's circular consensus sequencing (CCS), where multiple passes over circularized DNA molecules generate high-fidelity (HiFi) reads with >99.9% accuracy and lengths of 10-20 kb. These advancements, alongside optimized workflows, have reduced whole-genome sequencing costs to approximately $200 as of 2025, democratizing access for population-scale studies and clinical diagnostics.
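The idea behind circular consensus (HiFi) reads — outvoting random errors by observing the same molecule repeatedly — can be illustrated with a simple per-position majority vote. The sketch below assumes the passes are already aligned and free of insertions or deletions, which real CCS pipelines do not assume; names and data are illustrative only.

```python
from collections import Counter

def ccs_consensus(passes):
    """Majority-vote consensus over multiple passes of the same circularised molecule.

    A toy stand-in for circular consensus sequencing: random errors in individual
    passes are outvoted when enough passes agree, which is why HiFi reads are far
    more accurate than any single pass.
    """
    length = min(len(p) for p in passes)
    return "".join(
        Counter(p[i] for p in passes).most_common(1)[0][0] for i in range(length)
    )

# Five noisy passes of the same (made-up) molecule, each with at most one error.
passes = ["ATGGTCA", "ATGCTCA", "ATGGTGA", "ATGGTCA", "TTGGTCA"]
print(ccs_consensus(passes))  # -> ATGGTCA
```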

RNA Sequencing Approaches

RNA sequencing (RNA-seq) techniques are designed to capture the transcriptome, accounting for RNA's inherent instability, post-transcriptional modifications, and diverse functional forms such as messenger RNA (mRNA), non-coding RNAs, and splicing variants. Unlike DNA sequencing, RNA-seq requires specific adaptations to handle RNA degradation, poly(A) tails, and the need for reverse transcription, enabling analysis of gene expression, alternative splicing, and RNA modifications. These methods have revolutionized transcriptomics by providing quantitative insights into cellular responses, disease states, and developmental processes.

Library preparation for RNA-seq begins with either enrichment of polyadenylated mRNA using oligo(dT) beads, which selectively captures mature mRNAs with poly(A) tails, or depletion of abundant ribosomal RNA (rRNA) from total RNA to focus on non-ribosomal transcripts. Oligo(dT)-based enrichment is particularly effective for eukaryotic mRNAs, reducing complexity and improving coverage of protein-coding genes, while rRNA depletion methods, such as those using hybridization probes, preserve non-polyadenylated RNAs like long non-coding RNAs. Following isolation, the RNA is reverse transcribed into complementary DNA (cDNA) using reverse transcriptase enzymes, often with random hexamer primers to generate full-length strands, which are then fragmented and adapter-ligated for sequencing. This step introduces potential biases, such as priming preferences or fragmentation inefficiencies, but optimized protocols minimize these to ensure representative libraries.

Bulk RNA-seq, commonly performed on Illumina platforms, measures average gene expression across a population of cells, providing high-depth coverage for detecting differentially expressed genes and isoforms. In this approach, poly(A)-enriched or rRNA-depleted libraries are sequenced to generate millions of short reads, which are aligned to a reference genome or transcriptome to quantify transcript abundance, enabling studies of gene expression in tissues or cell lines. For instance, Illumina's sequencing-by-synthesis chemistry supports paired-end reads, achieving sensitivities for low-abundance transcripts down to a few copies per cell.

Single-cell RNA-seq (scRNA-seq) extends bulk methods to individual cells, using droplet-based encapsulation techniques like Drop-seq to isolate and barcode transcripts from thousands of cells simultaneously. In Drop-seq, cells are co-encapsulated with barcoded beads in nanoliter droplets, where reverse transcription occurs, allowing pooled sequencing and computational demultiplexing to profile heterogeneity in expression profiles, such as in tumor microenvironments or developmental trajectories. This method captures ~1,000–5,000 genes per cell, revealing rare subpopulations but at the cost of higher noise due to low mRNA input.

Direct RNA-seq, exemplified by Oxford Nanopore Technologies' approach, sequences native RNA molecules without cDNA synthesis, preserving modifications like m6A and enabling full-length isoform detection. By passing poly(A)-tailed RNA through nanopores, this long-read method generates continuous reads up to tens of kilobases, bypassing reverse transcription biases and directly quantifying poly(A) tail lengths. Recent advancements, such as Oxford Nanopore's RNA004 chemistry (2024), have improved overall accuracy to approximately 93.5% as of 2025, with raw error rates around 5-10%, higher in homopolymer regions, though computational correction improves this. Direct RNA-seq facilitates studies of RNA structure and epitranscriptomics.

Transcript quantification in RNA-seq involves normalizing read counts to account for sequencing depth, gene length, and composition biases.
Fragments Per Kilobase of transcript per Million mapped reads (FPKM) normalizes for both length and library size, providing comparable expression values across samples, as introduced in early mammalian studies. Transcripts Per Million (TPM) improves on FPKM by normalizing for transcript length first and then scaling to a fixed total of one million, enhancing cross-sample comparability for isoform analysis. For differential expression, tools like DESeq2 model count data using a negative binomial distribution to estimate variance and fold changes, robustly handling dispersion across biological replicates.

Specialized variants address specific RNA classes and interactions. Small RNA-seq targets microRNAs (miRNAs) and other short non-coding RNAs (18–30 nucleotides) through size-selection during library preparation, often using adapters for 5' and 3' ends to capture miRNA biogenesis intermediates and quantify regulatory elements in pathways like oncogenesis. RNA immunoprecipitation sequencing (RIP-seq) isolates RNA bound to specific proteins via pull-down, followed by sequencing to map protein-RNA interactions genome-wide, as demonstrated in early studies of Polycomb repressive complex associations. This method reveals binding sites for RNA-binding proteins, aiding understanding of post-transcriptional regulation.

As of 2025, RNA-seq faces challenges in handling low-input samples, where limited starting material (e.g., <10 ng of RNA) amplifies technical noise and dropout events, particularly in single-cell or clinical biopsies, necessitating ultra-sensitive amplification strategies. Pseudogene discrimination remains difficult due to high sequence similarity with parental genes, leading to misassignment of reads during alignment and confounding expression estimates, though long-read methods improve resolution. Bulk RNA-seq costs have declined to approximately $100–200 per sample, including library preparation and 20–30 million reads, making it accessible for large cohorts but still prohibitive for routine low-input applications.
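For the normalization schemes described above, a minimal sketch of FPKM and TPM computed from raw counts and transcript lengths is shown below; the gene names, counts, and lengths are made-up toy values, and real pipelines derive these quantities from read alignments.

```python
def fpkm(counts, lengths_kb, total_reads_millions):
    """Fragments per kilobase of transcript per million mapped reads."""
    return {g: counts[g] / (lengths_kb[g] * total_reads_millions) for g in counts}

def tpm(counts, lengths_kb):
    """Transcripts per million: length-normalise first, then scale so values sum to 1e6."""
    rpk = {g: counts[g] / lengths_kb[g] for g in counts}
    scale = sum(rpk.values()) / 1e6
    return {g: v / scale for g, v in rpk.items()}

counts = {"geneA": 500, "geneB": 1500, "geneC": 8000}       # hypothetical read counts
lengths_kb = {"geneA": 1.0, "geneB": 3.0, "geneC": 2.0}     # hypothetical lengths (kb)
total_reads_millions = sum(counts.values()) / 1e6

print(fpkm(counts, lengths_kb, total_reads_millions))
print(tpm(counts, lengths_kb))   # TPM values sum to 1e6 across genes
```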

Protein Sequencing

Classical Chemical Methods

Classical chemical methods for protein sequencing, developed primarily in the mid-20th century, relied on targeted chemical reactions and enzymatic cleavages to determine the linear order of amino acids in polypeptides. These approaches, predating mass spectrometry, involved sequential degradation from the N-terminus or fragmentation into smaller peptides for individual analysis, often requiring purification of the protein sample to homogeneity. Key techniques included the Edman degradation for stepwise N-terminal sequencing and specific cleavage methods like cyanogen bromide treatment or tryptic digestion to generate overlapping peptides that could be pieced together to reconstruct the full sequence.

The cornerstone of these methods was the Edman degradation, introduced by Pehr Edman in 1950. This technique uses phenylisothiocyanate (PITC) to react with the free α-amino group of the N-terminal residue under mildly alkaline conditions, forming a phenylthiocarbamyl derivative. Subsequent acid treatment cleaves this derivative as a stable phenylthiohydantoin (PTH) amino acid, leaving the rest of the peptide intact for the next cycle. The process involves four main steps per cycle: coupling of PITC, washing to remove excess reagent, cleavage in anhydrous acid, and extraction followed by identification of the PTH-amino acid via chromatographic methods such as high-performance liquid chromatography (HPLC). Each cycle typically requires 1-2 days in manual implementations, allowing reliable sequencing of up to 50-60 residues before yield drops due to incomplete reactions.

To sequence longer proteins, fragmentation was essential to create manageable peptides. Cyanogen bromide (CNBr) cleavage, developed by Gross and Witkop in 1962, specifically targets methionine residues by converting the thioether side chain to a sulfonium salt under acidic conditions, leading to hydrolysis of the peptide bond on the carboxyl side of methionine. This produces peptides ending in homoserine lactone (from methionine) and is performed in formic acid or similar solvents, yielding fragments that can then be separated by gel filtration or HPLC for further Edman sequencing. Complementing this, tryptic digestion employs the enzyme trypsin, which selectively cleaves peptide bonds on the carboxyl side of lysine and arginine residues (unless followed by proline), generating a set of peptides suitable for mapping. These peptides are typically separated by HPLC or two-dimensional methods (electrophoresis followed by chromatography), allowing overlap analysis to assemble the full sequence.

Despite their precision, classical methods had significant limitations. The overall protein size was capped at around 100 residues, as longer sequences led to insurmountable losses during multiple degradation cycles or fragment assembly. Edman degradation struggles with post-translationally modified residues, which may block the N-terminus or alter PTH derivative formation, and requires a pure, unmodified sample without blocked termini. Fragmentation techniques like CNBr are limited by the number and position of methionines, while tryptic digestion can produce overly complex mixtures if the protein has many basic residues. These methods demand milligram quantities of protein and are labor-intensive, making them unsuitable for complex mixtures.

A landmark application occurred in the 1950s with the sequencing of human hemoglobin, which provided crucial insights into sickle cell anemia. Using tryptic digestion followed by peptide fingerprinting, Vernon Ingram identified in 1956 that a single amino acid substitution (glutamic acid to valine at position 6 of the β-chain) distinguished normal hemoglobin from the sickle variant, linking a molecular change to a genetic disease. Full sequencing efforts in the early 1960s, incorporating Edman degradation on tryptic and CNBr fragments, confirmed the 146-residue β-chain structure and facilitated understanding of hemoglobin's quaternary assembly.
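The tryptic cleavage rule cited above (cut after lysine or arginine unless the next residue is proline) is simple enough to sketch as an in-silico digest. The example below applies it to the first residues of the normal human β-globin chain; it is an illustrative toy, not a replacement for dedicated digestion tools.

```python
def trypsin_digest(sequence):
    """In-silico tryptic digest: cleave C-terminal to K or R, except before P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_res = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_res != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])   # trailing peptide with no cleavage site
    return peptides

# First 30 residues of normal human beta-globin; Ingram's sickle variant carries
# Val in place of Glu at position 6 of this chain.
print(trypsin_digest("VHLTPEEKSAVTALWGKVNVDEVGGEALGR"))
# -> ['VHLTPEEK', 'SAVTALWGK', 'VNVDEVGGEALGR']
```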

Modern Spectrometric Techniques

Modern spectrometric techniques for protein sequencing primarily rely on mass spectrometry (MS), which determines sequences by measuring the mass-to-charge ratios of ionized peptides or intact proteins. These methods enable high-throughput analysis of complex proteomes, distinguishing them from earlier chemical approaches by their ability to handle mixtures and post-translational modifications (PTMs) through fragmentation patterns. Bottom-up and top-down strategies represent the core paradigms, with liquid chromatography coupled to tandem MS (LC-MS/MS) as the standard workflow for separating and analyzing biomolecules.

In bottom-up proteomics, proteins are first enzymatically digested—typically with trypsin—to generate peptides of 5-20 residues, which are then separated by liquid chromatography and ionized for MS analysis. Tandem MS (MS/MS) fragments these peptides using collision-induced dissociation (CID), where precursor ions collide with inert gas to produce sequence-informative fragment ions (b- and y-ions) whose masses correspond to partial sequences. De novo sequencing can be performed by generating sequence tags—short, contiguous stretches inferred from fragment mass differences—allowing identification without prior database reliance, though database searching remains dominant for known proteomes. This approach excels in proteome-wide coverage, identifying thousands of proteins from a single sample.

Top-down sequencing analyzes intact proteins, typically ionized via electrospray ionization (ESI) to preserve native-like charge states, followed by fragmentation to read sequences directly from the full molecular ion. Electron-transfer dissociation (ETD) is particularly effective here, as it cleaves the peptide backbone while leaving labile PTMs (e.g., phosphorylation, glycosylation) intact, enabling precise localization of modifications that might be lost in CID. Read lengths typically reach up to 100 residues, sufficient for characterizing isoforms and modifications in proteins up to 50 kDa. This method provides comprehensive sequence coverage but requires higher-resolution instruments to resolve overlapping isotopic distributions in larger ions.

Key instruments include the Orbitrap, which achieves ultra-high mass resolution up to 500,000 full width at half maximum (FWHM) at m/z 200, enabling unambiguous assignment of fragment ions in complex spectra through Fourier transform detection. Quadrupole time-of-flight (Q-TOF) analyzers complement this with superior scan speeds—up to 100 Hz—facilitating rapid data acquisition for high-throughput applications like single-cell proteomics. Hybrid systems, such as Orbitrap-Q-ETD, integrate multiple fragmentation modes for versatile analysis.

Data analysis pipelines employ search engines like SEQUEST, which matches experimental MS/MS spectra to theoretical peptides from protein databases by scoring ion series matches, or Mascot, which uses probabilistic scoring to rank identifications based on peak intensity and mass accuracy. Post-search validation with Percolator applies machine learning to rescore matches, controlling the false discovery rate (FDR) to below 1% at the peptide and protein levels through semi-supervised target-decoy competition. These tools handle the stochastic nature of MS data, ensuring reliable identifications across diverse samples.

As of 2025, advances include single-molecule protein sequencing using nanopore devices, where engineered pores detect amino acid-specific current blockades during translocation, enabling direct reading of full-length proteins without digestion—demonstrated with multi-pass strategies for error correction and PTM detection.
Integration with protein structure prediction tools such as AlphaFold enhances interpretation by predicting 3D structures from MS-derived sequences, aiding de novo assembly and PTM site validation through structural constraints. These developments promise routine single-molecule resolution and structure-sequence synergy for proteoform analysis.
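The b- and y-ion series underlying CID-based peptide sequencing can be computed from standard monoisotopic residue masses; successive differences within a series recover residue masses and hence sequence tags. The sketch below uses a reduced mass table and a hypothetical peptide purely for illustration.

```python
# Monoisotopic residue masses (Da); water and proton masses for singly charged ions.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "I": 113.08406, "L": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "R": 156.10111,
}
WATER, PROTON = 18.01056, 1.00728

def b_y_ions(peptide):
    """Singly charged b- and y-ion m/z values for a peptide (CID-style fragments)."""
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        b_ions.append(round(sum(RESIDUE_MASS[r] for r in peptide[:i]) + PROTON, 3))
        y_ions.append(round(sum(RESIDUE_MASS[r] for r in peptide[i:]) + WATER + PROTON, 3))
    return b_ions, y_ions

b_series, y_series = b_y_ions("PEPTIDE")
print("b:", b_series)   # consecutive differences equal residue masses -> a sequence tag
print("y:", y_series)
```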

Other Biomolecular Sequencing

Polysaccharide Sequencing

Polysaccharide sequencing involves determining the sequence of monosaccharide units, their anomeric configurations, glycosidic linkages, and branching patterns in carbohydrate polymers, which are essential for understanding their biological roles in energy storage, structural support, and pathogen recognition. Unlike the linear structures common in nucleic acids, polysaccharides often exhibit extensive branching and heterogeneity, complicating analysis and requiring specialized techniques such as derivatization for mass spectrometry and multidimensional nuclear magnetic resonance (NMR) spectroscopy. These methods enable the elucidation of complex glycan structures, which are critical components of glycoproteins and glycolipids.

Monosaccharide composition analysis is a foundational step in polysaccharide sequencing, typically achieved through gas chromatography-mass spectrometry (GC-MS) following derivatization to alditol acetates. In this process, polysaccharides are hydrolyzed to release monosaccharides, which are then reduced with sodium borohydride to alditols and acetylated to form volatile alditol acetates for separation and identification by GC-MS. This method, originally developed in the 1960s, allows quantitative determination of neutral and amino sugars with high sensitivity, distinguishing between epimers like glucose and galactose based on retention times and mass spectra. For example, it has been widely applied to analyze plant cell wall polysaccharides, revealing compositions dominated by glucose alongside other neutral sugars such as xylose and arabinose.

Linkage analysis, which identifies the positions of glycosidic bonds, relies on permethylation followed by GC-MS of partially methylated alditol acetates (PMAAs). The polysaccharide is first permethylated using methyl iodide and a base such as sodium hydroxide in dimethyl sulfoxide, protecting free hydroxyl groups; subsequent acid hydrolysis, reduction, and acetylation yield PMAAs whose methylation patterns indicate linkage sites—e.g., a 2,3,4-tri-O-methyl glucose signals a 6-linked residue. This Hakomori-based approach, refined for glycans, provides detailed branching information and has been automated for high-throughput use, as seen in methods processing up to 96 samples. It is particularly effective for neutral polysaccharides but requires modifications for acidic sugars like uronic acids.

Advanced sequencing strategies combine enzymatic digestion with NMR spectroscopy to resolve full structures, especially for branched glycans. Exoglycosidases, such as α- and β-specific glycosidases, sequentially cleave terminal residues from oligosaccharides released by endoglycosidases like PNGase F, allowing linkage-specific mapping through iterative mass spectrometry or HPLC monitoring of digestion products. This bottom-up approach mirrors bottom-up protein sequencing workflows but accounts for glycan microheterogeneity. Complementarily, NMR techniques, including heteronuclear single quantum coherence (HSQC) spectra, provide atomic-level details on anomeric configurations via 1H-13C correlations in the anomeric region (δH 4.5-5.5 ppm, δC 90-105 ppm); for instance, one-bond C1–H1 coupling constants (1J(C1,H1)) below 168 Hz indicate β-anomers, while those above signify α-anomers, enabling unambiguous assignment in complex mixtures when combined with NOESY for through-space linkages. Isotopic labeling enhances sensitivity for low-abundance glycans. Emerging nanopore-based methods, such as glycan assembly sequencing via fragmentation signatures, offer promising avenues for direct, single-molecule analysis of complex glycoforms.
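The anomeric-assignment rule of thumb quoted above translates directly into a trivial classifier; the sketch below encodes only that heuristic (the threshold and function name are assumptions), and borderline values would still need NOESY or other corroboration.

```python
def anomeric_configuration(j_c1_h1_hz, threshold=168.0):
    """Classify a pyranose anomer from its one-bond C1-H1 coupling constant (Hz).

    Encodes the rule of thumb cited in the text: couplings below ~168 Hz are
    read as beta, those above as alpha.
    """
    return "beta" if j_c1_h1_hz < threshold else "alpha"

print(anomeric_configuration(162.0))  # -> beta
print(anomeric_configuration(172.0))  # -> alpha
```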
The primary challenges in polysaccharide sequencing stem from structural complexity and heterogeneity: branching can generate up to 10^6 possible isomers for a 20-residue glycan due to variable linkage positions (e.g., 1→2, 1→3, 1→4, 1→6) and anomeric configurations across multiple monosaccharide types, while natural glycans often exist as mixtures with varying chain lengths and modifications. This combinatorial diversity, exceeding that of proteins or nucleic acids, necessitates orthogonal methods for validation and limits high-throughput sequencing. Microheterogeneity in biological samples further complicates detection, often requiring enrichment or labeling to achieve sufficient resolution. Applications of polysaccharide sequencing are prominent in glycoprotein glycan mapping, where it informs therapeutic development, such as ensuring consistent glycosylation in monoclonal antibodies to optimize efficacy and reduce immunogenicity. For instance, N-glycan analysis of monoclonal antibodies reveals branching patterns that affect their effector functions. These analyses are increasingly accessible for biomedical research.

Lipid Sequencing

Lipid sequencing encompasses the structural elucidation of lipid molecules, including their fatty acyl chains, headgroups, and modifications, which is essential for understanding membrane dynamics, signaling pathways, and metabolic disorders. Unlike nucleic acid or protein sequencing, lipid sequencing relies heavily on mass spectrometry (MS) due to the chemical diversity and amphipathic nature of lipids, often integrating extraction, separation, and fragmentation techniques to map chain lengths, unsaturations, and positional isomers. This approach has evolved from classical chromatographic methods to high-throughput MS-based strategies, enabling the identification of hundreds to thousands of lipid species in complex biological samples.

A foundational step in lipid sequencing is the extraction of total lipids from biological tissues or cells, commonly achieved using the Bligh-Dyer method, which employs a chloroform-methanol-water mixture to disrupt membranes and partition lipids into an organic phase, yielding high recovery rates for polar and nonpolar lipids in under 10 minutes. Following extraction, thin-layer chromatography (TLC) fractionation separates lipid classes based on polarity, such as phospholipids from neutral lipids like triglycerides, allowing targeted downstream analysis without interference from abundant species. This combination ensures comprehensive coverage of the lipidome, with modifications like acid-catalyzed variants improving yields for bound lipids by 10-50% in diverse matrices.

Fatty acid profiling, a core component of lipid sequencing, begins with transesterification of extracted lipids to form fatty acid methyl esters (FAMEs) using methanolic HCl or base catalysis, converting acyl chains into volatile derivatives suitable for gas chromatography-mass spectrometry (GC-MS). In GC-MS analysis, electron ionization (EI) at 70 eV generates characteristic fragments; for instance, the McLafferty rearrangement produces a prominent fragment ion at m/z 74 for saturated and monounsaturated FAMEs, enabling quantification of chain length and unsaturation through retention times and mass spectra, with detection limits below 1 ng for common fatty acids like palmitic acid (16:0) and oleic acid (18:1). This method distinguishes positional isomers indirectly via chromatographic separation but requires complementary techniques for precise double-bond localization.

For more complex lipids like phospholipids, tandem MS (MS/MS) provides detailed sequencing by identifying headgroups and acyl compositions. In positive-ion mode electrospray ionization (ESI)-MS/MS, neutral loss scans detect specific fragments; for example, a loss of 184 Da corresponds to the phosphocholine headgroup in phosphatidylcholines (PCs), allowing class-specific profiling in mixtures without prior separation. Precursor ion scanning for m/z 184 further confirms PC and sphingomyelin species, while neutral losses of 141 Da identify phosphatidylethanolamines (PEs), achieving sub-femtomole sensitivity and resolving isobaric species through fragmentation patterns.

Advanced techniques enhance the precision of sequencing, particularly for unsaturation details. Ozone cleavage, integrated with online ozonolysis-ESI-MS, reacts selectively with carbon-carbon double bonds to produce diagnostic aldehyde fragments, pinpointing their positions in fatty acyl chains—for instance, distinguishing Δ9 from Δ12 unsaturations in linoleic acid (18:2) via m/z shifts in the cleaved products. Shotgun lipidomics bypasses chromatographic separation by direct infusion of crude extracts into ESI-MS/MS, using intrasource separation and multiplexed scans to quantify over 500 lipid species across classes in minutes, with high reproducibility (CV <10%) for absolute measurements via internal standards.
As of 2025, spatial lipidomics via matrix-assisted laser desorption/ionization imaging mass spectrometry (MALDI-IMS) represents a key trend, enabling mapping of lipid distributions at 5-10 μm resolution in tissues. This approach, often using MALDI-2 for enhanced ionization, resolves over 1,000 lipid species per sample, such as sulfatides and PCs in brain tissue sections, revealing demyelination patterns and metabolic gradients without extraction artifacts. Integration with ion mobility further separates isobars, supporting applications in clinical diagnostics such as tumor margin delineation.
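The diagnostic-fragment rules mentioned for positive-ion MS/MS of phospholipids can be sketched as a tiny classifier; the m/z values in the example are hypothetical illustrations of a PC-like and a PE-like transition, not reference data.

```python
def classify_phospholipid(precursor_mz, fragment_mz, tol=0.02):
    """Assign a phospholipid class from a single positive-ion MS/MS transition.

    A minimal sketch of the rules cited in the text: a product ion near m/z 184
    indicates the phosphocholine headgroup (PC/SM), while a neutral loss of
    141 Da indicates phosphatidylethanolamine (PE).
    """
    if abs(fragment_mz - 184.07) <= tol:
        return "phosphatidylcholine / sphingomyelin (phosphocholine ion, m/z 184)"
    if abs((precursor_mz - fragment_mz) - 141.02) <= tol:
        return "phosphatidylethanolamine (neutral loss of 141 Da)"
    return "unassigned"

# Hypothetical transitions: a PC-like precursor yielding the m/z 184 headgroup ion,
# and a PE-like precursor losing 141 Da.
print(classify_phospholipid(760.59, 184.07))
print(classify_phospholipid(718.54, 577.52))
```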

Large-Scale Sequencing Initiatives

Major Genomic Projects

The Human Genome Project (HGP), launched in 1990 and completed in April 2003, represented the first international effort to sequence the entire human genome, spanning approximately 3 billion base pairs (3 Gb). It employed a hierarchical shotgun sequencing approach using bacterial artificial chromosomes (BACs) for assembly, achieving a high-quality reference sequence that covered over 99% of the euchromatic regions. The project identified roughly 20,000–25,000 protein-coding genes, far fewer than initial estimates, and laid the foundation for subsequent initiatives like the Encyclopedia of DNA Elements (ENCODE) project, which mapped functional elements across the genome. Its impacts include accelerating discoveries in genetics, enabling personalized medicine, and establishing standards for large-scale genomic data management.

Building on the HGP, the 1000 Genomes Project (2008–2015) aimed to catalog human genetic variation by sequencing the genomes of 2,504 individuals from 26 populations across Africa, East Asia, South Asia, Europe, and the Americas. It utilized low-coverage whole-genome sequencing with Illumina platforms at an average depth of 6x, combined with targeted deep sequencing, to identify 88 million variants, including 84.7 million single-nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions, and 60,000 structural variants. This comprehensive variant map has facilitated population-genetics research, improved variant calling accuracy in clinical settings, and supported studies on disease susceptibility and evolutionary history.

The Cancer Genome Atlas (TCGA), initiated in 2006 and concluding primary data collection in 2018, profiled over 11,000 primary tumors and matched normal samples across 33 cancer types using integrated multi-omics approaches, including genomic sequencing, gene expression profiling, DNA methylation analysis, and copy-number profiling. This effort identified key driver mutations, such as those in TP53 and KRAS, and revealed molecular subtypes that inform targeted therapies, like BRAF inhibitors for melanoma. TCGA's open-access data portal has driven over 20,000 publications and transformed precision oncology by emphasizing tumor heterogeneity and therapeutic vulnerabilities.

Launched in November 2018, the Earth BioGenome Project (EBP) seeks to sequence, catalog, and characterize the genomes of all known eukaryotic species—estimated at 1.8 million—over a decade, with a focus on biodiversity conservation and understanding. As of November 2025, affiliated projects have sequenced over 3,000 genomes, leveraging long-read technologies like PacBio and Nanopore for improved assembly of complex regions; Phase I (2023-2028) targets approximately 10,000 genomes for family-level coverage, with Phase II (starting 2029) expanding these efforts. Early outcomes include enhanced phylogenetic insights and tools for monitoring biodiversity, though challenges remain in scaling to underrepresented taxa.

Major genomic projects incorporate ethical frameworks to balance scientific advancement with privacy protections, particularly through controlled access via the NIH's Database of Genotypes and Phenotypes (dbGaP), which requires institutional certification and data use agreements to prevent re-identification risks. These initiatives emphasize informed consent for broad data reuse, anonymization techniques, and tiered access levels to safeguard participant privacy while promoting global data sharing.

Metagenomic and Multi-Omics Efforts

Metagenomics involves the sequencing of genetic material directly from environmental samples, bypassing the need for culturing individual organisms, to capture the collective genomic diversity of microbial communities. Shotgun metagenomic sequencing, which randomly fragments and sequences all DNA in a sample, has been pivotal in characterizing complex microbiomes. For instance, the Human Microbiome Project (HMP) in 2012 characterized healthy human microbiomes using 649 metagenomic samples across 18 body sites from 242 healthy adults, revealing unprecedented functional diversity in the human-associated microbiota. A related effort, the MetaHIT project in 2010, generated a catalog of approximately 3.3 million non-redundant microbial genes from fecal metagenomes of 124 individuals. Assembly of these shotgun reads into contiguous sequences often employs efficient tools like MEGAHIT, a memory- and time-efficient assembler that handles large datasets by using succinct de Bruijn graphs, enabling the reconstruction of metagenome-assembled genomes (MAGs) from terabyte-scale data. Subsequent binning of assembled contigs into population-level genomes utilizes algorithms such as MetaBAT, which clusters sequences based on tetranucleotide frequencies and coverage profiles to recover high-quality MAGs from diverse microbial populations.

A complementary approach to shotgun metagenomics is amplicon sequencing, which targets hypervariable regions of the bacterial 16S rRNA gene for taxonomic profiling of microbial communities. This method amplifies specific regions, such as the V4 region using primers like 515F (5'-GTGCCAGCMGCCGCGGTAA-3') and 806R (5'-GGACTACHVGGGTWTCTAAT-3'), to generate amplicon sequence variants (ASVs) or operational taxonomic units (OTUs). OTUs are typically clustered at 97% sequence identity to approximate species-level resolution, allowing cost-effective surveys of bacterial diversity in samples ranging from environmental material to gut microbiomes. This technique has been widely adopted in projects assessing community composition, though it overlooks non-bacterial taxa and the functional genes captured by shotgun methods.

Environmental metagenomics initiatives have expanded to global scales, such as the Tara Oceans expedition, which from 2009–2013 sampled planktonic communities across the world's oceans and produced a catalog of approximately 40 million unique microbial genes, highlighting temperature as a key driver of microbial community structure and function. Long-read sequencing technologies, including PacBio, were employed in Tara to resolve complete genomes from uncultured marine microbes, improving assembly contiguity and enabling the discovery of novel biosynthetic gene clusters in rare taxa.

Multi-omics efforts integrate metagenomics with other omics layers, such as metatranscriptomics and metaproteomics, to elucidate dynamic interactions; the Integrative Human Microbiome Project (iHMP) applied this to study host-microbe dynamics in conditions like inflammatory bowel disease, generating longitudinal datasets that link genomic variations to gene expression and protein abundance profiles. Tools like MixOmics facilitate such integrations through multivariate statistical methods, including sparse partial least squares, to identify correlated features across layers and uncover microbial contributions to host phenotypes.

As of 2025, metagenomic and multi-omics research faces significant challenges, particularly in long-read sequencing for viral communities, where fragmented assemblies and low-abundance detection hinder complete virome characterization despite advances in Oxford Nanopore and PacBio platforms.
Computational demands are escalating, with metagenomic datasets in public repositories like the Sequence Read Archive exceeding 20 petabytes cumulatively and growing by approximately 1 petabyte of new data annually, necessitating scalable algorithms for assembly, taxonomic classification, and data integration to manage this deluge. These efforts underscore the need for hybrid short- and long-read strategies, supported by scalable computational methods, to advance understanding of uncultured microbial ecosystems.
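The 97%-identity OTU convention mentioned above can be illustrated with a toy greedy clustering over equal-length, pre-aligned reads; real amplicon tools additionally handle alignment gaps, chimeras, and abundance ordering, none of which this sketch attempts.

```python
def percent_identity(a, b):
    """Fraction of matching positions between two equal-length aligned sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def greedy_otu_clustering(seqs, threshold=0.97):
    """Greedy clustering: assign each read to the first centroid within the threshold."""
    centroids, clusters = [], []
    for seq in seqs:
        for i, centroid in enumerate(centroids):
            if percent_identity(seq, centroid) >= threshold:
                clusters[i].append(seq)
                break
        else:                              # no centroid close enough: start a new OTU
            centroids.append(seq)
            clusters.append([seq])
    return clusters

base = ("ACGTTGCA" * 13)[:100]             # toy 100-bp amplicon
near = base[:2] + "TG" + base[4:]          # 2 mismatches -> 98% identity, same OTU
far = base[:10] + "GGGGGGGG" + base[18:]   # 6 mismatches -> 94% identity, new OTU
print([len(c) for c in greedy_otu_clustering([base, near, far])])  # -> [2, 1]
```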

References

  1. https://www.sciencedirect.com/topics/neuroscience/edman-degradation