Coverage (genetics)
In genetics, coverage is one of several measures of the depth or completeness of DNA sequencing, and is more specifically expressed in any of the following terms:
- Sequence coverage (or depth) is the number of unique reads that include a given nucleotide in the reconstructed sequence.[1][2] Deep sequencing refers to the general concept of aiming for a high number of unique reads of each region of a sequence.[3]
- Physical coverage, the cumulative length of reads or read pairs expressed as a multiple of genome size.[4]
- Genomic coverage, the percentage of all base pairs or loci of the genome covered by sequencing.
Sequence coverage
Rationale
Even though the sequencing accuracy for each individual nucleotide is very high, the very large number of nucleotides in the genome means that if an individual genome is only sequenced once, there will be a significant number of sequencing errors. Furthermore, many positions in a genome contain rare single-nucleotide polymorphisms (SNPs). Hence, to distinguish between sequencing errors and true SNPs, it is necessary to increase the sequencing accuracy even further by sequencing individual genomes a large number of times.
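The error-versus-SNP argument can be made concrete with a simple binomial sketch (the function name, the 1% error rate, and the worst-case assumption that every error produces the same base are illustrative, not from the article):

```python
from math import comb

def prob_recurrent_error(depth, error_rate, min_agree):
    """Probability that at least min_agree of depth independent reads
    report the same wrong base at a position, assuming independent
    errors that always produce that particular base (a worst case)."""
    return sum(
        comb(depth, k) * error_rate**k * (1 - error_rate)**(depth - k)
        for k in range(min_agree, depth + 1)
    )

# With a 1% per-base error rate, one read miscalls a given base 1% of
# the time, but seeing the same miscall in 5 of 10 independent reads
# is astronomically rarer -- depth separates random errors from true
# SNPs, which appear consistently in the reads that cover them.
single = prob_recurrent_error(1, 0.01, 1)
repeated = prob_recurrent_error(10, 0.01, 5)
```

A true SNP, by contrast, is expected in roughly half (heterozygous) or all (homozygous) of the reads covering its position, so repeated observation discriminates the two cases.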
Ultra-deep sequencing
The term "ultra-deep" can sometimes also refer to higher coverage (>100-fold), which allows for detection of sequence variants in mixed populations.[5][6][7] In the extreme, error-corrected sequencing approaches such as Maximum-Depth Sequencing can make coverage of a given region approach the throughput of a sequencing machine, allowing coverages of >10^8.[8]
Transcriptome sequencing
Deep sequencing of transcriptomes, also known as RNA-Seq, provides both the sequence and frequency of RNA molecules that are present at any particular time in a specific cell type, tissue or organ.[9] Counting the number of mRNAs that are encoded by individual genes provides an indicator of protein-coding potential, a major contributor to phenotype.[10] Improving methods for RNA sequencing is an active area of research both in terms of experimental and computational methods.[11]
Calculation
The average coverage for a whole genome can be calculated from the length of the original genome (G), the number of reads (N), and the average read length (L) as C = L × N / G. For example, a hypothetical genome with 2,000 base pairs reconstructed from 8 reads with an average length of 500 nucleotides will have 2× redundancy. This parameter also enables one to estimate other quantities, such as the percentage of the genome covered by reads (sometimes also called breadth of coverage). A high coverage in shotgun sequencing is desired because it can overcome errors in base calling and assembly. The subject of DNA sequencing theory addresses the relationships of such quantities.[2]
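The calculation can be checked directly; a minimal sketch (the function name is ours):

```python
def average_coverage(genome_length, num_reads, avg_read_length):
    """Average sequence coverage: C = L * N / G."""
    return avg_read_length * num_reads / genome_length

# The worked example from the text: a 2,000 bp genome reconstructed
# from 8 reads averaging 500 nucleotides gives 2x redundancy.
c = average_coverage(2000, 8, 500)
```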
Physical coverage
Sometimes a distinction is made between sequence coverage and physical coverage. Whereas sequence coverage is the average number of times a base is read, physical coverage is the average number of times a base is read or spanned by mate-paired reads.[2][12][4]
Genomic coverage
In terms of genomic coverage and accuracy, whole genome sequencing can broadly be classified into either of the following:[13]
- A draft sequence, covering approximately 90% of the genome at approximately 99.9% accuracy
- A finished sequence, covering more than 95% of the genome at approximately 99.99% accuracy
Producing a truly high-quality finished sequence by this definition is very expensive. Thus, most human "whole genome sequencing" results are draft sequences (sometimes above and sometimes below the accuracy defined above).[13]
References
- ^ "Sequencing Coverage". illumina.com. Illumina education. Retrieved 2020-10-08.
- ^ a b c Sims, David; Sudbery, Ian; Ilott, Nicholas E.; Heger, Andreas; Ponting, Chris P. (2014). "Sequencing depth and coverage: key considerations in genomic analyses". Nature Reviews Genetics. 15 (2): 121–132. doi:10.1038/nrg3642. PMID 24434847. S2CID 13325739.
- ^ Mardis, Elaine R. (2008-09-01). "Next-Generation DNA Sequencing Methods". Annual Review of Genomics and Human Genetics. 9 (1): 387–402. doi:10.1146/annurev.genom.9.081307.164359. ISSN 1527-8204. PMID 18576944.
- ^ a b Ekblom, Robert; Wolf, Jochen B. W. (2014). "A field guide to whole-genome sequencing, assembly and annotation". Evolutionary Applications. 7 (9): 1026–42. Bibcode:2014EvApp...7.1026E. doi:10.1111/eva.12178. PMC 4231593. PMID 25553065.
- ^ Ajay SS, Parker SC, Abaan HO, Fajardo KV, Margulies EH (September 2011). "Accurate and comprehensive sequencing of personal genomes". Genome Res. 21 (9): 1498–505. doi:10.1101/gr.123638.111. PMC 3166834. PMID 21771779.
- ^ Mirebrahim, Hamid; Close, Timothy J.; Lonardi, Stefano (2015-06-15). "De novo meta-assembly of ultra-deep sequencing data". Bioinformatics. 31 (12): i9 – i16. doi:10.1093/bioinformatics/btv226. ISSN 1367-4803. PMC 4765875. PMID 26072514.
- ^ Beerenwinkel, Niko; Zagordi, Osvaldo (2011-11-01). "Ultra-deep sequencing for the analysis of viral populations". Current Opinion in Virology. 1 (5): 413–418. doi:10.1016/j.coviro.2011.07.008. PMID 22440844.
- ^ Jee, J.; Rasouly, A.; Shamovsky, I.; Akivis, Y.; Steinman, S.; Mishra, B.; Nudler, E. (2016). "Rates and mechanisms of bacterial mutagenesis from maximum-depth sequencing". Nature. 534 (7609): 693–696. Bibcode:2016Natur.534..693J. doi:10.1038/nature18313. PMC 4940094. PMID 27338792.
- ^ Malone, John H.; Oliver, Brian (2011-01-01). "Microarrays, deep sequencing and the true measure of the transcriptome". BMC Biology. 9: 34. doi:10.1186/1741-7007-9-34. ISSN 1741-7007. PMC 3104486. PMID 21627854.
- ^ Hampton M, Melvin RG, Kendall AH, Kirkpatrick BR, Peterson N, Andrews MT (2011). "Deep sequencing the transcriptome reveals seasonal adaptive mechanisms in a hibernating mammal". PLOS ONE. 6 (10): e27021. Bibcode:2011PLoSO...627021H. doi:10.1371/journal.pone.0027021. PMC 3203946. PMID 22046435.
- ^ Heyer EE, Ozadam H, Ricci EP, Cenik C, Moore MJ (2015). "An optimized kit-free method for making strand-specific deep sequencing libraries from RNA fragments". Nucleic Acids Res. 43 (1): e2. doi:10.1093/nar/gku1235. PMC 4288154. PMID 25505164.
- ^ Meyerson, M.; Gabriel, S.; Getz, G. (2010). "Advances in understanding cancer genomes through second-generation sequencing". Nature Reviews Genetics. 11 (10): 685–696. doi:10.1038/nrg2841. PMID 20847746. S2CID 2544266.
- ^ a b Kris A. Wetterstrand, M.S. "The Cost of Sequencing a Human Genome". National Human Genome Research Institute. Last updated: November 1, 2021
Sequence Coverage
Definition and Rationale
Sequence coverage, also known as sequencing depth, is defined as the average number of unique sequencing reads that align to and include a given nucleotide position in the reconstructed genome sequence.[1] This metric quantifies the redundancy with which each base is sampled during next-generation sequencing (NGS), ensuring that the data provides sufficient overlap to reconstruct the original sequence accurately.[5]
The rationale for achieving high sequence coverage stems from the need to minimize errors arising from random sampling in DNA sequencing, particularly in shotgun approaches where fragments are generated randomly and assembled based on overlaps.[6] A minimum coverage of 20-30x is typically required for de novo genome assembly to reduce gaps and assembly ambiguities, as lower depths increase the probability of missing regions or introducing errors in base calling.[2] Higher coverage, such as 30x or more, is essential for detecting rare variants like single nucleotide polymorphisms (SNPs) with high confidence, as it improves the statistical power to distinguish true variants from sequencing noise or low-frequency artifacts.[6]
Sequence coverage emerged as a critical concept in the 1990s alongside the development of shotgun sequencing strategies, which addressed the challenges of random fragmentation and Poisson-distributed read placement in large genomes. The foundational Lander-Waterman theoretical model, introduced in 1988, formalized these principles by predicting the expected coverage needed to achieve near-complete genome reconstruction while accounting for overlaps and contig formation.
In human genome projects following the completion of the Human Genome Project in 2003, 30x coverage became a standard benchmark for reliable base calling and variant identification in whole-genome sequencing, reflecting advances in NGS throughput that enabled deeper sampling without prohibitive costs.[1] This depth serves as the counterpart to genomic coverage (breadth) and physical coverage (read spanning), providing the necessary redundancy for accurate reconstruction.[6]
Ultra-deep Sequencing
Ultra-deep sequencing in genetics refers to targeted or whole-genome sequencing approaches that achieve coverage depths exceeding 100x, often reaching 500–1,000x or higher, to enable the detection of low-frequency genetic variants present at frequencies below 1%. This level of depth surpasses standard sequencing requirements by providing sufficient reads to distinguish true rare variants from sequencing noise, particularly in heterogeneous samples where mutant alleles may constitute only a small fraction of the total population.[7][8]
Key applications of ultra-deep sequencing include cancer genomics, where it facilitates the identification of subclonal mutations in tumors that drive disease progression and resistance to therapy. For instance, in low-purity or polyclonal tumors, depths of thousands of reads per base allow for precise profiling of somatic mutations that might otherwise be missed. Similarly, it is essential for viral quasispecies analysis, enabling the reconstruction of diverse viral populations within a host by capturing minor variants that contribute to evolution and adaptation. Somatic mutation detection in non-cancer contexts, such as aging or environmental exposures, also benefits from this approach to uncover low-abundance changes in genomic DNA.[7][9][10]
Despite its advantages, ultra-deep sequencing presents significant challenges, including heightened computational demands for aligning and analyzing vast numbers of reads, which can require specialized bioinformatics pipelines to manage data volumes exceeding billions of bases. Costs escalate with depth due to the need for more sequencing reagents and storage, making it less feasible for large-scale studies without optimization.
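Why such depths are needed for sub-1% variants can be sketched with a binomial sampling model (the depths, allele frequency, and supporting-read threshold here are illustrative, not from the text):

```python
from math import comb

def prob_detect(depth, allele_freq, min_reads):
    """Probability that a variant at the given allele frequency appears
    in at least min_reads of depth reads, under binomial sampling."""
    return sum(
        comb(depth, k) * allele_freq**k * (1 - allele_freq)**(depth - k)
        for k in range(min_reads, depth + 1)
    )

# Requiring at least 3 supporting reads for a 1% variant:
p_30x = prob_detect(30, 0.01, 3)      # rarely detected at standard depth
p_1000x = prob_detect(1000, 0.01, 3)  # almost always detected ultra-deep
```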
Additionally, error rates in base calling, typically around 0.1–0.3% per base in next-generation platforms, become more pronounced at extreme depths, necessitating advanced error-correction methods to avoid false positives in variant calling.[11][12][13]
Studies of SARS-CoV-2 intra-host evolution have employed ultra-deep sequencing to detect rare variants, providing insights into viral diversification within individual patients.[14]
Applications in Transcriptomics
In transcriptomics, sequence coverage adapts the concept of read depth from genomics to measure the number of sequencing reads aligning to specific transcripts, enabling precise quantification of gene expression levels and detection of alternative splicing variants. This depth directly influences the reliability of expression estimates, as higher coverage captures low-abundance transcripts that might otherwise be missed due to stochastic sampling in RNA-Seq. For instance, in bulk RNA-Seq, achieving sufficient depth ensures comprehensive transcriptome representation, with studies recommending 30 million reads for detecting over 90% of annotated genes in model organisms like chicken, a proxy for similar complexities in human transcriptomes.[15][16]
Key concepts include coverage uniformity, which assesses even distribution of reads across transcript bodies to minimize biases from 3' end enrichment or GC content, thereby supporting accurate normalization metrics such as transcripts per million (TPM) or fragments per kilobase per million (FPKM). Non-uniform coverage can skew these metrics, leading to unreliable inter-sample comparisons, so tools like RSeQC evaluate and correct for such biases during quality control. Saturation curves further guide depth requirements by plotting detected transcripts against increasing read numbers; for the human transcriptome, 30-50 million reads typically achieve saturation for most expressed genes, beyond which additional sequencing yields diminishing returns in expression accuracy.[16][17]
Specific applications leverage coverage for differential expression analysis, where adequate depth (e.g., 20-100 million reads) enhances statistical power to identify condition-specific changes, as insufficient coverage reduces detection of lowly expressed differentially regulated genes.
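The TPM normalization mentioned above can be written in a few lines (a minimal sketch; the counts and transcript lengths are hypothetical):

```python
def tpm(counts, lengths_kb):
    """Transcripts per million: divide read counts by transcript length
    in kilobases, then rescale so the values sum to one million."""
    rpk = [c / l for c, l in zip(counts, lengths_kb)]  # reads per kilobase
    scale = sum(rpk) / 1_000_000
    return [r / scale for r in rpk]

# Three hypothetical transcripts of 2.0, 1.0 and 0.5 kb: the length
# normalization keeps long transcripts from dominating raw counts.
vals = tpm([200, 300, 100], [2.0, 1.0, 0.5])
```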
In de novo transcriptome assembly, high coverage (e.g., 20-25 million read pairs) improves contig connectivity and completeness, with assemblies representing at least 80% of input reads considered optimal for reconstructing novel isoforms without a reference genome. Coverage also links transcriptomic profiles to phenotypic traits, such as in disease states, where RNA-Seq depth reveals dysregulated expression patterns associated with biomarkers in cancers or rare disorders, facilitating eQTL mapping to connect variants with clinical outcomes.[18][19][20]
In single-cell RNA-Seq, variable coverage per cell due to capture inefficiencies necessitates normalization techniques like empirical Bayes estimators to correct for dropout and Poisson noise, ensuring robust expression quantification across heterogeneous populations. Advancements in barcoding, such as SPLiT-seq (2018) and Smart-seq3 (2020), have improved depth efficiency by enabling scalable, low-cost profiling (e.g., $0.10-1 per cell) with reduced doublets and enhanced transcript recovery, allowing deeper insights into cellular states in disease contexts.[21][22]
Calculation and Formulas
The average sequence coverage, often denoted as depth C, is calculated as the total number of sequenced bases divided by the size of the reference genome. This is expressed by the formula C = N × L / G, where N is the number of reads, L is the average read length, and G is the haploid genome length in base pairs.[1][23]
To derive this, first compute the total bases sequenced as the sum of all read lengths, Σᵢ Lᵢ, which simplifies to N × L for uniform read lengths. Dividing by G yields the average depth, assuming reads are randomly distributed across the genome; this does not account for overlaps explicitly in the basic form but represents the expected multiplicity of coverage per position. In assembly contexts, overlaps are adjusted using probabilistic models, as the effective coverage influences contig formation and gap probabilities. The Lander-Waterman model provides the expected coverage distribution under random shotgun sequencing, where the probability that a base is uncovered is e^(−C), leading to an expected fraction of the genome covered (breadth) of 1 − e^(−C).[1][24]
Advanced calculations model coverage uniformity via a Poisson distribution with mean C, where the probability that a given position is covered by exactly k reads is P(X = k) = C^k e^(−C) / k!; this approximation holds for large G and random read placement, enabling predictions of regions with zero or high depth. Breadth-depth trade-offs arise because increasing C beyond ~5-10x yields diminishing returns in breadth (approaching 100% coverage) while enhancing depth for variant detection, balancing sequencing costs against analytical goals.[24][25]
For empirical computation from aligned reads (e.g., in BAM files), software like samtools calculates per-position depth using commands such as samtools depth to output coverage histograms or averages, aggregating aligned bases while handling overlaps and quality filters.
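Conceptually, the per-position depth such tools report can be computed from read alignment intervals with a difference array; a minimal pure-Python sketch (not the samtools implementation, and the toy intervals are ours):

```python
def per_position_depth(alignments, ref_length):
    """Depth at each reference position, given half-open (start, end)
    alignment intervals, via a difference array and a prefix sum."""
    diff = [0] * (ref_length + 1)
    for start, end in alignments:
        diff[start] += 1   # a read begins covering here
        diff[end] -= 1     # and stops covering here
    depth, running = [], 0
    for d in diff[:ref_length]:
        running += d
        depth.append(running)
    return depth

# Three reads aligned to a 10 bp reference; positions 3-4 are covered
# by all three reads, the ends by only one.
d = per_position_depth([(0, 5), (3, 8), (3, 10)], 10)
```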
As an illustrative example, sequencing 100 million reads of 100 bp each against a 3 Gb human genome yields C = (10^8 × 100) / (3 × 10^9) ≈ 3.3×, sufficient for basic assembly but low for uniform variant calling.[26]
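This worked example, together with the Lander-Waterman breadth estimate described in this section, can be checked numerically:

```python
from math import exp

reads = 100e6     # 100 million reads
read_len = 100    # 100 bp each
genome = 3e9      # 3 Gb reference

c = reads * read_len / genome   # average depth, about 3.33x
breadth = 1 - exp(-c)           # expected covered fraction, about 96%
# Under the Poisson model, a fraction e^-C of positions (about 3.6%
# here) is expected to receive no reads at all at this depth.
```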
