CAVEman
from Wikipedia

CAVEman is a 4D high-resolution model of a functioning human developed by the University of Calgary. It resides in a cube-shaped virtual reality room resembling a cave, also known as the "research holodeck", in which the human model floats in space, projected from three walls and the floor below.[1] The model is intended for medical research and modelling, where the effects of medical phenomena with genetic components, such as cancer, diabetes and Alzheimer's disease, can be studied virtually. The high-resolution hologram can be scaled up and down in size to study processes at the micro level.[2]

from Grokipedia
CaVEMan (Cancer Variants through Expectation Maximization) is a bioinformatics software algorithm designed to detect somatic single nucleotide variants (SNVs) in next-generation sequencing data from paired tumor-normal samples in cancer genomics. Developed in the early 2010s by the Cancer Genome Project at the Wellcome Trust Sanger Institute, it employs an expectation maximization (EM) statistical framework to probabilistically identify substitutions by comparing tumor alignments against matched normal controls and a reference genome, while accounting for factors like sequencing artifacts, copy number variations, and normal cell contamination. Originally implemented in Java and later rewritten in C for improved performance, CaVEMan has been integral to large-scale initiatives such as the International Cancer Genome Consortium (ICGC) Pan-Cancer Analysis of Whole Genomes (PCAWG) project and The Cancer Genome Atlas (TCGA), enabling the analysis of thousands of cancer genomes to catalog driver mutations and mutational processes.

The tool processes whole-genome sequencing (WGS), whole-exome sequencing (WES), or targeted data in BAM or CRAM formats and writes variant calls to VCF files with genotype probabilities, typically yielding around 5,000 high-confidence somatic SNVs per 30x coverage tumor-normal pair after post-hoc filtering to enhance specificity. Its workflow divides the genome into manageable regions for parallel processing on compute clusters, generates covariate profiles per region, merges them genome-wide, and then estimates variant probabilities using priors for somatic mutation rates (e.g., 6 × 10⁻⁶) and germline polymorphisms. It integrates with companion tools such as cgpCaVEManWrapper for simplified execution and cgpCaVEManPostProcessing for post-processing filters (e.g., for repeats, high-depth artifacts, and germline indels), achieving high recall and positive predictive value in benchmarking against validated datasets.

Although optimized for cancer applications, its comparative approach can detect variants relative to any control sample; it is best suited to matched pairs, where somatic events can be distinguished from germline ones. CaVEMan is open-source and available via GitHub under the cancerit organization, and it requires dependencies like HTSlib and Samtools. Its runtime scales for high-throughput environments, but it has been deprecated in some pipelines, such as the Genomic Data Commons (GDC), in favor of newer callers. Its contributions have advanced understanding of cancer mutational landscapes, as evidenced in landmark studies sequencing the first complete cancer genomes.

Overview

Description

CaVEMan (Cancer Variants through Expectation Maximization) is a bioinformatics software algorithm for detecting somatic single nucleotide variants (SNVs) in next-generation sequencing (NGS) data from paired tumor-normal samples, primarily in cancer genomics. Developed by the Cancer Genome Project at the Wellcome Trust Sanger Institute, it uses an expectation maximization (EM) statistical framework to probabilistically model and identify substitutions by comparing tumor alignments to matched normal controls and a reference genome, while accounting for sequencing artifacts, copy number variations, and normal cell contamination. Originally implemented in Java and later rewritten in C for enhanced performance, CaVEMan processes whole-genome sequencing (WGS), whole-exome sequencing (WES), or targeted data in BAM or CRAM formats. It divides the genome into parallelizable regions, generates covariate profiles (e.g., base quality, mapping quality), and estimates variant probabilities using priors for somatic mutation rates (default 6 × 10⁻⁶) and germline polymorphisms. Outputs are in VCF format with genotype probabilities, typically yielding around 5,000 high-confidence somatic SNVs per 30x coverage tumor-normal pair after post-processing filters for specificity. The tool integrates with wrappers like cgpCaVEManWrapper for execution on compute clusters and post-processing scripts (e.g., cgpCaVEManPostProcessing) to flag artifacts in repeats, high-depth regions, and germline indels.
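Because the final calls arrive as VCF records flagged by post-processing, a common first step downstream is to keep only the records that passed every filter. The following is a minimal sketch of that step, assuming standard tab-separated VCF columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO); the function name and the filter label `HYPOTHETICAL_FLAG` are illustrative and are not taken from CaVEMan or its post-processing scripts.

```python
def passing_snvs(vcf_lines):
    """Yield (chrom, pos, ref, alt) for single-base substitutions with FILTER == PASS."""
    for line in vcf_lines:
        # Skip blank lines, meta-information lines (##) and the header line (#CHROM).
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt, _qual, flt = fields[:7]
        # Keep only unflagged single-nucleotide substitutions.
        if flt == "PASS" and len(ref) == 1 and len(alt) == 1:
            yield chrom, int(pos), ref, alt


# Made-up records for illustration; HYPOTHETICAL_FLAG is not a real filter name.
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t12345\t.\tC\tT\t.\tPASS\t.",
    "1\t22222\t.\tG\tA\t.\tHYPOTHETICAL_FLAG\t.",
]
calls = list(passing_snvs(example))
```

In practice a dedicated VCF library would be preferable to manual splitting; the sketch only shows where the filter flags sit in each record.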

Purpose and Applications

CaVEMan's primary purpose is to enable accurate detection of somatic SNVs in cancer samples with high recall and positive predictive value, distinguishing true mutations from artifacts and germline variants in large-scale sequencing data. It supports probabilistic genotyping that incorporates tumor purity, copy number, and contamination estimates, making it suitable for hypermutated or noisy samples. While optimized for matched tumor-normal pairs, it can compare any test sample to a control, though without a matched normal the calls may include germline events.

Key applications include its role in major initiatives like the International Cancer Genome Consortium (ICGC) Pan-Cancer Analysis of Whole Genomes (PCAWG) project and The Cancer Genome Atlas (TCGA), where it analyzed thousands of cancer genomes to identify driver mutations and mutational signatures. For instance, in the ICGC PanCancer effort, it screened more than 2,500 WGS tumor-normal pairs. The tool's outputs facilitate downstream annotation (e.g., via VAGrENT) and integration with companion callers like cgpPindel for indels or ASCAT for copy number analysis. Benchmarking against validated datasets shows 90-95% specificity post-filtering. Open-source under the cancerit GitHub organization, CaVEMan requires dependencies like HTSlib and Samtools, with scalable runtime (e.g., ~3,500 CPU hours for 30x WGS on 32 cores), though it has been deprecated in some pipelines like the Genomic Data Commons in favor of newer tools. Its contributions have advanced cataloging of cancer mutational processes in landmark studies.
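Recall and positive predictive value, the two benchmarking measures named above, are simple ratios over a validated call set. The short sketch below shows the arithmetic; the counts are invented for illustration and are not CaVEMan benchmark results.

```python
def recall_and_ppv(true_positives, false_positives, false_negatives):
    """Return (recall, positive predictive value) for a set of variant calls."""
    called = true_positives + false_positives   # everything the caller reported
    real = true_positives + false_negatives     # everything in the validated truth set
    recall = true_positives / real if real else 0.0
    ppv = true_positives / called if called else 0.0
    return recall, ppv


# Illustrative counts only: 90 correct calls, 10 spurious calls, 30 missed variants.
recall, ppv = recall_and_ppv(true_positives=90, false_positives=10, false_negatives=30)
```

With these numbers the caller recovers 90 of 120 true variants, and 90 of its 100 reported calls are correct.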

Development

History

Development of CaVEMan began at the Wellcome Trust Sanger Institute in the early 2010s, shortly after the advent of next-generation sequencing (NGS) technologies, to address the need for accurate somatic single nucleotide variant (SNV) detection in large-scale cancer genomics data. The algorithm was initially implemented in Java as an expectation-maximization (EM) framework that probabilistically models substitutions by comparing tumor and matched normal alignments against a reference genome, while accounting for factors such as base quality, mapping errors, and copy number variations. A key milestone was its later reimplementation in C to improve performance and scalability for whole-genome sequencing (WGS) data. Early applications included the cataloging of somatic mutations in a melanoma cell line. By 2015, CaVEMan was adapted for the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA) Pan-Cancer Analysis of Whole Genomes (PCAWG) project, involving the analysis of over 2,500 tumor-normal pairs. To facilitate easier execution and integration, cgpCaVEManWrapper, a Perl-based script for automated running on compute clusters, was developed in 2015 and published as a protocol in 2016. Funding for the project was provided by the Wellcome Trust, including grant 098051, supporting the Cancer Genome Project's efforts in high-throughput variant calling. Through the 2010s and into the 2020s, CaVEMan continued to evolve, with optimizations for GPU acceleration noted as of 2024, and it remains open-source on GitHub under the cancerit organization, though some pipelines like the Genomic Data Commons have transitioned to newer callers.

Key Contributors and Institutions

The development of CaVEMan was led by the Cancer Genome Project (CGP) at the Wellcome Trust Sanger Institute in Cambridge, United Kingdom, an interdisciplinary team of bioinformaticians, statisticians, and computational biologists focused on advancing cancer genomics through NGS analysis. Peter J. Campbell of the CGP provided foundational details on the EM algorithm's probabilistic modeling. Other major contributors included Keiran M. Raine, who integrated CaVEMan into pan-cancer pipelines, and David Jones, Helen Davies, Patrick S. Tarpey, Adam P. Butler, Jon W. Teague, and Serena Nik-Zainal, all from the CGP, who co-authored the wrapper protocol and post-processing tools. Pascal Costanza from ExaScience Lab Belgium (Intel Health & Life Sciences) assisted with the C implementation. Collaborations with international consortia like ICGC and TCGA were pivotal, enabling benchmarking and application to thousands of cancer genomes. The Sanger Institute's computational infrastructure supported parallel processing, making CaVEMan suitable for high-throughput environments and contributing to landmark studies on mutational processes and driver mutations in cancer.

Technical Components

Algorithm

CaVEMan employs an expectation maximization (EM) statistical framework to detect somatic single nucleotide variants (SNVs) by probabilistically modeling substitutions in tumor samples relative to matched normal controls and a reference genome. The algorithm accounts for sequencing artifacts, copy number variations (CNVs), and normal cell contamination through priors such as somatic mutation rates (default 6 × 10⁻⁶) and germline polymorphism rates (default 1 × 10⁻⁴). It processes alignments in BAM or CRAM format, generating coverage and probability profiles to estimate variant likelihoods, with outputs including VCF files containing genotype probabilities and post-filtered high-confidence calls (typically ~5,000 SNVs per 30x coverage tumor-normal pair).

The core EM process consists of a maximization phase (M-step) followed by an expectation phase (E-step): the M-step extracts read evidence from genomic chunks, while the E-step computes variant probabilities using Bayesian inference, incorporating reference bias (default 0.95) and contamination estimates (default 0.1). Coverage thresholds ensure reliability (minimum 1 read in tumor/normal), and optional inputs like CNV profiles adjust for segmental aneuploidy. This approach enables high recall and specificity, as benchmarked against validated datasets in large-scale projects like PCAWG.
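To make the role of the priors concrete, the sketch below computes a posterior over three hypotheses at one site (reference, germline heterozygous, somatic heterozygous) with Bayes' rule. It is a deliberately simplified illustration, not CaVEMan's actual likelihood model: it assumes a plain binomial model for alternate-allele read counts, a fixed per-base error rate of 0.01 (an assumption), diploid copy number, and it omits the reference bias parameter and the covariate profiles entirely. Only the prior values (6 × 10⁻⁶, 1 × 10⁻⁴) and the contamination default (0.1) come from the text above; every function and variable name is hypothetical.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of k alternate-allele reads out of n at allele fraction p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def site_posteriors(tum_alt, tum_depth, norm_alt, norm_depth,
                    prior_somatic=6e-6, prior_germline=1e-4,
                    contamination=0.1, error_rate=0.01):
    """Posterior over {reference, germline, somatic} for one site (toy model)."""
    # Expected alternate-allele fraction in (tumor, normal) under each hypothesis.
    somatic_tumor_af = (1 - contamination) * 0.5 + contamination * error_rate
    expected = {
        "reference": (error_rate, error_rate),
        "germline": (0.5, 0.5),
        "somatic": (somatic_tumor_af, error_rate),
    }
    priors = {
        "reference": 1.0 - prior_somatic - prior_germline,
        "germline": prior_germline,
        "somatic": prior_somatic,
    }
    # Unnormalized posterior = prior * likelihood(tumor) * likelihood(normal).
    joint = {
        name: priors[name]
        * binom_pmf(tum_alt, tum_depth, t_af)
        * binom_pmf(norm_alt, norm_depth, n_af)
        for name, (t_af, n_af) in expected.items()
    }
    total = sum(joint.values())
    return {name: value / total for name, value in joint.items()}


# 12 of 30 tumor reads support the variant and none of 30 normal reads do.
somatic_like = site_posteriors(tum_alt=12, tum_depth=30, norm_alt=0, norm_depth=30)
# The variant is present at roughly half the reads in both samples.
germline_like = site_posteriors(tum_alt=15, tum_depth=30, norm_alt=14, norm_depth=30)
```

Even with a somatic prior of only 6 × 10⁻⁶, strong tumor-only read support makes the somatic hypothesis dominate in the first example, while shared support in the normal sample shifts the posterior to the germline hypothesis in the second.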

Implementation

Originally developed in Java, CaVEMan was rewritten in C for performance gains; the codebase comprises ~96% C code, with Perl scripts (~2%) for utilities and shell scripts (~1%) for setup. It relies on dependencies including HTSlib for file I/O, Samtools for alignment handling, zlib (≥1.2.3.5) for compression, and linasm (≥1.13) for assembly operations. The codebase is modular, with core source in src/, utilities in scripts/, and tests in tests/, built via Makefile as detailed in INSTALL.TXT.

The pipeline is designed for parallel execution on compute clusters, dividing the genome into chunks based on read density (default max 50,000 bases per chunk) and processing via job indices (e.g., $LSB_JOBINDEX per chromosome). It supports five main steps invoked through binaries like bin/caveman <step> or wrappers such as cgpCaVEManWrapper: setup (generates INI config and alg_bean parameters), split (creates balanced segment lists excluding ignored regions), M-step (extracts coverage arrays per chunk), merge (concatenates genome-wide profiles), and E-step (final variant calling with VCF outputs). Runtime scales with coverage and cluster resources; the pipeline is optimized for whole-genome sequencing (WGS) but adaptable to WES or targeted panels. Companion tools like cgpCaVEManPostProcessing apply filters for repeats, artifacts, and indels. As of its last major update around 2016, it remains open-source under GNU AGPL v3 on GitHub.
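The idea behind the split step can be sketched as follows. This is a simplified illustration under stated assumptions rather than CaVEMan's own logic: it cuts a contig purely by base count (capped at the 50,000-base figure quoted above) and skips ignored intervals, whereas the text describes the real split as also balancing chunks by read density. The function name and the 1-based inclusive coordinate convention are choices made for this example.

```python
def split_contig(length, max_bases=50_000, ignored=()):
    """Return 1-based inclusive (start, end) chunks covering a contig,
    leaving out any ignored (start, end) intervals."""
    chunks = []
    cursor = 1
    # A sentinel interval just past the contig end flushes the final gap.
    for ig_start, ig_end in sorted(ignored) + [(length + 1, length + 1)]:
        # Chunk the stretch between the cursor and the next ignored interval.
        while cursor < ig_start:
            end = min(cursor + max_bases - 1, ig_start - 1)
            chunks.append((cursor, end))
            cursor = end + 1
        # Jump past the ignored interval (max() also handles overlapping intervals).
        cursor = max(cursor, ig_end + 1)
    return chunks


plain = split_contig(120_000)
masked = split_contig(120_000, ignored=[(40_001, 60_000)])
```

Each resulting chunk could then be handed to one job index, so that the per-chunk M-step runs independently before the merge.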

Research and Usage

CaVEMan has been widely used in cancer genomics research for detecting somatic single nucleotide variants (SNVs) in paired tumor-normal samples. Developed by the Cancer Genome Project at the Wellcome Trust Sanger Institute, it played a key role in early whole-genome sequencing studies of cancer, such as the analysis of 21 breast cancer genomes published in 2012, where it enabled the identification of mutational signatures and driver mutations. The tool's expectation maximization framework has facilitated high-throughput variant calling, contributing to catalogs of cancer driver genes and processes. In large-scale consortia, CaVEMan was integral to the International Cancer Genome Consortium's Pan-Cancer Analysis of Whole Genomes (PCAWG) project, which analyzed over 2,600 cancer genomes to characterize mutational patterns across 38 tumor types, and The Cancer Genome Atlas (TCGA), supporting the discovery of thousands of somatic variants per sample after filtering. It has been applied in diverse studies, including investigations of clonal expansions in morphologically normal tissues and progressive changes in colorectal cancer evolution, often integrated with tools like Pindel for indels and ASCAT for copy number variations to provide comprehensive genomic profiling.

Pipeline Integrations and Optimizations

CaVEMan is typically executed via the cgpCaVEManWrapper script, which simplifies setup, parallel processing on compute clusters, and post-hoc filtering for artifacts like repeats and high-depth regions, achieving high recall and specificity in benchmarks against validated datasets. The open-source implementation on GitHub supports BAM and CRAM formats via HTSlib, with dependencies like Samtools, and is designed for scalability in high-throughput environments. Recent advancements include GPU acceleration using NVIDIA Parabricks, which has reduced CaVEMan runtime by up to 10-fold at the Sanger Institute, enabling faster analysis of large cohorts for projects like the Cancer Dependency Map. As of 2024, while still used in specialized pipelines, it has been supplemented or replaced by newer callers in some platforms like the Genomic Data Commons due to evolving standards in variant detection. Its contributions continue to underpin understandings of cancer mutational landscapes in ongoing research.
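The parallel pattern that such a wrapper automates, independent per-chunk jobs whose results are merged into one genome-wide profile, can be pictured with a small sketch. It is an analogy only: real runs dispatch separate cluster jobs rather than threads, and the covariates counted here (pairs of base quality and mapping quality) and the function names `mstep` and `merge` are illustrative assumptions, not the tool's internal data structures.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def mstep(chunk_observations):
    """Count (base_quality, mapping_quality) observations within one chunk."""
    return Counter(chunk_observations)


def merge(profiles):
    """Sum the per-chunk counts into a single genome-wide profile."""
    total = Counter()
    for profile in profiles:
        total.update(profile)  # Counter.update adds counts rather than replacing them
    return total


# Two made-up chunks of (base_quality, mapping_quality) observations.
chunks = [
    [(30, 60), (30, 60), (20, 60)],
    [(30, 60), (20, 40)],
]

# Each chunk is processed independently, standing in for one cluster job per chunk.
with ThreadPoolExecutor(max_workers=2) as pool:
    per_chunk = list(pool.map(mstep, chunks))

profile = merge(per_chunk)
```

Because each chunk's counts depend only on its own reads, the per-chunk work can run in any order, and only the merge needs to see all results.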

Impact and Legacy

Scientific Contributions

CaVEMan has significantly advanced cancer genomics by enabling the detection of somatic single nucleotide variants (SNVs) in large-scale next-generation sequencing (NGS) datasets from paired tumor-normal samples. Its expectation maximization (EM) framework provides high-recall variant calling while accounting for sequencing artifacts, copy number variations, and contamination. Its probabilistic approach outputs genotype likelihoods in VCF format, typically identifying around 5,000 high-confidence somatic SNVs per 30x coverage whole-genome sequencing (WGS) pair after filtering, achieving 90-95% specificity in benchmarks.

The algorithm has been pivotal in major initiatives, including the International Cancer Genome Consortium (ICGC) Pan-Cancer Analysis of Whole Genomes (PCAWG) project, where it processed over 2,500 WGS tumor-normal pairs to catalog driver mutations and mutational signatures across 38 cancer types, as detailed in the 2020 Nature publication. It also supported The Cancer Genome Atlas (TCGA) and early efforts like the sequencing of the COLO-829 melanoma genome in 2010, contributing to the first comprehensive catalogs of somatic mutations. Integrated with companion tools such as ASCAT for copy number analysis and Pindel for indels, CaVEMan forms part of standardized pipelines for reproducible somatic calling.

Key publications include the 2016 protocol on cgpCaVEManWrapper, which simplifies its execution and has been cited over 200 times for practical NGS workflows, and benchmarking studies validating its performance against other callers like MuTect and Strelka. Open-sourced on GitHub under the cancerit organization, CaVEMan has influenced variant calling standards and tools in the field, with its reimplementation in C improving scalability for high-throughput computing environments.

Future Developments

While CaVEMan remains a core tool in cancer genomics pipelines as of 2024, adaptations are ongoing to accommodate newer sequencing technologies, aligners like BWA-MEM, and systematic artifacts from updated chemistries. Future enhancements may focus on integrating multi-sample calling for tumor evolution studies and improving efficiency for ultra-high coverage or long-read data, building on its role in projects like PCAWG. However, some pipelines, such as the Genomic Data Commons (GDC), have shifted to newer callers, indicating a need for continued evolution to maintain relevance in evolving NGS landscapes. No specific timelines for major updates are publicly detailed, but its open-source nature supports community-driven improvements.