CAVEman
From Wikipedia
CAVEman is a 4D, high-resolution model of a functioning human developed at the University of Calgary. It resides in a cube-shaped virtual reality room resembling a cave, also known as the "research holodeck", in which the human model floats in space, projected from three walls and the floor below.[1] The model is intended for medical research and modelling, allowing the effects of medical phenomena with genetic components, such as cancer, diabetes, and Alzheimer's disease, to be studied virtually. The high-resolution hologram can be scaled up and down in size to study processes at the micro level.[2]
References
1. ^ "Human Body Holodecks - CAVEman 3-D Virtual Patient at University of Calgary (VIDEO)". Archived from the original on 2012-01-30. Retrieved 2013-07-24.
2. ^ "Home". Laser Focus World. Retrieved 2025-01-27.
CAVEman
From Grokipedia
Overview
Description
CaVEMan (Cancer Variants through Expectation Maximization) is a bioinformatics software algorithm for detecting somatic single nucleotide variants (SNVs) in next-generation sequencing (NGS) data from paired tumor-normal samples, used primarily in cancer genomics.[1] Developed by the Cancer Genome Project at the Wellcome Trust Sanger Institute, it uses an expectation maximization (EM) statistical framework to probabilistically model and identify substitutions by comparing tumor alignments to matched normal controls and a reference genome, while accounting for sequencing artifacts, copy number variations, and normal cell contamination.[1] Originally implemented in Java and later rewritten in C for enhanced performance, CaVEMan processes whole-genome sequencing (WGS), whole-exome sequencing (WES), or targeted data in BAM or CRAM format. It divides the genome into parallelizable regions, generates covariate profiles (e.g., base quality, mapping quality), and estimates variant probabilities using priors for somatic mutation rates (default 6 × 10⁻⁶) and germline polymorphisms. Output is in VCF format with genotype probabilities, typically yielding around 5,000 high-confidence somatic SNVs per 30x coverage tumor-normal pair after post-processing filters for specificity.[1] The tool integrates with wrappers such as cgpCaVEManWrapper for execution on compute clusters and with post-processing scripts (e.g., cgpCaVEManPostProcessing) that flag artifacts in repeats, high-depth regions, and germline indels.[1]
Purpose and Applications
CaVEMan's primary purpose is to enable accurate detection of somatic SNVs in cancer samples with high recall and positive predictive value, distinguishing true mutations from artifacts and germline variants in large-scale sequencing data. It supports probabilistic genotyping that incorporates tumor purity, copy number, and contamination estimates, making it suitable for hypermutated or noisy samples. While optimized for matched tumor-normal pairs, it can compare any test sample to a control, though this risks including germline events.[1] Key applications include its role in major initiatives such as the International Cancer Genome Consortium (ICGC) Pan-Cancer Analysis of Whole Genomes (PCAWG) project and The Cancer Genome Atlas (TCGA), where it analyzed thousands of cancer genomes to identify driver mutations and mutational signatures. For instance, in the ICGC PanCancer effort, it screened 2,500 WGS tumor-normal pairs. The tool's outputs facilitate downstream annotation (e.g., via VAGrENT) and integration with companion callers such as cgpPindel for indels or ASCAT for copy number analysis. Benchmarking against validated datasets shows 90-95% specificity post-filtering. Open-source under the cancerit GitHub organization, CaVEMan requires dependencies such as HTSlib and Samtools, with scalable runtime (e.g., ~3,500 CPU hours for 30x WGS on 32 cores), though it has been deprecated in some pipelines, such as the Genomic Data Commons, in favor of newer tools. Its contributions have advanced the cataloging of cancer mutational processes in landmark studies.[1][3]
Development
History
The development of CaVEMan began in the early 2010s at the Wellcome Trust Sanger Institute, shortly after the advent of next-generation sequencing (NGS) technologies, to address the need for accurate somatic single nucleotide variant (SNV) detection in large-scale cancer genomics data.[1] Initially implemented in Java, the algorithm was designed as an expectation-maximization (EM) framework to probabilistically model substitutions by comparing tumor and matched normal alignments against a reference genome, while accounting for artifacts such as base quality, mapping errors, and copy number variations.[1] A key milestone was the reimplementation in C around 2010 to enhance performance and scalability for processing whole-genome sequencing (WGS) data, enabling its use in early projects such as the cataloging of somatic mutations in a melanoma cell line.[1] By 2015, CaVEMan was adapted for the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA) Pan-Cancer Analysis of Whole Genomes (PCAWG) project, involving the analysis of over 2,500 tumor-normal pairs.[1] To facilitate easier execution and integration, the cgpCaVEManWrapper—a Perl-based script for automated running on compute clusters—was developed in 2015 and published as a protocol in 2016.[1] Funding for the project was provided by the Wellcome Trust, including grant 098051, supporting the Cancer Genome Project's efforts in high-throughput variant calling.[1] Through the 2010s and into the 2020s, CaVEMan continued to evolve, with optimizations for GPU acceleration noted as of 2024, and it remains open-source on GitHub under the cancerit organization, though some pipelines, such as the Genomic Data Commons, have transitioned to newer callers.[3][4]
Key Contributors and Institutions
The development of CaVEMan was led by the Cancer Genome Project (CGP) at the Wellcome Trust Sanger Institute in Cambridge, United Kingdom, an interdisciplinary team of bioinformaticians, statisticians, and computational biologists focused on advancing cancer genomics through NGS analysis.[1] Dr. Peter J. Campbell, a key mathematician in the CGP, provided foundational details on the EM algorithm's probabilistic modeling.[1] Other major contributors included Keiran M. Raine, who integrated CaVEMan into pan-cancer pipelines, and David Jones, Helen Davies, Patrick S. Tarpey, Adam P. Butler, Jon W. Teague, and Serena Nik-Zainal, all from the CGP, who co-authored the wrapper protocol and post-processing tools.[1] Pascal Costanza from ExaScience Lab Belgium (Intel Health & Life Sciences) assisted with the C implementation.[1] Collaborations with international consortia such as ICGC and TCGA were pivotal, enabling benchmarking and application to thousands of cancer genomes.[1] The Sanger Institute's computational infrastructure supported parallel processing, making CaVEMan suitable for high-throughput environments and contributing to landmark studies on mutational processes and driver mutations in cancer.[1]
Technical Components
Algorithm
CaVEMan employs an expectation maximization (EM) statistical framework to detect somatic single nucleotide variants (SNVs) by probabilistically modeling substitutions in tumor samples relative to matched normal controls and a reference genome. The algorithm accounts for sequencing artifacts, copy number variations (CNVs), and normal cell contamination through priors such as the somatic mutation rate (default 6 × 10⁻⁶) and the germline polymorphism rate (default 1 × 10⁻⁴). It processes alignments in BAM or CRAM format, generating coverage and probability profiles to estimate variant likelihoods, with outputs including VCF files containing genotype probabilities and post-filtered high-confidence calls (typically ~5,000 SNVs per 30x coverage tumor-normal pair).[1] The core EM process alternates maximization (M-step) and expectation (E-step) phases: the M-step extracts read evidence from genomic chunks, while the E-step computes variant probabilities by Bayesian inference, incorporating a reference bias (default 0.95) and contamination estimates (default 0.1). Coverage thresholds ensure reliability (minimum 1 read in both tumor and normal), and optional inputs such as CNV profiles adjust for segmental aneuploidy. This approach achieves high recall and specificity, as benchmarked against validated datasets in large-scale projects such as PCAWG.[3][1]
Implementation
Originally developed in Java, CaVEMan was rewritten in C for performance gains, comprising ~96% C code, with Perl scripts (~2%) for utilities and shell scripts (~1%) for setup. It relies on dependencies including HTSlib for file I/O, Samtools for alignment handling, zlib (≥1.2.3.5) for compression, and linasm (≥1.13) for assembly operations. The codebase is modular, with core sources in src/, utilities in scripts/, and tests in tests/, built via Makefile as detailed in INSTALL.TXT.[3]
The pipeline is designed for parallel execution on compute clusters, dividing the genome into chunks based on read density (default max 50,000 bases per chunk) and processing via job indices (e.g., $LSB_JOBINDEX per chromosome). It supports five main steps invoked through binaries like bin/caveman <step> or wrappers such as cgpCaVEManWrapper: setup (generates INI config and alg_bean parameters), split (creates balanced segment lists excluding ignored regions), M-step (extracts coverage arrays per chunk), merge (concatenates genome-wide profiles), and E-step (final variant calling with VCF outputs). Runtime scales with coverage and cluster resources, optimized for whole-genome sequencing (WGS) but adaptable to WES or targeted panels. Companion tools like cgpCaVEManPostProcessing apply filters for repeats, artifacts, and indels. As of its last major update around 2016, it remains open-source under GNU AGPL v3 on GitHub.[3][1]
