Binning (metagenomics)
In metagenomics, binning is the computational process of grouping assembled contigs and assigning them to their separate genomes of origin. Binning methods can be based on either compositional sequence features (such as GC-content or tetranucleotide frequencies) or sequence read mapping coverage across samples, or both.
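As a rough illustration of the compositional features mentioned above, the following minimal sketch computes GC-content and a tetranucleotide frequency vector for a single contig. The function names (`gc_content`, `tetranucleotide_frequencies`) are hypothetical and chosen for this example only; real binning tools typically also normalise for strand (reverse complements) and combine these features with read coverage.

```python
from collections import Counter
from itertools import product

# All 4**4 = 256 possible tetranucleotides, in a fixed (lexicographic) order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def gc_content(seq: str) -> float:
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def tetranucleotide_frequencies(seq: str) -> list[float]:
    """Relative frequency of each tetramer, using overlapping windows.

    Windows containing ambiguous bases (e.g. N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[t] for t in TETRAMERS)
    if total == 0:
        return [0.0] * len(TETRAMERS)
    return [counts[t] / total for t in TETRAMERS]


if __name__ == "__main__":
    contig = "ACGTACGTGGCCTTAA"  # toy sequence, not real data
    print(gc_content(contig))
    print(len(tetranucleotide_frequencies(contig)))
```

Each contig is thereby reduced to a fixed-length numeric vector, which is the form that clustering or classification methods generally expect.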
Metagenomic samples are typically environmental in origin and contain DNA from the whole community of microorganisms in the sample, so the sequencing data come from many unrelated organisms. For example, a single gram of soil can contain up to 18,000 different types of organisms, each with its own genome. Metagenomic assemblies are typically fragmented into many contigs, especially short-read assemblies, where repeats and integrative elements can be difficult to resolve. Binning therefore takes place after metagenomic assembly and is the effort to associate fragmented contigs with their genome of origin; each resulting bin is termed a metagenome-assembled genome (MAG). The taxonomy of MAGs can then be inferred by placing them in a reference phylogenetic tree using tools such as GTDB-Tk.
The first studies that sampled DNA from multiple organisms used specific genes to assess the diversity and origin of each sample. These marker genes had previously been sequenced from clonal cultures of known organisms, so whenever one of these genes appeared in a read or contig from the metagenomic sample, that read or contig could be assigned to a known species or to the OTU of that species. The problem with this method was that only a tiny fraction of the sequences carried a marker gene, leaving most of the data unassigned.
Modern binning techniques use both previously available information that is independent of the sample and intrinsic information present in the sample. Their degree of success varies with the diversity and complexity of the sample: in some cases they can resolve the sequences down to individual species, while in others the sequences can at best be assigned to very broad taxonomic groups.
Binning of metagenomic data from various habitats might significantly extend the tree of life. One such approach, applied to globally available metagenomes, binned 52,515 individual microbial genomes and extended the known diversity of bacteria and archaea by 44%.
Binning algorithms can employ prior information, and thus act as supervised classifiers, or they can try to find new groups, and thus act as unsupervised classifiers. Many do both. The classifiers exploit previously known sequences by performing alignments against databases, and try to separate sequences based on organism-specific characteristics of the DNA, such as GC-content.
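The difference between the two modes can be sketched on a single feature. In the minimal example below, the supervised function assigns a contig to the closest known reference by GC-content, while the unsupervised function groups contigs with similar GC-content without any reference. The function names, the reference labels `ref_a` and `ref_b`, the GC values, and the `max_diff` threshold are all hypothetical; GC-content alone is far too weak a signal for real binning, which would also draw on alignments, richer composition features, or coverage.

```python
def assign_supervised(contig_gc, reference_gc, max_diff=0.03):
    """Supervised: return the label of the reference with the closest
    GC-content, or None if no reference is within max_diff."""
    if not reference_gc:
        return None
    best = min(reference_gc, key=lambda name: abs(reference_gc[name] - contig_gc))
    if abs(reference_gc[best] - contig_gc) <= max_diff:
        return best
    return None


def cluster_unsupervised(gc_values, max_diff=0.03):
    """Unsupervised: greedy 1-D clustering. Contigs are sorted by GC-content
    and a new bin starts whenever the gap to the previous contig exceeds
    max_diff. Returns bins as lists of contig indices."""
    order = sorted(range(len(gc_values)), key=lambda i: gc_values[i])
    bins = []
    for i in order:
        if bins and gc_values[i] - gc_values[bins[-1][-1]] <= max_diff:
            bins[-1].append(i)
        else:
            bins.append([i])
    return bins


if __name__ == "__main__":
    references = {"ref_a": 0.50, "ref_b": 0.33}    # hypothetical reference genomes
    contigs_gc = [0.50, 0.33, 0.51, 0.34, 0.70]    # hypothetical contig GC values
    print([assign_supervised(gc, references) for gc in contigs_gc])
    print(cluster_unsupervised(contigs_gc))
```

In this toy input the last contig matches no reference and stays unassigned in the supervised mode, but still forms its own bin in the unsupervised mode, which is how novel groups can be found.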
Some prominent binning algorithms for metagenomic datasets obtained through shotgun sequencing include TETRA, MEGAN, PhyloPythia, SOrt-ITEMS, and DiScRIBinATE, among others.
TETRA is a statistical classifier that uses tetranucleotide usage patterns in genomic fragments. There are four possible nucleotides in DNA, so there are 4^4 = 256 different possible fragments of four consecutive nucleotides; these fragments are called tetramers. TETRA works by tabulating the frequency of each tetramer in a given sequence. From these frequencies, z-scores are calculated, which indicate how over- or under-represented each tetramer is compared with what would be expected from the individual nucleotide composition. The z-scores for all tetramers are assembled into a vector, and the vectors corresponding to different sequences are compared pairwise to yield a measure of how similar the sequences in the sample are. The most similar sequences are expected to belong to organisms in the same OTU.
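The procedure described above can be sketched as follows. This is a simplified illustration and not TETRA's actual implementation: it assumes that the expected count of a tetramer is derived from independent single-nucleotide frequencies, approximates the variance with a binomial model that ignores window overlap, and uses the Pearson correlation as the pairwise similarity measure. The function names are hypothetical.

```python
from itertools import product
from math import sqrt

TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]  # 256 tetramers


def tetramer_zscores(seq: str) -> list[float]:
    """Z-score of each tetramer's observed count against the count expected
    from single-nucleotide frequencies (assumed zero-order model)."""
    seq = seq.upper()
    windows = [seq[i:i + 4] for i in range(len(seq) - 3)]
    windows = [w for w in windows if set(w) <= set("ACGT")]  # skip ambiguous bases
    n = len(windows)
    if n == 0:
        return [0.0] * len(TETRAMERS)

    bases = [b for b in seq if b in "ACGT"]
    p = {b: bases.count(b) / len(bases) for b in "ACGT"}

    observed = {t: 0 for t in TETRAMERS}
    for w in windows:
        observed[w] += 1

    zscores = []
    for t in TETRAMERS:
        prob = p[t[0]] * p[t[1]] * p[t[2]] * p[t[3]]
        expected = n * prob
        variance = n * prob * (1.0 - prob)  # binomial approximation (assumption)
        if variance > 0:
            zscores.append((observed[t] - expected) / sqrt(variance))
        else:
            zscores.append(0.0)
    return zscores


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation between two equal-length vectors (0.0 if undefined)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


if __name__ == "__main__":
    # Toy sequences, not real data.
    a = tetramer_zscores("ACGTTGCAAGCTAGCTAGGATCCAACGTTGCA")
    b = tetramer_zscores("GGCCGGCCTTAAGGCCGCGCGGCCAATTGGCC")
    print(pearson(a, a), pearson(a, b))
```

A higher correlation between two z-score vectors suggests more similar tetranucleotide usage. Short fragments give noisy estimates, so comparisons of this kind are more meaningful for longer sequences.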