Hubbry Logo
search
logo
496875

Gene set enrichment analysis

logo
Community Hub0 Subscribers
Write something...
Be the first to start a discussion here.
Be the first to start a discussion here.
See all
Gene set enrichment analysis

Gene set enrichment analysis (GSEA) (also called functional enrichment analysis or pathway enrichment analysis) is a method to identify classes of genes or proteins that are over-represented in a large set of genes or proteins, and may have an association with different phenotypes (e.g. different organism growth patterns or diseases). The method uses statistical approaches to identify significantly enriched or depleted groups of genes. Transcriptomics technologies and proteomics results often identify thousands of genes, which are used for the analysis.

Researchers performing high-throughput experiments that yield sets of genes (for example, genes that are differentially expressed under different conditions) often want to retrieve a functional profile of that gene set, in order to better understand the underlying biological processes. This can be done by comparing the input gene set to each of the bins (terms) in the gene ontology – a statistical test can be performed for each bin to see if it is enriched for the input genes.

After the completion of the Human Genome Project, the problem of how to interpret and analyze it remained. In order to seek out genes associated with diseases, DNA microarrays were used to measure the amount of gene expression in different cells. Microarrays on thousands of different genes were carried out, and comparisons the results of two different cell categories, e.g. normal cells versus cancerous cells. However, this method of comparison is not sensitive enough to detect the subtle differences between the expression of individual genes, because diseases typically involve entire groups of genes. Multiple genes are linked to a single biological pathway, and so it is the additive change in expression within gene sets that leads to the difference in phenotypic expression. Gene Set Enrichment Analysis was developed to focus on the changes of expression in groups of a priori defined gene sets. By doing so, this method resolves the problem of the undetectable, small changes in the expression of single genes.

Gene set enrichment analysis uses a priori gene sets that have been grouped together by their involvement in the same biological pathway, or by proximal location on a chromosome. A database of these predefined sets can be found at the Molecular signatures database (MSigDB). In GSEA, DNA microarrays, or now RNA-Seq, are still performed and compared between two cell categories, but instead of focusing on individual genes in a long list, the focus is put on a gene set. Researchers analyze whether the majority of genes in the set fall in the extremes of this list: the top and bottom of the list correspond to the largest differences in expression between the two cell types. If the gene set falls at either the top (over-expressed) or bottom (under-expressed), it is thought to be related to the phenotypic differences.

In the method that is typically referred to as standard GSEA, there are three steps involved in the analytical process. The general steps are summarized below:

This can be described as:

Where is the rank of the gene, is the power usually set to 1 (if it were 0, it would be equivalent to the Kolmogorov–Smirnov test).

When GSEA was first proposed in 2003 some immediate concerns were raised regarding its methodology. These criticisms led to the use of the correlation-weighted Kolmogorov–Smirnov test, the normalized ES, and the false discovery rate calculation, all of which are the factors that currently define standard GSEA. However, GSEA has now also been criticized for the fact that its null distribution is superfluous, and too difficult to be worth calculating, as well as the fact that its Kolmogorov–Smirnov-like statistic is not as sensitive as the original. As an alternative, the method known as Simpler Enrichment Analysis (SEA), was proposed. This method assumes gene independence and uses a simpler approach to calculate t-test. However, it is thought that these assumptions are in fact too simplifying, and gene correlation cannot be disregarded.

See all
User Avatar
No comments yet.