Gene expression profiling
In the field of molecular biology, gene expression profiling is the measurement of the activity (the expression) of thousands of genes at once, to create a global picture of cellular function. These profiles can, for example, distinguish between cells that are actively dividing, or show how the cells react to a particular treatment. Many experiments of this sort measure an entire genome simultaneously, that is, every gene present in a particular cell.
Several transcriptomics technologies can be used to generate the necessary data for analysis. DNA microarrays[1] measure the relative activity of previously identified target genes. Sequence-based techniques, like RNA-Seq, provide information on the sequences of genes in addition to their expression level.
Background
Expression profiling is a logical next step after sequencing a genome: the sequence tells us what the cell could possibly do, while the expression profile tells us what it is actually doing at a point in time. Genes contain the instructions for making messenger RNA (mRNA), but at any moment each cell makes mRNA from only a fraction of the genes it carries. If a gene is used to produce mRNA, it is considered "on", otherwise "off". Many factors determine whether a gene is on or off, such as the time of day, whether or not the cell is actively dividing, its local environment, and chemical signals from other cells. For instance, skin cells, liver cells and nerve cells turn on (express) somewhat different genes and that is in large part what makes them different. Therefore, an expression profile allows one to deduce a cell's type, state, environment, and so forth.
Expression profiling experiments often involve measuring the relative amount of mRNA expressed in two or more experimental conditions. This is because altered levels of a specific sequence of mRNA suggest a changed need for the protein coded by the mRNA, perhaps indicating a homeostatic response or a pathological condition. For example, higher levels of mRNA coding for alcohol dehydrogenase suggest that the cells or tissues under study are responding to increased levels of ethanol in their environment. Similarly, if breast cancer cells express higher levels of mRNA associated with a particular transmembrane receptor than normal cells do, it might be that this receptor plays a role in breast cancer. A drug that interferes with this receptor may prevent or treat breast cancer. In developing a drug, one may perform gene expression profiling experiments to help assess the drug's toxicity, perhaps by looking for changing levels in the expression of cytochrome P450 genes, which may be a biomarker of drug metabolism.[2] Gene expression profiling may become an important diagnostic test.[3][4]
Comparison to proteomics
The human genome contains on the order of 20,000 genes which work in concert to produce roughly 1,000,000 distinct proteins. This is due to alternative splicing, and also because cells make important changes to proteins through posttranslational modification after they first construct them, so a given gene serves as the basis for many possible versions of a particular protein. In any case, a single mass spectrometry experiment can identify about 2,000 proteins[5] or 0.2% of the total. While knowledge of the precise proteins a cell makes (proteomics) is more relevant than knowing how much messenger RNA is made from each gene, gene expression profiling provides the most global picture possible in a single experiment. However, proteomics methodology is improving. In other species, such as yeast, it is possible to identify over 4,000 proteins in just over one hour.[6]
Use in hypothesis generation and testing
Sometimes, a scientist already has an idea of what is going on, a hypothesis, and he or she performs an expression profiling experiment with the idea of potentially disproving this hypothesis. In other words, the scientist is making a specific prediction about levels of expression that could turn out to be false.
More commonly, expression profiling takes place before enough is known about how genes interact with experimental conditions for a testable hypothesis to exist. With no hypothesis, there is nothing to disprove, but expression profiling can help to identify a candidate hypothesis for future experiments. Most early expression profiling experiments, and many current ones, have this form,[7] which is known as class discovery. A popular approach to class discovery involves grouping similar genes or samples together using one of the many existing clustering methods, such as the traditional k-means or hierarchical clustering, or the more recent MCL.[8] Apart from selecting a clustering algorithm, the user usually has to choose an appropriate proximity measure (distance or similarity) between data objects.[9] In a typical two-dimensional cluster display, similar samples (rows) and similar gene probes (columns) are organized so that they lie close together. The simplest form of class discovery would be to list all the genes that changed by more than a certain amount between two experimental conditions.
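To make the clustering step concrete, the sketch below groups both the samples and the gene probes of a log-scale expression matrix with hierarchical clustering, using a correlation-based proximity measure. The matrix is random placeholder data, and the particular distance, linkage, and cluster counts are illustrative choices rather than a prescribed protocol.

```python
# Minimal sketch: two-way hierarchical clustering of a log2 expression matrix,
# as one might do for class discovery. Data, distance, linkage, and cluster
# counts are all illustrative placeholders.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
expr = rng.normal(size=(12, 500))            # 12 samples x 500 genes (placeholder data)

# Cluster samples (rows) using 1 - Pearson correlation as the proximity measure.
sample_dist = pdist(expr, metric="correlation")
sample_tree = linkage(sample_dist, method="average")
sample_groups = fcluster(sample_tree, t=2, criterion="maxclust")

# Cluster gene probes (columns) the same way, on the transposed matrix.
gene_dist = pdist(expr.T, metric="correlation")
gene_tree = linkage(gene_dist, method="average")
gene_groups = fcluster(gene_tree, t=10, criterion="maxclust")

print(sample_groups)       # cluster label per sample
print(gene_groups[:20])    # cluster labels for the first 20 genes
```

Reordering rows and columns by the resulting trees produces the familiar two-dimensional heatmap layout described above.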
Class prediction is more difficult than class discovery, but it allows one to answer questions of direct clinical significance such as, given this profile, what is the probability that this patient will respond to this drug? This requires many examples of profiles that responded and did not respond, as well as cross-validation techniques to discriminate between them.
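A minimal sketch of the class-prediction idea, assuming scikit-learn is available: profiles labeled as responders or non-responders are fed to a regularized classifier and judged by cross-validated performance. The data here are random placeholders, so the resulting score should hover near chance.

```python
# Minimal sketch of class prediction with cross-validation: given expression
# profiles labelled "responder" / "non-responder", estimate how well a simple
# classifier separates them. Data and labels are random placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2000))              # 60 profiles x 2000 genes
y = rng.integers(0, 2, size=60)              # 1 = responded to drug, 0 = did not

model = make_pipeline(StandardScaler(),
                      LogisticRegression(penalty="l2", max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(scores.mean())                         # cross-validated AUC; ~0.5 for random data
```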
Limitations
In general, expression profiling studies report those genes that showed statistically significant differences under changed experimental conditions. This is typically a small fraction of the genome for several reasons. First, different cells and tissues express a subset of genes as a direct consequence of cellular differentiation, so many genes are turned off. Second, many genes code for proteins that are required for survival in very specific amounts, so their expression does not change. Third, cells use many other mechanisms to regulate proteins in addition to altering the amount of mRNA, so these genes may stay consistently expressed even when protein concentrations are rising and falling. Fourth, financial constraints limit expression profiling experiments to a small number of observations of the same gene under identical conditions, reducing statistical power and making it impossible to identify important but subtle changes. Finally, it takes a great amount of effort to discuss the biological significance of each regulated gene, so scientists often limit their discussion to a subset. Newer microarray analysis techniques automate certain aspects of attaching biological significance to expression profiling results, but this remains a very difficult problem.
The relatively short length of gene lists published from expression profiling experiments limits the extent to which experiments performed in different laboratories appear to agree. Placing expression profiling results in a publicly accessible microarray database makes it possible for researchers to assess expression patterns beyond the scope of published results, perhaps identifying similarity with their own work.
Validation of high throughput measurements
Both DNA microarrays and quantitative PCR exploit the preferential binding or "base pairing" of complementary nucleic acid sequences, and both are used in gene expression profiling, often in a serial fashion. While high throughput DNA microarrays lack the quantitative accuracy of qPCR, it takes about the same time to measure the gene expression of a few dozen genes via qPCR as it would to measure an entire genome using DNA microarrays. So it often makes sense to perform semi-quantitative DNA microarray analysis experiments to identify candidate genes, then perform qPCR on some of the most interesting candidate genes to validate the microarray results. Other experiments, such as a Western blot of some of the protein products of differentially expressed genes, make conclusions based on the expression profile more persuasive, since the mRNA levels do not necessarily correlate to the amount of expressed protein.
Statistical analysis
Data analysis of microarrays has become an area of intense research.[10] Simply stating that a group of genes were regulated by at least twofold, once a common practice, lacks a solid statistical footing. With five or fewer replicates in each group, typical for microarrays, a single outlier observation can create an apparent difference greater than two-fold. In addition, arbitrarily setting the bar at two-fold is not biologically sound, as it eliminates from consideration many genes with obvious biological significance.
Rather than identify differentially expressed genes using a fold change cutoff, one can use a variety of statistical tests or omnibus tests such as ANOVA, all of which consider both fold change and variability to create a p-value, an estimate of how often we would observe the data by chance alone. Applying p-values to microarrays is complicated by the large number of multiple comparisons (genes) involved. For example, a p-value of 0.05 is typically thought to indicate significance, since it estimates a 5% probability of observing the data by chance. But with 10,000 genes on a microarray, 500 genes would be identified as significant at p < 0.05 even if there were no difference between the experimental groups. One obvious solution is to consider significant only those genes meeting a much more stringent p-value criterion: for example, one could perform a Bonferroni correction on the p-values, or use a false discovery rate calculation to adjust p-values in proportion to the number of parallel tests involved. Unfortunately, these approaches may reduce the number of significant genes to zero, even when genes are in fact differentially expressed. Current statistics such as Rank products aim to strike a balance between false discovery of genes due to chance variation and non-discovery of differentially expressed genes. Commonly cited methods include Significance Analysis of Microarrays (SAM);[11] many others are available from Bioconductor and from analysis packages offered by bioinformatics companies.
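The sketch below illustrates the multiple-comparisons point with per-gene t-tests on simulated null data, followed by Bonferroni and Benjamini-Hochberg (false discovery rate) adjustment. It assumes SciPy and statsmodels are available and is not a substitute for the moderated statistics of SAM, limma, or similar packages.

```python
# Minimal sketch of per-gene significance testing with multiple-testing control:
# one t-test per gene, then Bonferroni and Benjamini-Hochberg (FDR) adjustment.
# The expression matrices are random placeholders on a log2 scale (no true signal).
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(2)
group_a = rng.normal(size=(5, 10000))        # 5 replicates x 10,000 genes
group_b = rng.normal(size=(5, 10000))

t_stat, p_values = stats.ttest_ind(group_a, group_b, axis=0)
log2_fc = group_b.mean(axis=0) - group_a.mean(axis=0)   # difference of log2 means

bonferroni_hits = multipletests(p_values, alpha=0.05, method="bonferroni")[0]
fdr_hits = multipletests(p_values, alpha=0.05, method="fdr_bh")[0]

print(int((p_values < 0.05).sum()))          # ~500 "significant" genes expected by chance
print(int(bonferroni_hits.sum()), int(fdr_hits.sum()))   # near zero after correction
```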
Selecting a different test usually identifies a different list of significant genes,[12] since each test operates under a specific set of assumptions and places a different emphasis on certain features in the data. Many tests begin with the assumption that the data follow a normal distribution, because that is a sensible starting point that often produces results which appear more significant. Some tests consider the joint distribution of all gene observations to estimate general variability in measurements,[13] while others look at each gene in isolation. Many modern microarray analysis techniques involve bootstrapping, machine learning or Monte Carlo methods.[14]
As the number of replicate measurements in a microarray experiment increases, various statistical approaches yield increasingly similar results, but lack of concordance between different statistical methods makes array results appear less trustworthy. The MAQC Project[15] makes recommendations to guide researchers in selecting more standard methods (e.g. using p-value and fold-change together for selecting the differentially expressed genes) so that experiments performed in different laboratories will agree better.
In contrast to the analysis of differentially expressed individual genes, another type of analysis focuses on differential expression or perturbation of pre-defined gene sets and is called gene set analysis.[16][17] Gene set analysis has demonstrated several major advantages over individual gene differential expression analysis.[16][17] Gene sets are groups of genes that are functionally related according to current knowledge; gene set analysis is therefore considered a knowledge-based analysis approach.[16] Commonly used gene sets include those derived from KEGG pathways, Gene Ontology terms, and gene groups that share some other functional annotation, such as a common transcriptional regulator. Representative gene set analysis methods include Gene Set Enrichment Analysis (GSEA),[16] which estimates significance of gene sets based on permutation of sample labels, and Generally Applicable Gene-set Enrichment (GAGE),[17] which tests the significance of gene sets based on permutation of gene labels or a parametric distribution.
Gene annotation
While the statistics may identify which gene products change under experimental conditions, making biological sense of expression profiling rests on knowing which protein each gene product makes and what function this protein performs. Gene annotation provides functional and other information, for example the location of each gene within a particular chromosome. Some functional annotations are more reliable than others; some are absent. Gene annotation databases change regularly, and various databases refer to the same protein by different names, reflecting a changing understanding of protein function. Use of standardized gene nomenclature helps address the naming aspect of the problem, but exact matching of transcripts to genes[18][19] remains an important consideration.
Categorizing regulated genes
Having identified some set of regulated genes, the next step in expression profiling involves looking for patterns within the regulated set. Do the proteins made from these genes perform similar functions? Are they chemically similar? Do they reside in similar parts of the cell? Gene ontology analysis provides a standard way to define these relationships. Gene ontologies start with very broad categories, e.g., "metabolic process" and break them down into smaller categories, e.g., "carbohydrate metabolic process" and finally into quite restrictive categories like "inositol and derivative phosphorylation".
Genes have other attributes besides biological function, chemical properties and cellular location. One can compose sets of genes based on proximity to other genes, association with a disease, and relationships with drugs or toxins. The Molecular Signatures Database[20] and the Comparative Toxicogenomics Database[21] are examples of resources to categorize genes in numerous ways.
Finding patterns among regulated genes
As regulated genes are categorized in terms of what they are and what they do, important relationships between genes may emerge.[23] For example, we might see evidence that a certain gene creates a protein to make an enzyme that activates a protein to turn on a second gene on our list. This second gene may be a transcription factor that regulates yet another gene from our list. Observing these links, we may begin to suspect that they represent much more than chance associations in the results, and that they are all on our list because of an underlying biological process. On the other hand, it could be that if one selected genes at random, one might find many that seem to have something in common. In this sense, rigorous statistical procedures are needed to test whether the emerging biological themes are significant or not. That is where gene set analysis[16][17] comes in.
Cause and effect relationships
Fairly straightforward statistics provide estimates of whether associations between genes on lists are greater than what one would expect by chance. These statistics are interesting, even if they represent a substantial oversimplification of what is really going on. Here is an example. Suppose there are 10,000 genes in an experiment, only 50 (0.5%) of which play a known role in making cholesterol. The experiment identifies 200 regulated genes. Of those, 40 (20%) turn out to be on a list of cholesterol genes as well. Based on the overall prevalence of the cholesterol genes (0.5%) one expects an average of 1 cholesterol gene for every 200 regulated genes, that is, 0.005 times 200. This expectation is an average, so one expects to see more than one some of the time. The question becomes how often we would see 40 instead of 1 due to pure chance.
According to the hypergeometric distribution, one would expect to try about 10^57 times (a 1 followed by 57 zeroes) before picking 39 or more of the cholesterol genes from a pool of 10,000 by drawing 200 genes at random. However little weight one attaches to the exact value of this infinitesimally small probability, one would conclude that the regulated gene list is enriched[24] in genes with a known cholesterol association.
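The same arithmetic can be reproduced with the hypergeometric tail probability. The sketch below, assuming SciPy is available, plugs in the numbers from this example: 10,000 genes, 50 cholesterol genes, 200 regulated genes, 40 of which are cholesterol genes.

```python
# Worked version of the enrichment arithmetic above, using the hypergeometric
# distribution to ask how surprising an overlap of 40 cholesterol genes is.
from scipy.stats import hypergeom

N, K, n, k = 10000, 50, 200, 40      # total genes, cholesterol genes, regulated genes, overlap

expected = n * K / N                  # = 1.0 cholesterol gene expected among 200 regulated genes
p_tail = hypergeom.sf(k - 1, N, K, n) # P(X >= 40): the chance of seeing 40 or more by luck

print(expected, p_tail)
```

The tail probability is so small that it may underflow to zero in double precision, which does not change the conclusion.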
One might further hypothesize that the experimental treatment regulates cholesterol, because the treatment seems to selectively regulate genes associated with cholesterol. While this may be true, there are a number of reasons why making this a firm conclusion based on enrichment alone represents an unwarranted leap of faith. First, as noted earlier, gene regulation may have no direct impact on protein regulation: even if the proteins coded for by these genes do nothing other than make cholesterol, showing that their mRNA is altered does not directly tell us what is happening at the protein level. It is quite possible that the amount of these cholesterol-related proteins remains constant under the experimental conditions. Second, even if protein levels do change, there may always be enough of these proteins around to make cholesterol as fast as it can possibly be made; that is, another protein, not on our list, may be the rate-determining step in the process of making cholesterol. Finally, proteins typically play many roles, so these genes may be regulated not because of their shared association with making cholesterol but because of a shared role in a completely independent process.
Bearing the foregoing caveats in mind, while gene profiles do not in themselves prove causal relationships between treatments and biological effects, they do offer unique biological insights that would often be very difficult to arrive at in other ways.
Using patterns to find regulated genes
As described above, one can identify significantly regulated genes first and then find patterns by comparing the list of significant genes to sets of genes known to share certain associations. One can also work the problem in reverse order. Here is a very simple example. Suppose there are 40 genes associated with a known process, for example, a predisposition to diabetes. Looking at two groups of expression profiles, one for mice fed a high carbohydrate diet and one for mice fed a low carbohydrate diet, one observes that all 40 diabetes genes are expressed at a higher level in the high carbohydrate group than in the low carbohydrate group. Regardless of whether any of these genes would have made it onto a list of significantly altered genes, observing all 40 up and none down appears unlikely to be the result of pure chance: flipping 40 heads in a row is predicted to occur about one time in a trillion attempts with a fair coin.
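The coin-flip argument corresponds to a one-sided sign test, sketched below with SciPy (the binomtest function requires a reasonably recent SciPy release).

```python
# The "40 heads in a row" argument as a quick calculation: under the null that
# each of the 40 diabetes genes is equally likely to go up or down, the chance
# that all 40 move up is (1/2)^40, roughly one in a trillion.
from scipy.stats import binomtest

p_all_up = 0.5 ** 40
print(p_all_up)                                   # ~9.1e-13

# Equivalent one-sided sign test: 40 "up" outcomes in 40 trials with p = 0.5.
result = binomtest(40, 40, 0.5, alternative="greater")
print(result.pvalue)
```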
For a type of cell, the group of genes whose combined expression pattern is uniquely characteristic of a given condition constitutes the gene signature of this condition. Ideally, the gene signature can be used to select a group of patients at a specific state of a disease with accuracy that facilitates selection of treatments.[25][26] Gene Set Enrichment Analysis (GSEA)[16] and similar methods[17] take advantage of this kind of logic but use more sophisticated statistics, because component genes in real processes display more complex behavior than simply moving up or down as a group, and the amount by which the genes move up or down is meaningful, not just the direction. In any case, these statistics measure how different the behavior of some small set of genes is compared to genes not in that small set.
GSEA uses a Kolmogorov–Smirnov-style statistic to see whether any previously defined gene sets exhibited unusual behavior in the current expression profile. This leads to a multiple hypothesis testing challenge, but reasonable methods exist to address it.[27]
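A minimal sketch of the running-sum statistic behind this idea: walk down a ranked gene list, stepping up at members of the gene set and down otherwise, and take the maximum deviation from zero as the enrichment score. This is the simple unweighted Kolmogorov–Smirnov-like form, not the full published GSEA procedure; the gene names and set are placeholders, and significance would still require a permutation scheme as described above.

```python
# Minimal sketch of a GSEA-style running-sum (Kolmogorov-Smirnov-like) statistic.
import numpy as np

def enrichment_score(ranked_genes, gene_set):
    """Maximum deviation of the running sum over a ranked gene list."""
    gene_set = set(gene_set)
    in_set = np.array([g in gene_set for g in ranked_genes])
    n_hits, n_misses = in_set.sum(), (~in_set).sum()
    step_hit, step_miss = 1.0 / n_hits, 1.0 / n_misses
    running = np.cumsum(np.where(in_set, step_hit, -step_miss))
    return running[np.argmax(np.abs(running))]

ranked = [f"gene{i}" for i in range(1000)]            # genes ranked by a chosen metric
top_set = {f"gene{i}" for i in range(50)}             # hypothetical set concentrated near the top
print(enrichment_score(ranked, top_set))              # close to +1 for a top-enriched set
```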
Conclusions
Expression profiling provides new information about what genes do under various conditions. Overall, microarray technology produces reliable expression profiles.[28] From this information one can generate new hypotheses about biology or test existing ones. However, the size and complexity of these experiments often results in a wide variety of possible interpretations. In many cases, analyzing expression profiling results takes far more effort than performing the initial experiments.
Most researchers use multiple statistical methods and exploratory data analysis before publishing their expression profiling results, coordinating their efforts with a bioinformatician or other expert in DNA microarrays, RNA sequencing and single cell sequencing. Good experimental design, adequate biological replication and follow up experiments play key roles in successful expression profiling experiments.
References
- ^ "Microarrays Factsheet". Archived from the original on April 9, 2002. Retrieved 2007-12-28.
- ^ Suter L, Babiss LE, Wheeldon EB (2004). "Toxicogenomics in predictive toxicology in drug development". Chem. Biol. 11 (2): 161–71. doi:10.1016/j.chembiol.2004.02.003. PMID 15123278.
- ^ Magic Z, Radulovic S, Brankovic-Magic M (2007). "cDNA microarrays: identification of gene signatures and their application in clinical practice". J BUON. 12 (Suppl 1): S39–44. PMID 17935276.
- ^ Cheung AN (2007). "Molecular targets in gynaecological cancers". Pathology. 39 (1): 26–45. doi:10.1080/00313020601153273. PMID 17365821. S2CID 40896577.
- ^ Mirza SP, Olivier M (2007). "Methods and approaches for the comprehensive characterization and quantification of cellular proteomes using mass spectrometry". Physiol Genomics. 33 (1): 3–11. doi:10.1152/physiolgenomics.00292.2007. PMC 2771641. PMID 18162499.
- ^ Hebert AS, Richards AL, et al. (2014). "The One Hour Yeast Proteome". Mol Cell Proteomics. 13 (1): 339–347. doi:10.1074/mcp.M113.034769. PMC 3879625. PMID 24143002.
- ^ Chen JJ (2007). "Key aspects of analyzing microarray gene-expression data". Pharmacogenomics. 8 (5): 473–82. doi:10.2217/14622416.8.5.473. PMID 17465711.
- ^ van Dongen, Stijn (2000). Graph Clustering by Flow Simulation. University of Utrecht.
- ^ Jaskowiak, Pablo A; Campello, Ricardo JGB; Costa, Ivan G (24 January 2014). "On the selection of appropriate distances for gene expression data clustering". BMC Bioinformatics. 15 (Suppl 2): S2. doi:10.1186/1471-2105-15-S2-S2. PMC 4072854. PMID 24564555.
- ^ Vardhanabhuti S, Blakemore SJ, Clark SM, Ghosh S, Stephens RJ, Rajagopalan D (2006). "A comparison of statistical tests for detecting differential expression using Affymetrix oligonucleotide microarrays". OMICS. 10 (4): 555–66. doi:10.1089/omi.2006.10.555. PMID 17233564.
- ^ "Significance Analysis of Microarrays". Archived from the original on 2008-01-20. Retrieved 2007-12-27.
- ^ Yauk CL, Berndt ML (2007). "Review of the literature examining the correlation among DNA microarray technologies". Environ. Mol. Mutagen. 48 (5): 380–94. Bibcode:2007EnvMM..48..380Y. doi:10.1002/em.20290. PMC 2682332. PMID 17370338.
- ^ Breitling R (2006). "Biological microarray interpretation: the rules of engagement" (PDF). Biochim. Biophys. Acta. 1759 (7): 319–27. doi:10.1016/j.bbaexp.2006.06.003. PMID 16904203. S2CID 1857997.
- ^ Draminski M, Rada-Iglesias A, Enroth S, Wadelius C, Koronacki J, Komorowski J (2008). "Monte Carlo feature selection for supervised classification". Bioinformatics. 24 (1): 110–7. doi:10.1093/bioinformatics/btm486. PMID 18048398.
- ^ Dr. Leming Shi, National Center for Toxicological Research. "MicroArray Quality Control (MAQC) Project". U.S. Food and Drug Administration. Retrieved 2007-12-26.[dead link]
- ^ a b c d e f Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP (2005). "Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles". Proc. Natl. Acad. Sci. U.S.A. 102 (43): 15545–50. doi:10.1073/pnas.0506580102. PMC 1239896. PMID 16199517.
- ^ a b c d e Luo W, Friedman M, Shedden K, Hankenson KD, Woolf JP (2009). "GAGE: generally applicable gene set enrichment for pathway analysis". BMC Bioinformatics. 10: 161. doi:10.1186/1471-2105-10-161. PMC 2696452. PMID 19473525.
- ^ Dai M, Wang P, Boyd AD, et al. (2005). "Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data". Nucleic Acids Res. 33 (20): e175. doi:10.1093/nar/gni179. PMC 1283542. PMID 16284200.
- ^ Alberts R, Terpstra P, Hardonk M, et al. (2007). "A verification protocol for the probe sequences of Affymetrix genome arrays reveals high probe accuracy for studies in mouse, human and rat". BMC Bioinformatics. 8: 132. doi:10.1186/1471-2105-8-132. PMC 1865557. PMID 17448222.
- ^ "GSEA – MSigDB". Retrieved 2008-01-03.
- ^ "CTD: The Comparative Toxicogenomics Database". Retrieved 2008-01-03.
- ^ "Ingenuity Systems". Retrieved 2007-12-27.
- ^ Alekseev OM, Richardson RT, Alekseev O, O'Rand MG (2009). "Analysis of gene expression profiles in HeLa cells in response to overexpression or siRNA-mediated depletion of NASP". Reprod. Biol. Endocrinol. 7: 45. doi:10.1186/1477-7827-7-45. PMC 2686705. PMID 19439102.
- ^ Curtis RK, Oresic M, Vidal-Puig A (2005). "Pathways to the analysis of microarray data". Trends Biotechnol. 23 (8): 429–35. doi:10.1016/j.tibtech.2005.05.011. PMID 15950303.
- ^ Mook S, Van't Veer LJ, Rutgers EJ, Piccart-Gebhart MJ, Cardoso F (2007). "Individualization of therapy using Mammaprint: from development to the MINDACT Trial". Cancer Genomics Proteomics. 4 (3): 147–55. PMID 17878518.
- ^ Corsello SM, Roti G, Ross KN, Chow KT, Galinsky I, DeAngelo DJ, Stone RM, Kung AL, Golub TR, Stegmaier K (June 2009). "Identification of AML1-ETO modulators by chemical genomics". Blood. 113 (24): 6193–205. doi:10.1182/blood-2008-07-166090. PMC 2699238. PMID 19377049.
- ^ "GSEA". Retrieved 2008-01-09.
- ^ Couzin J (2006). "Genomics. Microarray data reproduced, but some concerns remain". Science. 313 (5793): 1559. doi:10.1126/science.313.5793.1559a. PMID 16973852. S2CID 58528299.
Gene expression profiling

Fundamentals
Definition and Principles
Gene expression profiling is the simultaneous measurement of the expression levels of multiple or all genes within a biological sample, typically achieved by quantifying the abundance of messenger RNA (mRNA) transcripts, to produce a comprehensive profile representing the transcriptome under defined conditions.[5] This technique captures the dynamic activity of genes, allowing researchers to assess how cellular states, environmental stimuli, or disease processes alter transcriptional output across the genome. The resulting profile provides a snapshot of gene activity, highlighting patterns that reflect biological function and regulation.[6]

The foundational principles of gene expression profiling stem from the central dogma of molecular biology, which outlines the unidirectional flow of genetic information from DNA to RNA via transcription, followed by translation into proteins.[7] Transcription serves as the primary regulatory checkpoint, where external signals modulate the initiation and rate of mRNA synthesis, making it a focal point for profiling efforts.[8] Quantitatively, profiling measures expression as relative or absolute mRNA levels, often expressed in terms of fold changes, to distinguish qualitative differences in gene activation or repression between samples.[9]

Central to this approach are key concepts such as the transcriptome, which comprises the complete set of RNA molecules transcribed from the genome at a specific time, and differential expression, referring to statistically significant variations in gene activity across conditions or cell types.[6][10] Normalization of data typically relies on housekeeping genes—constitutively expressed genes like GAPDH or ACTB that maintain stable levels—to correct for technical biases in measurement.[11] Although mRNA abundance approximates protein production by indicating transcriptional output, this correlation is imperfect due to post-transcriptional controls, including mRNA stability and translational efficiency, which can decouple transcript levels from final protein amounts.[12][13] As an illustrative example, gene expression profiling of immune cells during bacterial infection often detects upregulation of genes encoding cytokines and antimicrobial peptides, such as those in the interferon pathway, thereby revealing the molecular basis of the host's defensive response.[14]
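As a toy illustration of fold changes with housekeeping-gene normalization, the sketch below scales a target gene to GAPDH in each sample before taking the treated-to-control ratio; the gene names and numbers are hypothetical and not drawn from any particular platform.

```python
# Tiny illustration of the fold-change idea with housekeeping-gene normalization.
# All values are hypothetical placeholder intensities.
import math

control = {"TARGET": 250.0, "GAPDH": 1000.0}
treated = {"TARGET": 900.0, "GAPDH": 1200.0}

norm_control = control["TARGET"] / control["GAPDH"]    # 0.25
norm_treated = treated["TARGET"] / treated["GAPDH"]    # 0.75

fold_change = norm_treated / norm_control              # 3.0
log2_fc = math.log2(fold_change)                       # ~1.58, i.e. about 3-fold up

print(fold_change, log2_fc)
```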
Historical Development

The foundations of gene expression profiling trace back to low-throughput techniques developed in the late 1970s, such as Northern blotting, which enabled the detection and quantification of specific RNA transcripts by hybridizing labeled probes to electrophoretically separated RNA samples transferred to a membrane. This method, introduced by Alwine et al. in 1977, laid the groundwork for measuring mRNA abundance but was limited to analyzing one or a few genes per experiment due to its labor-intensive nature.[15] By the mid-1990s, advancements like Serial Analysis of Gene Expression (SAGE), developed by Velculescu et al., marked a shift toward higher-throughput profiling by generating short sequence tags from expressed genes, allowing simultaneous analysis of thousands of transcripts via Sanger sequencing.[16]

The microarray era began in 1995 with the invention of complementary DNA (cDNA) microarrays by Schena, Shalon, and colleagues under Patrick Brown at Stanford University, enabling parallel hybridization-based measurement of thousands of gene expressions on glass slides printed with DNA probes.[17] Commercialization accelerated in 1996 when Affymetrix released the GeneChip platform, featuring high-density oligonucleotide arrays for genome-wide expression monitoring, as demonstrated in early applications like Lockhart et al.'s work on hybridization to arrays.[18] Microarrays gained widespread adoption during the 2000s, playing a key role in the Human Genome Project's functional annotation efforts and enabling large-scale studies, such as Golub et al.'s 1999 demonstration of cancer subclassification using gene expression patterns from acute leukemias.[19]

The advent of next-generation sequencing (NGS) around 2005, exemplified by the 454 pyrosequencing platform, revolutionized profiling by shifting from hybridization to direct sequencing of cDNA fragments, drastically increasing throughput and reducing biases.[20] RNA-Seq emerged as a cornerstone in 2008 with Mortazavi et al.'s method for mapping and quantifying mammalian transcriptomes through deep sequencing, providing unbiased detection of novel transcripts and precise abundance measurements.[21] By the 2010s, NGS costs plummeted—from millions per genome in the early 2000s to approximately $50–$200 per sample for RNA-Seq as of 2024, trending under $100 by 2025—driving a transition to sequencing-based methods over microarrays for most applications.[22]

In the 2010s, single-cell RNA-Seq (scRNA-Seq) advanced resolution to individual cells, with early protocols like Tang et al. in 2009 evolving into scalable droplet-based systems such as Drop-seq in 2015 by Macosko et al., enabling profiling of thousands of cells to uncover cellular heterogeneity.[23][24] Spatial transcriptomics further integrated positional data, highlighted by 10x Genomics' Visium platform launched in 2019, which captures gene expression on tissue sections at near-single-cell resolution.[25] Into the 2020s, integration of artificial intelligence has enhanced pattern detection in expression data, as seen in models like GET (2025) that simulate and predict gene expression dynamics from sequencing inputs to identify disease-associated regulatory networks.[26]

Techniques
Microarray-Based Methods
Microarray-based methods for gene expression profiling rely on the hybridization of labeled nucleic acids to immobilized DNA probes on a solid substrate, enabling the simultaneous measurement of expression levels for thousands of genes. In this approach, short DNA sequences known as probes, complementary to target genes of interest, are fixed to a chip or slide. Total RNA or mRNA from the sample is reverse-transcribed into complementary DNA (cDNA), labeled with fluorescent dyes, and allowed to hybridize to the probes. The intensity of fluorescence at each probe location, detected via laser scanning, quantifies the abundance of corresponding transcripts, providing a snapshot of gene expression patterns.[17][27]

Two primary types of microarrays are used: cDNA microarrays and oligonucleotide microarrays. cDNA microarrays typically employ longer probes (500–1,000 base pairs) derived from cloned cDNA fragments, which are spotted onto the array surface using robotic printing; these often operate in a two-color format, where samples from two conditions (e.g., control and treatment) are labeled with distinct dyes like Cy3 (green) and Cy5 (red) and hybridized to the same array for direct ratio-based comparisons.[17][28] In contrast, oligonucleotide microarrays use shorter synthetic probes (25–60 mers), either spotted or synthesized in situ; prominent examples include the Affymetrix GeneChip, which features in situ photolithographic synthesis of one-color arrays with multiple probes per gene for mismatch controls to enhance specificity, and Illumina BeadChips, which attach oligonucleotides to microbeads in wells for high-density, one-color detection.[29] NimbleGen arrays represent a variant of oligonucleotide microarrays using maskless photolithography for flexible, high-density probe synthesis, supporting both one- and two-color formats. Spotted arrays (common for cDNA) offer flexibility in custom probe selection but may suffer from variability in spotting, while in situ synthesized arrays provide uniformity and higher probe densities, up to 1.4 million probe sets (comprising over 5 million probes) on platforms like the Affymetrix Exon 1.0 ST array.[28][30]

The standard workflow begins with RNA extraction from cells or tissues, followed by isolation of mRNA and reverse transcription to generate first-strand cDNA. This cDNA is then labeled—using Cy3 and Cy5 for two-color arrays or a single dye like biotin for one-color systems—and hybridized to the microarray overnight under controlled temperature and stringency conditions to allow specific binding. Post-hybridization, unbound material is washed away, and the array is scanned with a laser to measure fluorescence intensities at each probe spot, yielding raw data as pixel intensity values that reflect transcript abundance.[27]

These methods achieved peak adoption in the 2000s for high-throughput profiling of known genes, offering cost-effective analysis for targeted gene panels, but have become niche with the rise of sequencing technologies due to limitations like probe cross-hybridization, which can lead to false positives from non-specific binding, and an inability to detect novel or low-abundance transcripts beyond the fixed probe set.[27] Compared to sequencing, microarrays exhibit lower dynamic range, typically spanning 3–4 orders of magnitude in detection sensitivity.[31] Invented in 1995, this technology revolutionized expression analysis by enabling genome-scale studies.[17]

Sequencing-Based Methods
Sequencing-based methods for gene expression profiling primarily rely on RNA sequencing (RNA-Seq), which enables comprehensive, unbiased measurement of the transcriptome by directly sequencing RNA molecules or their complementary DNA derivatives. Introduced as a transformative approach in the late 2000s, RNA-Seq has become the gold standard for transcriptomics since the 2010s, surpassing microarray techniques due to its ability to detect novel transcripts without prior knowledge of gene sequences. The core mechanism begins with RNA extraction from cells or tissues, followed by fragmentation to generate shorter pieces suitable for sequencing. These fragments are then reverse-transcribed into complementary DNA (cDNA), which undergoes library preparation involving end repair, adapter ligation, and amplification to create a sequencing-ready library.[32] Next-generation sequencing (NGS) platforms, such as Illumina's short-read systems, are commonly used to sequence these libraries, producing millions of reads that represent the original RNA population. The resulting data, typically output as FASTQ files containing raw sequence reads, require alignment to a reference genome using tools like STAR or HISAT2 to map reads accurately, accounting for splicing events. Quantification occurs by counting aligned reads per gene or transcript, often via featureCounts or Salmon, yielding digital expression measures in the form of read counts.[32] A key step in library preparation is mRNA enrichment, either through poly-A selection for eukaryotic polyadenylated transcripts or ribosomal RNA (rRNA) depletion to capture non-coding and prokaryotic RNAs, ensuring comprehensive coverage. Sequencing depth for human samples generally ranges from 20 to 50 million reads per sample to achieve robust detection of expressed genes, with higher depths for low-input or complex analyses.[33] Bulk RNA-Seq represents the standard variant, aggregating expression from millions of cells to provide an average profile suitable for population-level studies. Single-cell RNA-Seq (scRNA-Seq) extends this to individual cells, enabling dissection of cellular heterogeneity; droplet-based methods like those from 10x Genomics, commercialized around 2016, fueled an explosion in scRNA-Seq applications post-2016 by allowing high-throughput profiling of thousands to tens of thousands of cells per run.[34] Long-read sequencing technologies, such as PacBio's Iso-Seq, offer full-length transcript coverage, excelling in isoform resolution and alternative splicing detection without the need for computational assembly of short reads. Spatial RNA-Seq variants, including 10x Genomics' Visium platform introduced in 2020 and building on earlier spatial transcriptomics from 2016, preserve tissue architecture by capturing transcripts on spatially barcoded arrays, mapping expression to specific locations within samples. These methods provide key advantages, including the discovery of novel transcripts, precise quantification of alternative splicing, and sensitive detection of low-abundance genes, which microarrays cannot achieve due to reliance on predefined probes. RNA-Seq exhibits a dynamic range exceeding 10^5-fold, far surpassing the ~10^3-fold of arrays, allowing accurate measurement across expression levels from rare transcripts to highly abundant ones. 
By 2025, costs for bulk RNA-Seq have declined to under $200 per sample, including library preparation and sequencing, driven by advances in multiplexing and platform efficiency.[22] In precision medicine, RNA-Seq variants like scRNA-Seq are increasingly applied to resolve tumor heterogeneity, informing personalized therapies by revealing subclonal variations and therapeutic responses as of 2025.[35]

Other Techniques
Other methods for gene expression profiling include digital molecular barcoding approaches, such as the NanoString nCounter system, which uses color-coded barcoded probes to directly hybridize with target RNA molecules without amplification or sequencing. This technique enables targeted quantification of up to 1,000 genes per sample with high precision and reproducibility, particularly useful for clinical diagnostics and validation studies due to its low technical variability and ability to handle degraded RNA.[2] Unlike microarrays, NanoString provides digital counts rather than analog signals, reducing background noise, though it is limited to predefined gene panels and less comprehensive than RNA-Seq.[36]

Data Acquisition and Preprocessing
Experimental Design
The experimental design phase of gene expression profiling begins with clearly defining the biological question to guide all subsequent decisions, such as investigating the effect of a treatment on gene expression in specific cell types or tissues. This involves specifying the hypothesis, such as detecting differential expression due to a drug perturbation or disease state, to ensure the experiment addresses targeted objectives rather than exploratory aims. For instance, questions focused on treatment effects might prioritize controlled perturbations like siRNA knockdown or pharmacological interventions, while those involving disease modeling could use patient-derived samples. Adhering to established guidelines, such as the Minimum Information About a Microarray Experiment (MIAME) introduced in 2001, ensures comprehensive documentation of experimental parameters for reproducibility, with updates extending to sequencing-based methods via MINSEQE.[37][38]

Sample selection and preparation are critical, encompassing choices like cell lines for in vitro studies, animal tissues for preclinical models, or human biopsies for clinical relevance. Biological replicates, derived from independent sources (e.g., different animals or patients), are essential to capture variability, with a minimum of three recommended per group to enable statistical inference, though six or more enhance power for detecting subtle changes. Technical replicates, which assess measurement consistency, should supplement but not replace biological ones. To mitigate systematic biases, randomization of sample processing order is employed to avoid batch effects, where unintended variations from equipment or timing confound results. Controls include reference samples (e.g., untreated baselines) and exogenous spike-ins like ERCC controls for RNA-Seq, which provide standardized benchmarks for normalization and sensitivity assessment across experiments.[39][40]

Sample size determination relies on power analysis to detect desired fold changes (e.g., 1.5- to 2-fold) with adequate statistical power (typically 80-90%), factoring in expected variability and sequencing depth; tools like RNAseqPS facilitate this by simulating Poisson or negative binomial distributions for RNA-Seq data. For microarray experiments, similar calculations apply but emphasize probe hybridization efficiency. Platform selection weighs microarray for cost-effective, targeted profiling of known genes against RNA-Seq for unbiased, comprehensive transcriptome coverage, including low-abundance transcripts and isoforms, though RNA-Seq incurs higher costs and requires deeper sequencing for rare events. High-throughput formats, such as 96-well plates for single-cell RNA-Seq, support scaled designs but demand careful optimization. When human samples are involved, ethical oversight via Institutional Review Board (IRB) approval is mandatory to ensure informed consent, privacy protection, and minimal risk. Best practices from projects like ENCODE in the 2010s emphasize these elements for robust, reproducible RNA-Seq designs.[41][42][43][44][45]
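For the power-analysis step, a simplified sketch using a standard two-sample t-test power calculation (via statsmodels) is shown below. Dedicated RNA-Seq power tools model count distributions rather than normal data, so this standardized-effect-size calculation is only a rough stand-in, and the effect-size figure is an illustrative assumption.

```python
# Minimal sketch of a power calculation for a two-group design: how many
# biological replicates per group are needed to detect a given standardized
# effect (Cohen's d) with a two-sample t-test at 80% power.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# e.g. a 2-fold change (log2FC = 1) with a per-group SD of 0.5 log2 units -> d = 2.0
n_per_group = analysis.solve_power(effect_size=2.0, alpha=0.05, power=0.8,
                                   alternative="two-sided")
print(n_per_group)   # replicates per group (round up)
```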
Normalization and Quality Control

Normalization and quality control are essential initial steps in processing gene expression data to ensure reliability and comparability across samples. Normalization addresses technical variations such as differences in starting RNA amounts, library preparation efficiencies, and sequencing depths, while quality control identifies and mitigates artifacts like low-quality reads or outliers that could skew downstream analyses. These processes aim to remove systematic biases without altering biological signals, enabling accurate quantification of gene expression levels.[46]

For microarray-based methods, quantile normalization is a widely adopted technique that adjusts probe intensities so that the distribution of values across arrays matches a reference distribution, typically the average empirical distribution of all samples. This method assumes that most genes are not differentially expressed and equalizes the rank-order statistics between arrays, effectively correcting for global shifts and scaling differences. Introduced by Bolstad et al., quantile normalization has become standard in tools like the limma package for preprocessing Affymetrix and other oligonucleotide arrays.[47][48]

In sequencing-based methods like RNA-seq, normalization accounts for both sequencing depth (library size) and gene length biases to produce comparable expression estimates. Common metrics include reads per kilobase of transcript per million mapped reads (RPKM), fragments per kilobase of transcript per million mapped reads (FPKM) for paired-end data, and transcripts per million (TPM), which scales RPKM to sum to 1 million across genes for better cross-sample comparability. TPM for transcript i is calculated as

\mathrm{TPM}_i = \frac{r_i/\ell_i}{\sum_j r_j/\ell_j} \times 10^6,

where r_i is the number of reads mapped to transcript i and \ell_i is its length. This formulation ensures length- and depth-normalized values that are additive across transcripts. For count-based differential analysis, methods like the median-of-ratios approach in DESeq2 estimate size factors by dividing each gene's counts by its geometric mean across samples, then taking the median of these ratios as the normalization factor.[49][50]

Quality control begins with assessing RNA integrity using the RNA Integrity Number (RIN), an automated metric derived from electropherogram analysis that scores total RNA from 1 (degraded) to 10 (intact), with values above 7 generally recommended for reliable gene expression profiling. For sequencing data, tools like FastQC evaluate raw reads for per-base quality scores, adapter contamination, overrepresented sequences, and GC content bias, flagging issues that necessitate trimming or filtering. Post-alignment, principal component analysis (PCA) plots visualize sample clustering to detect outliers, while saturation curves assess sequencing depth adequacy by plotting unique reads against total reads. Low-quality reads are typically removed using thresholds such as Phred scores below 20.[51][52]

Batch effects, arising from technical variables like different experimental runs or reagent lots, can confound biological interpretations and are detected via PCA or surrogate variable analysis showing non-biological clustering. The ComBat method corrects these using an empirical Bayes framework that adjusts expression values while preserving biological variance, modeling batch as a covariate in a parametric or non-parametric manner.
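The sketch below works through the two count-based normalizations just described on a toy matrix: TPM, and DESeq2-style median-of-ratios size factors. The counts and gene lengths are placeholders, and real pipelines would filter zero-count genes before taking logarithms.

```python
# Minimal sketch of TPM and median-of-ratios size factors on a toy count matrix
# (rows = genes, columns = samples). Values are placeholders.
import numpy as np

counts = np.array([[100, 200],
                   [400, 900],
                   [ 50, 120]], dtype=float)
lengths_kb = np.array([2.0, 4.0, 1.0])           # gene lengths in kilobases

# TPM: per-kilobase rate, then scale each sample so the rates sum to one million.
rate = counts / lengths_kb[:, None]
tpm = rate / rate.sum(axis=0) * 1e6

# Median-of-ratios size factors: each gene's count divided by its geometric mean
# across samples, then the per-sample median of those ratios.
log_geo_mean = np.log(counts).mean(axis=1)
size_factors = np.exp(np.median(np.log(counts) - log_geo_mean[:, None], axis=0))
normalized = counts / size_factors

print(tpm.round(1))
print(size_factors.round(3))
print(normalized.round(1))
```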
Spike-in controls, such as External RNA Controls Consortium (ERCC) mixes added at known concentrations, facilitate absolute quantification and validation of normalization by providing an independent scale for technical performance assessment.[53][54]

Common pitfalls include ignoring 3' bias in poly-A selected RNA-seq, where reverse transcription from oligo-dT primers favors reads near the poly-A tail, leading to uneven coverage and distorted expression estimates for genes with varying 3' UTR lengths. Replicates from experimental design aid in robust QC by allowing variance estimation during outlier detection. Software packages like edgeR (using trimmed mean of M-values, TMM, normalization) and limma (with voom transformation for count data) integrate these preprocessing steps seamlessly before differential expression analysis.[55][46][48]

Analysis Methods
Differential Expression Analysis
Differential expression analysis identifies genes whose expression levels differ significantly between experimental conditions, such as treated versus control samples or disease versus healthy states, by comparing normalized expression values across groups.[50] The core metric is the log2 fold change (log2FC), which quantifies the magnitude of change on a logarithmic scale, where a log2FC of 1 indicates a twofold upregulation and -1 a twofold downregulation.[56] Statistical significance is assessed using hypothesis tests to compute p-values, which are then adjusted for multiple testing across thousands of genes to control the false discovery rate (FDR) via methods like the Benjamini-Hochberg procedure. This approach assumes that expression data have been preprocessed through normalization to account for technical variations.

For microarray data, which produce continuous intensity values, standard parametric tests like the t-test are commonly applied, though moderated versions improve reliability by borrowing information across genes. The limma package implements linear models with empirical Bayes moderation of t-statistics, enhancing power for detecting differences in small sample sizes. In contrast, RNA-Seq data consist of discrete read counts following a negative binomial distribution to model biological variability and overdispersion, where variance exceeds the mean.[50] Tools like DESeq2 fit generalized linear models assuming variance = mean + α × mean², with shrinkage estimation for dispersions and fold changes to stabilize estimates for low-count genes.[50] Similarly, edgeR employs empirical Bayes methods to estimate common and tagwise dispersions, enabling robust testing even with limited replicates.[56]

Genes are typically selected as differentially expressed using thresholds such as |log2FC| > 1 and FDR < 0.05, balancing biological relevance and statistical confidence.[50] Results are often visualized in volcano plots, scatter plots of log2FC against -log10(p-value), where points above significance cutoffs highlight differentially expressed genes.[57] For RNA-Seq, zero counts—common due to low expression or technical dropout—are handled by adding small pseudocounts (e.g., 1) before log transformation for fold change calculation, preventing undefined values, though testing models like DESeq2 avoid pseudocounts in likelihood-based inference.[57] Power to detect changes depends on sample size, sequencing depth, and effect size; for instance, detecting a twofold change (log2FC = 1) at 80% power and FDR < 0.05 often requires at least three replicates per group for moderately expressed genes.

In practice, this analysis has revealed upregulated genes in cancer tissues compared to normal, such as proliferation markers like MKI67 in breast tumors, aiding molecular classification. Early microarray studies on acute leukemias identified sets of upregulated oncogenes distinguishing subtypes, demonstrating the method's utility in biomarker discovery.

Statistical and Computational Tools
Gene expression profiling generates high-dimensional datasets where the number of genes often exceeds the number of samples, necessitating advanced statistical and computational tools to uncover patterns and make predictions. Unsupervised learning methods, such as clustering and dimensionality reduction, are fundamental for exploring inherent structures in these data without predefined labels. Clustering algorithms group genes or samples based on similarity in expression profiles; for instance, hierarchical clustering builds a tree-like structure to reveal nested relationships, while k-means partitioning assigns data points to a fixed number of clusters by minimizing intra-cluster variance. These techniques have been pivotal since the late 1990s, enabling the identification of co-expressed gene modules and sample subtypes in microarray experiments.[58] Dimensionality reduction complements clustering by projecting high-dimensional data into lower-dimensional spaces to mitigate noise and enhance visualization. Principal Component Analysis (PCA) achieves this linearly by identifying directions of maximum variance, commonly used as a preprocessing step in gene expression workflows to retain the top principal components that capture most variability. Nonlinear methods like t-distributed Stochastic Neighbor Embedding (t-SNE) preserve local structures for visualizing clusters in two or three dimensions, particularly effective for single-cell RNA-seq data where it reveals cell type separations. Uniform Manifold Approximation and Projection (UMAP) offers a faster alternative to t-SNE, balancing local and global data structures while scaling better to large datasets, as demonstrated in comparative evaluations of transcriptomic analyses.[59] Supervised learning methods leverage labeled data to train models for classification or regression tasks in gene expression profiling. Support Vector Machines (SVMs) construct hyperplanes to separate classes with maximum margins, proving robust for phenotype prediction from expression profiles, such as distinguishing cancer subtypes, through efficient handling of high-dimensional inputs via kernel tricks. Random Forests, an ensemble of decision trees, aggregate predictions to reduce overfitting and provide variable importance rankings, widely applied in genomic classification for tasks like tumor identification with high accuracy on microarray data. Regression variants, including ridge or lasso, predict continuous traits like drug response by penalizing coefficients to address multicollinearity in expression matrices.[60][61] Advanced techniques extend these approaches to infer complex relationships and enhance interpretability. Weighted Gene Co-expression Network Analysis (WGCNA) constructs scale-free networks from pairwise gene correlations, using soft thresholding to identify modules of co-expressed genes that correlate with traits, as formalized in its foundational framework for microarray and RNA-seq data. For machine learning models in the 2020s, SHapley Additive exPlanations (SHAP) quantifies feature contributions to predictions, aiding interpretability in genomic applications like variant effect scoring by attributing importance to specific genes or interactions.[62][63] Software ecosystems facilitate implementation of these tools. In R, Bioconductor packages like clusterProfiler support clustering and downstream exploration of gene groups, integrating statistical tests for profile comparisons. 
Python's Scanpy toolkit streamlines single-cell RNA-seq analysis, incorporating UMAP, Leiden clustering, and batch correction for scalable processing of millions of cells.[64][65]

High-dimensionality poses the "curse of dimensionality," where sparse data leads to overfitting and unreliable distances; mitigation strategies include feature selection to retain informative genes and embedding into lower dimensions via PCA or autoencoders before modeling. Recent advancements in gene pair methods, which focus on ratios or differences between the expression levels of gene pairs, have improved biomarker discovery by reducing dimensionality while preserving relational information, as demonstrated in a 2025 review of gene pair methods for precision medicine and in 2025 studies applying them to cancer subtyping.[66][67] These approaches yield robust signatures with fewer features than single-gene models, enhancing predictive power in heterogeneous datasets. As of 2025, integration of deep learning models, such as graph neural networks for co-expression analysis, has further advanced pattern detection in large-scale transcriptomic data.[68]
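For orientation, a hedged sketch of a typical Scanpy workflow covering the steps named above (normalization, PCA, neighbor graph, UMAP, Leiden clustering) is shown below; the input file name is hypothetical, and Leiden clustering requires the optional graph-clustering dependencies to be installed.

```python
# Hedged sketch of a common single-cell workflow with Scanpy.
import scanpy as sc

adata = sc.read_h5ad("cells.h5ad")               # AnnData object: cells x genes (hypothetical file)

sc.pp.filter_cells(adata, min_genes=200)         # basic QC filters
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.normalize_total(adata, target_sum=1e4)     # depth normalization
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
adata = adata[:, adata.var["highly_variable"]]

sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, n_comps=50)                     # linear dimensionality reduction
sc.pp.neighbors(adata, n_neighbors=15)           # kNN graph built on the PCA space
sc.tl.umap(adata)                                # nonlinear embedding for visualization
sc.tl.leiden(adata)                              # graph-based clustering (needs leidenalg/igraph)

print(adata.obs["leiden"].value_counts())        # cells per cluster
```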
Functional Annotation and Pathway Analysis

Functional annotation involves mapping differentially expressed genes to known biological functions, processes, and components using standardized ontologies and databases. The Gene Ontology (GO) consortium provides a structured vocabulary for annotating genes across three domains: molecular function, biological process, and cellular component, enabling systematic classification of gene products.[69] Databases such as UniProt integrate GO terms with protein sequence and functional data, while Ensembl and NCBI Gene offer comprehensive gene annotations derived from experimental evidence, computational predictions, and literature curation.[70][71][72] In RNA-Seq profiling, handling transcript isoforms is crucial, as multiple isoforms per gene can contribute to expression variability; tools often aggregate isoform-level counts to gene-level summaries or use isoform-specific quantification to avoid underestimating functional diversity.[73]

Tools like DAVID and g:Profiler facilitate ontology assignment by integrating multiple annotation sources for high-throughput analysis of gene lists. DAVID clusters functionally related genes and terms into biological modules, supporting GO enrichment alongside other annotations from over 40 databases.[74] g:Profiler performs functional profiling by mapping genes to GO terms, pathways, and regulatory motifs, with support for over 500 organisms and regular updates from Ensembl.[75] These tools assign annotations based on evidence codes, prioritizing experimentally validated terms to ensure reliability in interpreting expression profiles.

Pathway analysis extends annotation by identifying coordinated changes in biological pathways, using enrichment tests to detect over-representation of profiled genes in predefined sets. Common databases include KEGG, which maps genes to metabolic and signaling pathways, and Reactome, focusing on detailed reaction networks. Over-representation analysis (ORA) applies to lists of differentially expressed genes, employing the hypergeometric test (equivalent to Fisher's exact test) to compute significance:

P(X \ge k) = \sum_{i=k}^{\min(n,K)} \frac{\binom{K}{i}\binom{N-K}{n-i}}{\binom{N}{n}},

where N is the total number of genes, K the number in the pathway, n the number of differentially expressed genes, and k the observed overlap.[76] This test assesses whether pathway genes are enriched beyond chance, with multiple-testing corrections like Benjamini-Hochberg to control false positives.

Gene Set Enrichment Analysis (GSEA) complements ORA by evaluating ranked gene lists from full expression profiles, detecting subtle shifts in pathway activity without arbitrary significance cutoffs.[77] GSEA uses a Kolmogorov-Smirnov-like statistic to measure enrichment at the top or bottom of the ranking, weighted by the gene metric, and permutes phenotypes to estimate empirical p-values. For example, in cancer studies, GSEA has revealed upregulation of the PI3K-AKT pathway, linking altered expression of genes like PIK3CA and AKT1 to tumor proliferation and survival.[78]

Regulated genes are often categorized by function to highlight regulatory mechanisms, such as grouping into transcription factors (e.g., E2F family regulating cell cycle and apoptosis genes) or apoptosis-related sets (e.g., BCL2 family modulators).[79] This categorization integrates annotations to infer upstream regulators and downstream effects, aiding in the interpretation of co-regulated patterns in expression profiles.
As of 2025, advancements in pathway analysis include AI-driven tools for dynamic pathway modeling, enhancing predictions of pathway perturbations in disease contexts.[80]
