Gene expression
from Wikipedia

Gene expression is the process by which the information contained within a gene is used to produce a functional gene product, such as a protein or a functional RNA molecule. This process involves multiple steps, including the transcription of the gene's sequence into RNA. For protein-coding genes, this RNA is further translated into a chain of amino acids that folds into a protein, while for non-coding genes, the resulting RNA itself serves a functional role in the cell. Gene expression enables cells to utilize the genetic information in genes to carry out a wide range of biological functions. While expression levels can be regulated in response to cellular needs and environmental changes, some genes are expressed continuously with little variation.[1]

Mechanism


Transcription

RNA polymerase moving along a stretch of DNA, leaving behind a newly synthesized strand of RNA.
The process of transcription is carried out by RNA polymerase (RNAP), which uses DNA (black) as a template and produces RNA (blue).

The production of an RNA copy from a DNA strand is called transcription. It is performed by RNA polymerases, which add one ribonucleotide at a time to a growing RNA strand according to the complementarity of the nucleotide bases. This RNA is complementary to the template 3′ → 5′ DNA strand,[2] except that uracil (U) takes the place of thymine (T) in the RNA, and apart from occasional transcription errors.
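
The base-pairing rule described above can be sketched in a few lines of Python. This is a minimal illustration, not a model of RNA polymerase: the function name and the example template are hypothetical, and errors, promoters and terminators are ignored.

```python
# Base added to the RNA opposite each base of the DNA template strand.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe(template_3_to_5: str) -> str:
    """Return the RNA (read 5'->3') copied from a template written 3'->5'."""
    return "".join(DNA_TO_RNA[base] for base in template_3_to_5.upper())

# Hypothetical template strand, written 3'->5'.
template = "TACGGATTC"
rna = transcribe(template)
print(rna)  # AUGCCUAAG
```

Because the template is read 3′ → 5′ while the RNA grows 5′ → 3′, writing the template in that orientation lets the sketch map bases position by position without reversing the string.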

In bacteria, transcription is carried out by a single type of RNA polymerase, which needs to bind a DNA sequence called a Pribnow box with the help of the sigma factor protein (σ factor) to start transcription. In eukaryotes, transcription is performed in the nucleus by three types of RNA polymerases, each of which needs a special DNA sequence called the promoter and a set of DNA-binding proteins (transcription factors) to initiate the process (see regulation of transcription below). RNA polymerase I is responsible for transcription of ribosomal RNA (rRNA) genes. RNA polymerase II (Pol II) transcribes all protein-coding genes as well as some non-coding RNAs (e.g., snRNAs, snoRNAs or long non-coding RNAs). RNA polymerase III transcribes 5S rRNA, transfer RNA (tRNA) genes, and some small non-coding RNAs (e.g., 7SK). Transcription ends when the polymerase encounters a sequence called the terminator.

mRNA processing


While transcription of prokaryotic protein-coding genes creates messenger RNA (mRNA) that is ready for translation into protein, transcription of eukaryotic genes leaves a primary transcript of RNA (pre-RNA), which first has to undergo a series of modifications to become a mature RNA. The types and steps of the maturation processes vary between coding and non-coding pre-RNAs; for example, even though pre-RNA molecules for both mRNA and tRNA undergo splicing, the steps and machinery involved are different.[3] The processing of non-coding RNA is described below (non-coding RNA maturation).

The processing of pre-mRNA includes 5′ capping, a set of enzymatic reactions that add 7-methylguanosine (m7G) to the 5′ end of pre-mRNA and thus protect the RNA from degradation by exonucleases.[4] The m7G cap is then bound by the cap-binding complex heterodimer (CBP20/CBP80), which aids in mRNA export to the cytoplasm and also protects the RNA from decapping.[5]

Another modification is 3′ cleavage and polyadenylation.[6] These occur if a polyadenylation signal sequence (5′-AAUAAA-3′) is present in the pre-mRNA, usually between the protein-coding sequence and the terminator.[7] The pre-mRNA is first cleaved and then a series of ~200 adenines (A) is added to form a poly(A) tail, which protects the RNA from degradation.[8] The poly(A) tail is bound by multiple poly(A)-binding proteins (PABPs) necessary for mRNA export and translation re-initiation.[9] In the inverse process of deadenylation, poly(A) tails are shortened by the CCR4-Not 3′-5′ exonuclease, which often leads to full transcript decay.[10]
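
The cleave-then-extend order of events can be sketched as string operations. This is a toy sketch under stated assumptions: the function name is hypothetical, and the `cut_offset` parameter (how far downstream of the signal the cut falls) is an illustrative value that the text above does not specify.

```python
POLY_A_SIGNAL = "AAUAAA"  # polyadenylation signal quoted in the text

def cleave_and_polyadenylate(pre_mrna: str, cut_offset: int = 20,
                             tail_length: int = 200) -> str:
    """Cut downstream of the first AAUAAA and append a poly(A) tail.

    cut_offset is an assumed distance for illustration only. A transcript
    without the signal is returned unchanged.
    """
    signal_start = pre_mrna.find(POLY_A_SIGNAL)
    if signal_start == -1:
        return pre_mrna
    cut_site = min(signal_start + len(POLY_A_SIGNAL) + cut_offset, len(pre_mrna))
    return pre_mrna[:cut_site] + "A" * tail_length

# Hypothetical transcript: four bases, the signal, then 30 downstream bases.
pre_mrna = "GGCU" + "AAUAAA" + "C" * 30
processed = cleave_and_polyadenylate(pre_mrna, cut_offset=5, tail_length=10)
print(processed)  # GGCUAAUAAACCCCCAAAAAAAAAA
```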

Pre-mRNA is spliced to form mature mRNA.
Illustration of exons and introns in pre-mRNA and the formation of mature mRNA by splicing. The UTRs (in green) are non-coding parts of exons at the ends of the mRNA.

A very important modification of eukaryotic pre-mRNA is RNA splicing. The majority of eukaryotic pre-mRNAs consist of alternating segments called exons and introns.[11] During splicing, an RNA-protein catalytic complex known as the spliceosome catalyzes two transesterification reactions, which remove an intron, release it in the form of a lariat structure, and then splice the neighbouring exons together.[12] In certain cases, some introns or exons can be either removed or retained in the mature mRNA.[13] This so-called alternative splicing creates a series of different transcripts originating from a single gene. Because these transcripts can potentially be translated into different proteins, splicing extends the complexity of eukaryotic gene expression and the size of a species' proteome.[14]
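
A small sketch can make the bookkeeping of splicing and alternative splicing concrete. It assumes the exon and intron boundaries are already known and represents the pre-mRNA as a list of labelled segments; the function name, the segment sequences and the `skipped_exons` parameter are all hypothetical, and none of the spliceosome chemistry is modelled.

```python
def splice(segments, skipped_exons=frozenset()):
    """Join exons in order, dropping introns and any exons listed for skipping.

    segments: list of (kind, sequence) pairs, kind being "exon" or "intron".
    skipped_exons: 1-based exon numbers to leave out (alternative splicing).
    """
    mature = []
    exon_number = 0
    for kind, sequence in segments:
        if kind != "exon":
            continue  # introns are removed
        exon_number += 1
        if exon_number not in skipped_exons:
            mature.append(sequence)
    return "".join(mature)

# Hypothetical pre-mRNA with three exons and two introns.
pre_mrna = [
    ("exon", "AUGGCU"), ("intron", "GUAAGUCAG"),
    ("exon", "GAUCCA"), ("intron", "GUGAGUUAG"),
    ("exon", "UGGUAA"),
]
print(splice(pre_mrna))                     # AUGGCUGAUCCAUGGUAA
print(splice(pre_mrna, skipped_exons={2}))  # AUGGCUUGGUAA
```

The two calls show how one gene can give rise to two different transcripts simply by retaining or skipping the second exon.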

Extensive RNA processing may be an evolutionary advantage made possible by the nucleus of eukaryotes. In prokaryotes, transcription and translation happen together, whilst in eukaryotes, the nuclear membrane separates the two processes, giving time for RNA processing to occur.[15]

Non-coding RNA maturation


In most organisms, non-coding RNA (ncRNA) genes are transcribed as precursors that undergo further processing. Ribosomal RNAs (rRNAs) are often transcribed as a pre-rRNA that contains one or more rRNAs. The pre-rRNA is cleaved and modified (2′-O-methylation and pseudouridine formation) at specific sites by approximately 150 different small nucleolus-restricted RNA species, called snoRNAs. SnoRNAs associate with proteins, forming snoRNPs. While the snoRNA part base-pairs with the target RNA and thus positions the modification at a precise site, the protein part performs the catalytic reaction. In eukaryotes, in particular, a snoRNP called RNase MRP cleaves the 45S pre-rRNA into the 28S, 5.8S, and 18S rRNAs. The rRNA and RNA processing factors form large aggregates called the nucleolus.[16]

In the case of transfer RNA (tRNA), for example, the 5′ sequence is removed by RNase P,[17] whereas the 3′ end is removed by the tRNase Z enzyme[18] and the non-templated 3′ CCA tail is added by a nucleotidyl transferase.[19] MicroRNAs (miRNAs) are first transcribed as primary transcripts, or pri-miRNA, with a cap and poly-A tail, and are processed in the cell nucleus by the enzymes Drosha and Pasha into short, 70-nucleotide stem-loop structures known as pre-miRNA. After export, the pre-miRNA is processed into mature miRNA in the cytoplasm by the endonuclease Dicer, which also initiates the formation of the RNA-induced silencing complex (RISC), which contains the Argonaute protein.

Even snRNAs and snoRNAs themselves undergo a series of modifications before they become part of a functional RNP complex.[20] This is done either in the nucleoplasm or in specialized compartments called Cajal bodies.[21] Their bases are methylated or pseudouridylated by a group of small Cajal body-specific RNAs (scaRNAs), which are structurally similar to snoRNAs.[22]

Translation


For some non-coding RNA, the mature RNA is the final gene product.[23] In the case of messenger RNA (mRNA) the RNA is an information carrier coding for the synthesis of one or more proteins. mRNA carrying a single protein sequence (common in eukaryotes) is monocistronic whilst mRNA carrying multiple protein sequences (common in prokaryotes) is known as polycistronic.

A ribosome translating messenger RNA into a chain of amino acids (protein).
During translation, a tRNA charged with an amino acid enters the ribosome and aligns with the correct mRNA triplet. The ribosome then adds the amino acid to the growing protein chain.

Every mRNA consists of three parts: a 5′ untranslated region (5′UTR), a protein-coding region or open reading frame (ORF), and a 3′ untranslated region (3′UTR). The coding region carries the information for protein synthesis, written in the genetic code as a series of nucleotide triplets. Each triplet of nucleotides in the coding region is called a codon and corresponds to a binding site complementary to an anticodon triplet in transfer RNA. Transfer RNAs with the same anticodon sequence always carry an identical type of amino acid. Amino acids are then chained together by the ribosome according to the order of triplets in the coding region. The ribosome helps transfer RNA to bind to messenger RNA, takes the amino acid from each transfer RNA, and joins the amino acids into a polypeptide chain that has not yet folded.[24][25] Each mRNA molecule is translated into many protein molecules, on average ~2800 in mammals.[26][27]
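
Reading a coding region codon by codon can be sketched as a table lookup. The table below is deliberately partial (only the codons used in the example, taken from the standard genetic code), and the function name and example ORF are hypothetical; a codon missing from the table raises a `KeyError` rather than being guessed.

```python
# Partial codon table from the standard genetic code; "*" marks a stop codon.
CODON_TABLE = {
    "AUG": "M", "GCU": "A", "GAU": "D", "CCA": "P", "UGG": "W",
    "UAA": "*", "UAG": "*", "UGA": "*",
}

def translate(orf: str) -> str:
    """Translate an ORF (5'->3', starting at the start codon) into one-letter amino acids."""
    protein = []
    for i in range(0, len(orf) - 2, 3):
        amino_acid = CODON_TABLE[orf[i:i + 3]]
        if amino_acid == "*":
            break  # a stop codon ends the chain
        protein.append(amino_acid)
    return "".join(protein)

# Hypothetical open reading frame: five sense codons followed by a stop codon.
print(translate("AUGGCUGAUCCAUGGUAA"))  # MADPW
```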

In prokaryotes, translation generally occurs at the point of transcription (co-transcriptionally), often using a messenger RNA that is still in the process of being created. In eukaryotes, translation can occur in a variety of regions of the cell, depending on where the protein being made is needed. Major locations are the cytoplasm for soluble cytoplasmic proteins and the membrane of the endoplasmic reticulum for proteins that are destined for export from the cell or insertion into a cell membrane. Proteins that are to be produced at the endoplasmic reticulum are recognised part-way through the translation process. This is governed by the signal recognition particle, a protein that binds to the ribosome and directs it to the endoplasmic reticulum when it finds a signal peptide on the growing (nascent) amino acid chain.[28]

Regulation

A cat with patches of orange and black fur.
The patchy colours of a tortoiseshell cat are the result of different levels of expression of pigmentation genes in different areas of the skin.

Regulation of gene expression is the control of the amount and timing of appearance of the functional product of a gene. Control of expression is vital to allow a cell to produce the gene products it needs when it needs them; in turn, this gives cells the flexibility to adapt to a variable environment, external signals, damage to the cell, and other stimuli. More generally, gene regulation gives the cell control over all structure and function, and is the basis for cellular differentiation, morphogenesis and the versatility and adaptability of any organism.

Numerous terms are used to describe types of genes depending on how they are regulated; these include:

  • A constitutive gene is a gene that is transcribed continually as opposed to a facultative gene, which is only transcribed when needed.
  • A housekeeping gene is a gene that is required to maintain basic cellular function and so is typically expressed in all cell types of an organism. Examples include actin, GAPDH and ubiquitin. Some housekeeping genes are transcribed at a relatively constant rate and these genes can be used as a reference point in experiments to measure the expression rates of other genes.
  • A facultative gene is a gene only transcribed when needed as opposed to a constitutive gene.
  • An inducible gene is a gene whose expression is either responsive to environmental change or dependent on the position in the cell cycle.
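
The use of a housekeeping gene as a reference point, mentioned in the list above, can be sketched as a simple ratio. The sample names and measurement values below are made up for illustration, and dividing by a single reference gene is only one possible normalisation scheme.

```python
def normalise_to_reference(target_levels, reference_levels):
    """Divide each sample's target-gene signal by its reference-gene signal."""
    return {
        sample: target_levels[sample] / reference_levels[sample]
        for sample in target_levels
    }

# Hypothetical raw signals (arbitrary units) for a gene of interest and for a
# housekeeping gene such as GAPDH measured in the same samples.
target = {"control": 50.0, "treated": 240.0}
reference = {"control": 100.0, "treated": 120.0}

print(normalise_to_reference(target, reference))
# {'control': 0.5, 'treated': 2.0}
```

Normalising this way corrects for differences in the amount of starting material between samples, which is why a reference gene with a relatively constant transcription rate is needed.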

Any step of gene expression may be modulated, from the DNA-RNA transcription step to post-translational modification of a protein. The stability of the final gene product, whether it is RNA or protein, also contributes to the expression level of the gene—an unstable product results in a low expression level. In general gene expression is regulated through changes[29] in the number and type of interactions between molecules[30] that collectively influence transcription of DNA[31] and translation of RNA.[32]
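
The link between product stability and expression level can be illustrated with a deliberately simple model. The sketch assumes constant synthesis and first-order decay, which the text above does not state; under that assumption the steady-state amount equals the synthesis rate divided by the decay rate. The function name and the numbers are hypothetical.

```python
import math

def steady_state_level(synthesis_rate: float, half_life: float) -> float:
    """Steady-state amount of a product made at a constant rate.

    Assumes first-order decay, so the decay rate constant is ln(2) / half_life
    and the steady state is synthesis_rate / decay_rate.
    """
    decay_rate = math.log(2) / half_life
    return synthesis_rate / decay_rate

# Same synthesis rate (molecules per hour), different half-lives (hours).
stable = steady_state_level(synthesis_rate=10.0, half_life=8.0)
unstable = steady_state_level(synthesis_rate=10.0, half_life=0.5)
print(stable > unstable)  # True
```

In this model, halving the half-life halves the steady-state level, which is one way to read the statement that an unstable product results in a low expression level.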


Transcriptional

When lactose is present in a prokaryote, it acts as an inducer and inactivates the repressor so that the genes for lactose metabolism can be transcribed.
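
The inducer logic in the caption above can be written as a toy Boolean model. It is a sketch only: the function and parameter names are hypothetical, and every other layer of control over these genes is ignored.

```python
def lactose_genes_transcribed(lactose_present: bool,
                              repressor_present: bool = True) -> bool:
    """Return True if the lactose-metabolism genes can be transcribed.

    The repressor blocks transcription unless lactose, acting as an inducer,
    inactivates it.
    """
    repressor_active = repressor_present and not lactose_present
    return not repressor_active

print(lactose_genes_transcribed(lactose_present=False))  # False
print(lactose_genes_transcribed(lactose_present=True))   # True
```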

Regulation of transcription can be broken down into three main routes of influence: genetic (direct interaction of a control factor with the gene), modulation (interaction of a control factor with the transcription machinery), and epigenetic (non-sequence changes in DNA structure that influence transcription).[33][34]

Ribbon diagram of the lambda repressor dimer bound to DNA.
The lambda repressor transcription factor (green) binds as a dimer to the major groove of its DNA target (red and blue) and disables initiation of transcription. From PDB: 1LMB.

Direct interaction with DNA is the simplest and the most direct method by which a protein changes transcription levels.[35] Genes often have several protein binding sites around the coding region with the specific function of regulating transcription.[36] There are many classes of regulatory DNA binding sites known as enhancers, insulators and silencers.[37] The mechanisms for regulating transcription are varied, from blocking key binding sites on the DNA for RNA polymerase to acting as an activator and promoting transcription by assisting RNA polymerase binding.[38]

The activity of transcription factors is further modulated by intracellular signals causing protein post-translational modification including phosphorylation, acetylation, or glycosylation.[39] These changes influence a transcription factor's ability to bind, directly or indirectly, to promoter DNA, to recruit RNA polymerase, or to favor elongation of a newly synthesized RNA molecule.[40]

The nuclear membrane in eukaryotes allows further regulation of transcription factors by the duration of their presence in the nucleus, which is regulated by reversible changes in their structure and by binding of other proteins.[41] Environmental stimuli or endocrine signals[42] may cause modification of regulatory proteins[43] eliciting cascades of intracellular signals,[44] which result in regulation of gene expression.

It has become apparent that there is a significant influence of non-DNA-sequence specific effects on transcription.[45] These effects are referred to as epigenetic and involve the higher order structure of DNA, non-sequence specific DNA binding proteins and chemical modification of DNA.[46] In general epigenetic effects alter the accessibility of DNA to proteins and so modulate transcription.[47]

A cartoon representation of the nucleosome structure.
In eukaryotes, DNA is organized in the form of nucleosomes. Note how the DNA (blue and green) is tightly wrapped around the protein core made of a histone octamer (ribbon coils), restricting access to the DNA. From PDB: 1KX5.

In eukaryotes the structure of chromatin, controlled by the histone code, regulates access to DNA with significant impacts on the expression of genes in euchromatin and heterochromatin areas.[48]

Enhancers, transcription factors, mediator complex and DNA loops

Regulation of transcription in mammals. An active enhancer regulatory region is enabled to interact with the promoter region of its target gene by formation of a chromosome loop. This can initiate messenger RNA (mRNA) synthesis by RNA polymerase II (RNAP II) bound to the promoter at the transcription start site of the gene. The loop is stabilized by one architectural protein anchored to the enhancer and one anchored to the promoter and these proteins are joined to form a dimer (red zigzags). Specific regulatory transcription factors bind to DNA sequence motifs on the enhancer. General transcription factors bind to the promoter. When a transcription factor is activated by a signal (here indicated as phosphorylation shown by a small red star on a transcription factor on the enhancer) the enhancer is activated and can now activate its target promoter. The active enhancer is transcribed on each strand of DNA in opposite directions by bound RNAP IIs. Mediator proteins (a complex consisting of about 26 proteins in an interacting structure) communicate regulatory signals from the enhancer DNA-bound transcription factors to the promoter.

Gene expression in mammals is regulated by many cis-regulatory elements, including core promoters and promoter-proximal elements that are located near the transcription start sites of genes, upstream on the DNA (towards the 5' region of the sense strand). Other important cis-regulatory modules are localized in DNA regions that are distant from the transcription start sites. These include enhancers, silencers, insulators and tethering elements.[49] Enhancers and their associated transcription factors have a leading role in the regulation of gene expression.[50]

Enhancers are genome regions that regulate genes. Enhancers control cell-type-specific gene expression programs, most often by looping through long distances to come into physical proximity with the promoters of their target genes.[51] Multiple enhancers, each often tens or hundreds of thousands of nucleotides distant from their target genes, loop to their target gene promoters and coordinate with each other to control gene expression.[51]

The illustration shows an enhancer looping around to come into proximity with the promoter of a target gene. The loop is stabilized by a dimer of a connector protein (e.g. a dimer of CTCF or YY1). One member of the dimer is anchored to its binding motif on the enhancer and the other member is anchored to its binding motif on the promoter (represented by the red zigzags in the illustration).[52] Several cell function-specific transcription factors (among the about 1,600 transcription factors in a human cell)[53] generally bind to specific motifs on an enhancer.[54] A small combination of these enhancer-bound transcription factors, when brought close to a promoter by a DNA loop, governs the transcription level of the target gene. Mediator (a complex usually consisting of about 26 proteins in an interacting structure) communicates regulatory signals from enhancer DNA-bound transcription factors directly to the RNA polymerase II (pol II) enzyme bound to the promoter.[55]

Enhancers, when active, are generally transcribed from both strands of DNA with RNA polymerases acting in two different directions, producing two eRNAs as illustrated in the figure.[56] An inactive enhancer may be bound by an inactive transcription factor. Phosphorylation of the transcription factor may activate it and that activated transcription factor may then activate the enhancer to which it is bound (see small red star representing phosphorylation of transcription factor bound to enhancer in the illustration).[57] An activated enhancer begins transcription of its RNA before activating transcription of messenger RNA from its target gene.[58]

DNA methylation and demethylation

DNA methylation is the addition of a methyl group to DNA at a cytosine. The image shows a cytosine single-ring base with a methyl group added to the 5 carbon. In mammals, DNA methylation occurs almost exclusively at a cytosine that is followed by a guanine.

DNA methylation is a widespread mechanism for epigenetic influence on gene expression. It is seen in bacteria and eukaryotes and has roles in heritable transcription silencing and transcription regulation. Methylation most often occurs on a cytosine (see Figure). Methylation of cytosine primarily occurs in dinucleotide sequences where a cytosine is followed by a guanine, a CpG site. The number of CpG sites in the human genome is about 28 million.[59] Depending on the type of cell, about 70% of the CpG sites have a methylated cytosine.[60]
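
Locating CpG sites and computing the methylated fraction can be sketched as follows. The sequence and the set of methylated positions are invented for illustration, and the function names are hypothetical; real methylation data would come from an experiment, not from the sequence itself.

```python
def cpg_positions(dna: str) -> list:
    """Return 0-based positions of every cytosine that is followed by a guanine."""
    dna = dna.upper()
    return [i for i in range(len(dna) - 1) if dna[i:i + 2] == "CG"]

def methylated_fraction(dna: str, methylated_cytosines: set) -> float:
    """Fraction of CpG sites whose cytosine position is in the methylated set."""
    sites = cpg_positions(dna)
    if not sites:
        return 0.0
    return sum(1 for site in sites if site in methylated_cytosines) / len(sites)

# Hypothetical sequence with three CpG sites, two of them methylated.
sequence = "ACGTTCGCGA"
print(cpg_positions(sequence))                # [1, 5, 7]
print(methylated_fraction(sequence, {1, 7}))  # two of the three sites
```

Applying the figures quoted above in the same way, 0.70 × 28 million gives roughly 19.6 million methylated CpG sites in a typical cell.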

Methylation of cytosine in DNA has a major role in regulating gene expression. Methylation of CpGs in a promoter region of a gene usually represses gene transcription[61] while methylation of CpGs in the body of a gene increases expression.[62] TET enzymes play a central role in demethylation of methylated cytosines. Demethylation of CpGs in a gene promoter by TET enzyme activity increases transcription of the gene.[63]

In learning and memory

The identified areas of the human brain are involved in memory formation.

In a rat, contextual fear conditioning (CFC) is a painful learning experience. Just one episode of CFC can result in a life-long fearful memory.[64] After an episode of CFC, cytosine methylation is altered in the promoter regions of about 9.17% of all genes in the hippocampus neuron DNA of a rat.[65] The hippocampus is where new memories are initially stored. After CFC about 500 genes have increased transcription (often due to demethylation of CpG sites in a promoter region) and about 1,000 genes have decreased transcription (often due to newly formed 5-methylcytosine at CpG sites in a promoter region). The pattern of induced and repressed genes within neurons appears to provide a molecular basis for forming the first transient memory of this training event in the hippocampus of the rat brain.[65]

Some specific mechanisms guiding new DNA methylations and new DNA demethylations in the hippocampus during memory establishment have been identified (see [66] for a summary). One mechanism includes guiding the short isoform of the TET1 DNA demethylation enzyme, TET1s, to about 600 locations on the genome. The guidance is performed by association of TET1s with EGR1 protein, a transcription factor important in memory formation. Bringing TET1s to these locations initiates DNA demethylation at those sites, up-regulating associated genes. A second mechanism involves DNMT3A2, a splice-isoform of DNA methyltransferase DNMT3A, which adds methyl groups to cytosines in DNA. This isoform is induced by synaptic activity, and its location of action appears to be determined by histone post-translational modifications (a histone code). The resulting new messenger RNAs are then transported by messenger RNP particles (neuronal granules) to synapses of the neurons, where they can be translated into proteins affecting the activities of synapses.[66]

In particular, the brain-derived neurotrophic factor gene (BDNF) is known as a "learning gene".[67] After CFC there was upregulation of BDNF gene expression, related to decreased CpG methylation of certain internal promoters of the gene, and this was correlated with learning.[67]

In cancer


The majority of gene promoters contain a CpG island with numerous CpG sites.[68] When many of a gene's promoter CpG sites are methylated the gene becomes silenced.[69] Colorectal cancers typically have 3 to 6 driver mutations and 33 to 66 hitchhiker or passenger mutations.[70] However, transcriptional silencing may be of more importance than mutation in causing progression to cancer. For example, in colorectal cancers about 600 to 800 genes are transcriptionally silenced by CpG island methylation (see regulation of transcription in cancer). Transcriptional repression in cancer can also occur by other epigenetic mechanisms, such as altered expression of microRNAs.[71] In breast cancer, transcriptional repression of BRCA1 may occur more frequently by over-transcribed microRNA-182 than by hypermethylation of the BRCA1 promoter (see Low expression of BRCA1 in breast and ovarian cancers).

Post-transcriptional regulation


In eukaryotes, where export of RNA is required before translation is possible, nuclear export is thought to provide additional control over gene expression. All transport in and out of the nucleus is via the nuclear pore and transport is controlled by a wide range of importin and exportin proteins.[72]

Expression of a gene coding for a protein is only possible if the messenger RNA carrying the code survives long enough to be translated.[73] In a typical cell, an RNA molecule is only stable if specifically protected from degradation.[74] RNA degradation has particular importance in regulation of expression in eukaryotic cells where mRNA has to travel significant distances before being translated.[75] In eukaryotes, RNA is stabilised by certain post-transcriptional modifications, particularly the 5′ cap and poly-adenylated tail.[76]

Intentional degradation of mRNA is used not just as a defence mechanism from foreign RNA (normally from viruses) but also as a route of mRNA destabilisation.[77] If an mRNA molecule has a complementary sequence to a small interfering RNA then it is targeted for destruction via the RNA interference pathway.[78]

Three prime untranslated regions and microRNAs


Three prime untranslated regions (3′UTRs) of messenger RNAs (mRNAs) often contain regulatory sequences that post-transcriptionally influence gene expression. Such 3′-UTRs often contain binding sites both for microRNAs (miRNAs) and for regulatory proteins.[79] By binding to specific sites within the 3′-UTR, miRNAs can decrease gene expression of various mRNAs by either inhibiting translation or directly causing degradation of the transcript.[80] The 3′-UTR also may have silencer regions that bind repressor proteins that inhibit the expression of an mRNA.[81]

The 3′-UTR often contains microRNA response elements (MREs). MREs are sequences to which miRNAs bind. These are prevalent motifs within 3′-UTRs. Among all regulatory motifs within the 3′-UTRs (e.g. including silencer regions), MREs make up about half of the motifs.[82]
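
Scanning a 3′-UTR for candidate miRNA binding sites can be sketched as a search for sequence complementarity. The sketch rests on an assumption not stated in the text above: that a site is a perfect match to the miRNA "seed" (taken here as nucleotides 2 to 8), a common simplification in target prediction. Both sequences are invented and the function names are hypothetical.

```python
RNA_COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA sequence."""
    return "".join(RNA_COMPLEMENT[base] for base in reversed(rna))

def seed_match_sites(mirna: str, utr: str) -> list:
    """Return 0-based UTR positions that pair perfectly with the miRNA seed.

    The seed is assumed to be miRNA nucleotides 2-8; wobble pairs and
    imperfect matches are ignored.
    """
    site = reverse_complement(mirna[1:8])
    return [i for i in range(len(utr) - len(site) + 1)
            if utr[i:i + len(site)] == site]

# Hypothetical miRNA and 3'-UTR fragment, both written 5'->3'.
mirna = "UACGGAUCCAGUUCGAUUGCA"
utr = "CCAGAUCCGUAAUUGAUCCGUGG"
print(seed_match_sites(mirna, utr))  # [3, 14]
```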

As of 2014, the miRBase web site,[83] an archive of miRNA sequences and annotations, listed 28,645 entries in 233 biological species. Of these, 1,881 miRNAs were in annotated human miRNA loci. miRNAs were predicted to have an average of about four hundred target mRNAs (affecting expression of several hundred genes).[84] Friedman et al.[84] estimate that >45,000 miRNA target sites within human mRNA 3′UTRs are conserved above background levels, and >60% of human protein-coding genes have been under selective pressure to maintain pairing to miRNAs.

Direct experiments show that a single miRNA can reduce the stability of hundreds of unique mRNAs.[85] Other experiments show that a single miRNA may repress the production of hundreds of proteins, but that this repression often is relatively mild (less than 2-fold).[86][87]

The effects of miRNA dysregulation of gene expression seem to be important in cancer.[88] For instance, in gastrointestinal cancers, nine miRNAs have been identified as epigenetically altered and effective in downregulating DNA repair enzymes.[89]

The effects of miRNA dysregulation of gene expression also seem to be important in neuropsychiatric disorders, such as schizophrenia, bipolar disorder, major depression, Parkinson's disease, Alzheimer's disease and autism spectrum disorders.[90][91]

Translational

A chemical structure of the neomycin molecule.
Neomycin is an example of a small molecule that reduces expression of all protein genes, inevitably leading to cell death; it thus acts as an antibiotic.

Direct regulation of translation is less prevalent than control of transcription or mRNA stability but is occasionally used.[92] Inhibition of protein translation is a major target for toxins and antibiotics, so they can kill a cell by overriding its normal gene expression control.[93] Protein synthesis inhibitors include the antibiotic neomycin and the toxin ricin.[94]

Post-translational modifications


Post-translational modifications (PTMs) are covalent modifications to proteins. Like RNA splicing, they help to significantly diversify the proteome. These modifications are usually catalyzed by enzymes. Additionally, processes like covalent additions to amino acid side chain residues can often be reversed by other enzymes. However, some, like the proteolytic cleavage of the protein backbone, are irreversible.[95]

PTMs play many important roles in the cell.[96] For example, phosphorylation is primarily involved in activating and deactivating proteins and in signaling pathways.[97] PTMs are involved in transcriptional regulation: an important function of acetylation and methylation is histone tail modification, which alters how accessible DNA is for transcription.[95] They can also be seen in the immune system, where glycosylation plays a key role.[98] One type of PTM can initiate another type of PTM, as can be seen in how ubiquitination tags proteins for degradation through proteolysis.[95] Proteolysis, other than being involved in breaking down proteins, is also important in activating and deactivating them, and in regulating biological processes such as DNA transcription and cell death.[99]

Measurement

Schematic karyogram of a human, showing an overview of the expression of the human genome using G banding, which is a method that includes Giemsa staining, wherein the lighter staining regions are generally more transcriptionally active, whereas darker regions are more inactive.

Measuring gene expression is an important part of many life sciences, as the ability to quantify the level at which a particular gene is expressed within a cell, tissue or organism can provide a lot of valuable information.

Similarly, the analysis of the location of protein expression is a powerful tool, and this can be done on an organismal or cellular scale. Investigation of localization is particularly important for the study of development in multicellular organisms and as an indicator of protein function in single cells. Ideally, measurement of expression is done by detecting the final gene product (for many genes, this is the protein); however, it is often easier to detect one of the precursors, typically mRNA and to infer gene-expression levels from these measurements.

mRNA quantification


Levels of mRNA can be quantitatively measured by northern blotting, which provides size and sequence information about the mRNA molecules.[100] A sample of RNA is separated on an agarose gel and hybridized to a radioactively labeled RNA probe that is complementary to the target sequence.[101] The radiolabeled RNA is then detected by an autoradiograph.[102] Because the use of radioactive reagents makes the procedure time-consuming and potentially dangerous, alternative labeling and detection methods, such as digoxigenin and biotin chemistries, have been developed.[103] Perceived disadvantages of Northern blotting are that large quantities of RNA are required and that quantification may not be completely accurate, as it involves measuring band strength in an image of a gel.[104] On the other hand, the additional mRNA size information from the Northern blot allows the discrimination of alternately spliced transcripts.[105][106]

Another approach for measuring mRNA abundance is RT-qPCR. In this technique, reverse transcription is followed by quantitative PCR. Reverse transcription first generates a DNA template from the mRNA; this single-stranded template is called cDNA. The cDNA template is then amplified in the quantitative step, during which the fluorescence emitted by labeled hybridization probes or intercalating dyes changes as the DNA amplification process progresses.[107] With a carefully constructed standard curve, qPCR can produce an absolute measurement of the number of copies of original mRNA, typically in units of copies per nanolitre of homogenized tissue or copies per cell.[108] qPCR is very sensitive (detection of a single mRNA molecule is theoretically possible), but can be expensive depending on the type of reporter used; fluorescently labeled oligonucleotide probes are more expensive than non-specific intercalating fluorescent dyes.[109]
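
Absolute quantification with a standard curve can be sketched as a straight-line fit. The sketch assumes the curve relates the quantification cycle (Ct) of each standard to the base-10 logarithm of its known copy number, which is a common way to parameterise it but is not spelled out above. The dilution series, the Ct values and the function names are all hypothetical.

```python
import math

def fit_standard_curve(copies, ct_values):
    """Least-squares fit of Ct = slope * log10(copies) + intercept."""
    xs = [math.log10(c) for c in copies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ct_values) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ct_values))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

def copies_from_ct(ct, slope, intercept):
    """Invert the standard curve to estimate the starting copy number."""
    return 10 ** ((ct - intercept) / slope)

# Hypothetical ten-fold dilution series of a standard with known copy numbers.
standard_copies = [1e2, 1e3, 1e4, 1e5]
standard_ct = [30.0, 26.7, 23.4, 20.1]

slope, intercept = fit_standard_curve(standard_copies, standard_ct)
unknown = copies_from_ct(25.0, slope, intercept)
print(round(slope, 2), round(intercept, 2))  # -3.3 36.6
print(1e3 < unknown < 1e4)                   # True
```

A sample whose Ct falls between two standards is assigned a copy number between theirs, which is why the accuracy of the result depends on how carefully the standard curve is constructed.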

For expression profiling, or high-throughput analysis of many genes within a sample, quantitative PCR may be performed for hundreds of genes simultaneously in the case of low-density arrays.[110] A second approach is the hybridization microarray. A single array or "chip" may contain probes to determine transcript levels for every known gene in the genome of one or more organisms.[111] Alternatively, "tag based" technologies like Serial analysis of gene expression (SAGE) and RNA-Seq, which can provide a relative measure of the cellular concentration of different mRNAs, can be used.[112] An advantage of tag-based methods is the "open architecture", allowing for the exact measurement of any transcript, with a known or unknown sequence.[113] Next-generation sequencing (NGS) such as RNA-Seq is another approach, producing vast quantities of sequence data that can be matched to a reference genome. Although NGS is comparatively time-consuming, expensive, and resource-intensive, it can identify single-nucleotide polymorphisms, splice-variants, and novel genes, and can also be used to profile expression in organisms for which little or no sequence information is available.[114]

Protein quantification

For genes encoding proteins, the expression level can be directly assessed by a number of methods with some clear analogies to the techniques for mRNA quantification.

One of the most commonly used methods is to perform a Western blot against the protein of interest.[115] This gives information on the size of the protein in addition to its identity. A sample (often cellular lysate) is separated on a polyacrylamide gel, transferred to a membrane and then probed with an antibody to the protein of interest. The antibody can either be conjugated to a fluorophore or to horseradish peroxidase for imaging and/or quantification. The gel-based nature of this assay makes quantification less accurate, but it has the advantage of being able to identify later modifications to the protein, for example proteolysis or ubiquitination, from changes in size.

mRNA-protein correlation

While transcription directly reflects gene expression, the copy number of mRNA molecules does not directly correlate with the number of protein molecules translated from mRNA. Quantification of both protein and mRNA permits a correlation of the two levels. Regulation on each step of gene expression can impact the correlation, as shown for regulation of translation[27] or protein stability.[116] Post-translational factors, such as protein transport in highly polar cells,[117] can influence the measured mRNA-protein correlation as well.

Localization

Visualization of hunchback mRNA in Drosophila embryo.
In situ-hybridization of Drosophila embryos at different developmental stages for the mRNA responsible for the expression of hunchback. High intensity of blue color marks places with high hunchback mRNA quantity.

Analysis of expression is not limited to quantification; localization can also be determined. mRNA can be detected with a suitably labelled complementary nucleic acid strand, and protein can be detected via labelled antibodies. The probed sample is then observed by microscopy to identify where the mRNA or protein is located.

A ribbon diagram of green fluorescent protein resembling barrel structure.
The three-dimensional structure of green fluorescent protein. The residues in the centre of the "barrel" are responsible for the production of green light after exposure to higher-energy blue light. From PDB: 1EMA.

By replacing the gene with a new version fused to a green fluorescent protein marker or similar, expression may be directly quantified in live cells. This is done by imaging using a fluorescence microscope. It is very difficult to clone a GFP-fused protein into its native location in the genome without affecting expression levels, so this method often cannot be used to measure endogenous gene expression. It is, however, widely used to measure the expression of a gene artificially introduced into the cell, for example via an expression vector. By fusing a target protein to a fluorescent reporter, the protein's behavior, including its cellular localization and expression level, can be significantly changed.

The enzyme-linked immunosorbent assay works by using antibodies immobilised on a microtiter plate to capture proteins of interest from samples added to the well. Using a detection antibody conjugated to an enzyme or fluorophore the quantity of bound protein can be accurately measured by fluorometric or colourimetric detection. The detection process is very similar to that of a Western blot, but by avoiding the gel steps more accurate quantification can be achieved.

Expression system

Tet-ON inducible shRNA system

An expression system is a system specifically designed for the production of a gene product of choice. This is normally a protein, although it may also be RNA, such as tRNA or a ribozyme. An expression system consists of a gene, normally encoded by DNA, and the molecular machinery required to transcribe the DNA into mRNA and translate the mRNA into protein using the reagents provided. In the broadest sense this includes every living cell, but the term is more normally used to refer to expression as a laboratory tool. An expression system is therefore often artificial in some manner. Expression systems are, however, a fundamentally natural process. Viruses are an excellent example: they replicate by using the host cell as an expression system for the viral proteins and genome.

Inducible expression

Doxycycline is used in "Tet-on" and "Tet-off" tetracycline-controlled transcriptional activation to regulate transgene expression in organisms and cell cultures.

In nature

In addition to these biological tools, certain naturally observed configurations of DNA (genes, promoters, enhancers, repressors) and the associated machinery itself are referred to as an expression system. This term is normally used in the case where a gene or set of genes is switched on under well defined conditions, for example, the simple repressor switch expression system in Lambda phage and the lac operator system in bacteria. Several natural expression systems are directly used or modified and used for artificial expression systems such as the Tet-on and Tet-off expression system.

Gene networks

Genes have sometimes been regarded as nodes in a network, with inputs being proteins such as transcription factors, and outputs being the level of gene expression. The node itself performs a function, and the operation of these functions has been interpreted as a kind of information processing within cells that determines cellular behavior.

Gene networks can also be constructed without formulating an explicit causal model. This is often the case when assembling networks from large expression data sets.[118] Covariation and correlation of expression are computed across a large sample of cases and measurements (often transcriptome or proteome data). The source of variation can be either experimental or natural (observational). There are several ways to construct gene expression networks, but one common approach is to compute a matrix of all pair-wise correlations of expression across conditions, time points, or individuals and convert the matrix (after thresholding at some cut-off value) into a graphical representation in which nodes represent genes, transcripts, or proteins and edges connecting these nodes represent the strength of association (see GeneNetwork and GeneNetwork 2).[119]
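
The correlation-and-threshold approach described above can be sketched as follows. This is an illustrative example only: the gene names, expression values, the choice of Pearson correlation, and the 0.8 cut-off are all assumptions made for the sketch, not values taken from any real data set or from the GeneNetwork tools.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation information
    return cov / sqrt(var_x * var_y)

def coexpression_network(expression, threshold=0.8):
    """Return {(gene1, gene2): r} for every pair with |r| >= threshold."""
    edges = {}
    for g1, g2 in combinations(sorted(expression), 2):
        r = pearson(expression[g1], expression[g2])
        if abs(r) >= threshold:
            edges[(g1, g2)] = r
    return edges

# Hypothetical expression levels of four genes across five conditions.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.8],
    "geneC": [5.0, 4.1, 2.9, 2.0, 1.1],
    "geneD": [3.0, 1.0, 4.0, 1.0, 5.0],
}

network = coexpression_network(expression)
for (g1, g2), r in network.items():
    print(f"{g1} -- {g2}: r = {r:+.2f}")
```

Nodes are the dictionary keys and each retained pair is an edge weighted by its correlation. Edges here describe association only, consistent with the point that such networks need not encode a causal model.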

Techniques and tools

The following experimental techniques are used to measure gene expression and are listed in roughly chronological order, starting with the older, more established technologies. They are divided into two groups based on their degree of multiplexity.

Gene expression databases

See also

References

from Grokipedia
Gene expression is the process by which the genetic information encoded in a gene's DNA sequence is converted into a functional product, such as a protein or non-coding RNA, primarily through the sequential steps of transcription and translation. In transcription, the enzyme RNA polymerase synthesizes a complementary messenger RNA (mRNA) strand from the DNA template; in eukaryotic cells this takes place in the nucleus, and the mRNA is then exported to the cytoplasm. Translation then occurs at ribosomes, where the mRNA is read in triplets (codons) to direct the assembly of amino acids into a polypeptide chain, forming the primary structure of a protein that folds into its functional form. This flow of information, the central dogma of molecular biology, enables the manifestation of genetic traits and cellular functions, with only a subset of an organism's genes expressed in any given cell at a specific time.

Most gene expression is not constitutive but tightly regulated to maintain cellular homeostasis and adapt to internal and external signals. Regulation occurs at multiple levels, including transcriptional control, where transcription factors bind to promoter regions of DNA to initiate or repress mRNA synthesis; post-transcriptional mechanisms, such as mRNA splicing, capping, polyadenylation, and degradation; and translational controls that modulate the efficiency of protein synthesis. Epigenetic modifications, like DNA methylation and histone acetylation, further influence chromatin accessibility, thereby fine-tuning gene activity without altering the underlying DNA sequence. These regulatory layers ensure precise spatiotemporal control, allowing multicellular organisms to develop diverse cell types from a single genome; for instance, neurons express genes for neurotransmitter receptors, while muscle cells prioritize those for contractile proteins.

The study and manipulation of gene expression have profound implications for biology and medicine. Dysregulated expression underlies numerous diseases, including cancers driven by oncogene activation or tumor suppressor silencing, and genetic disorders like cystic fibrosis resulting from mutations in the CFTR gene. Techniques such as RNA sequencing and CRISPR-based editing have revolutionized the ability to profile and alter expression patterns, facilitating insights into development, evolution, and therapeutic interventions. Ultimately, gene expression bridges genotype to phenotype across all organisms.

Overview

Definition and importance

Gene expression is the process by which the information encoded in a gene's DNA sequence is converted into a functional product, primarily through the synthesis of RNA and proteins. This involves two main steps: transcription, where the DNA sequence is copied into messenger RNA (mRNA), and translation, where the mRNA sequence is decoded to produce a polypeptide chain that folds into a functional protein. The concept is encapsulated in the central dogma of molecular biology, proposed by Francis Crick, which posits that genetic information flows unidirectionally from DNA to RNA to protein, ensuring the faithful transmission and utilization of genetic instructions within cells. While this framework holds for most cellular processes, exceptions exist, such as reverse transcription in retroviruses, where RNA serves as a template for DNA synthesis.

Gene expression operates across multiple levels, extending beyond protein-coding genes to include the production of non-coding RNAs (ncRNAs), which are not translated into proteins but play crucial regulatory roles. These ncRNAs, such as microRNAs and long non-coding RNAs, modulate gene activity by influencing transcription, RNA stability, and chromatin structure, thereby fine-tuning cellular responses. The overall process thus encompasses the journey from DNA transcription to RNA maturation and, where applicable, protein synthesis.

Gene expression underpins nearly every aspect of cellular and organismal function, from development and differentiation to environmental adaptation. By selectively activating or repressing specific genes, cells differentiate into specialized types, such as neurons or muscle cells, despite sharing the same genome. For instance, Hox genes, which encode a family of transcription factors, are expressed in precise spatial and temporal patterns during embryonic development to direct body patterning and segmentation in animals. Dysregulation of gene expression can lead to diseases like cancer, underscoring its essential role in maintaining physiological balance and responding to stimuli.

Historical development

The foundations of gene expression were laid in the first half of the 20th century through experiments linking genes to biochemical functions. In 1941, George Beadle and Edward Tatum proposed the "one gene-one enzyme" hypothesis based on their studies of Neurospora mutants, demonstrating that specific genes direct the production of individual enzymes involved in metabolic pathways. This idea built on earlier genetic work but shifted focus toward molecular mechanisms. Three years later, in 1944, Oswald Avery, Colin MacLeod, and Maclyn McCarty provided crucial evidence that DNA serves as the genetic material by showing that purified DNA from virulent pneumococci could transform non-virulent strains, ruling out proteins as the transforming principle.

The molecular era began with the elucidation of DNA's structure in 1953 by James Watson and Francis Crick, who described the double helix and its base-pairing rules, implying a mechanism for genetic information storage and replication that underpins gene expression. This paved the way for understanding how genes are read. In 1961, François Jacob and Jacques Monod introduced the concept of messenger RNA (mRNA) as an intermediary carrying genetic instructions from DNA to ribosomes for protein synthesis, detailed in their seminal paper on genetic regulation. That same year, Jacob and Monod proposed the lac operon model in E. coli, illustrating how genes are coordinately regulated through proteins that control transcription in response to environmental signals like lactose. Concurrently, Marshall Nirenberg and J. Heinrich Matthaei cracked the first codon of the genetic code by using synthetic poly-uridylic acid RNA to direct incorporation of phenylalanine into proteins, revealing that UUU specifies phenylalanine and establishing RNA's role in translation.

Subsequent decades revealed greater complexity, particularly in eukaryotes. In 1977, Phillip Sharp and Richard Roberts independently discovered introns (non-coding sequences interrupting eukaryotic genes) through electron microscopy of adenovirus RNA hybrids with DNA, showing that pre-mRNA is spliced to form mature mRNA. This finding challenged the continuity assumed from prokaryotic models and highlighted RNA processing as a key step in gene expression. Later milestones included the 1998 discovery of RNA interference (RNAi) by Andrew Fire and Craig Mello, who demonstrated that double-stranded RNA triggers sequence-specific degradation of homologous mRNAs in C. elegans, unveiling a natural mechanism for post-transcriptional gene silencing. From 2012 onward, the adaptation of CRISPR-Cas9 by Martin Jinek, Jennifer Doudna, Emmanuelle Charpentier, and colleagues enabled precise manipulation of gene expression by targeting and editing DNA sequences, revolutionizing studies of regulatory elements. These advances marked a progression from prokaryotic simplicity to eukaryotic intricacies, transforming gene expression from an abstract genetic concept into a manipulable molecular process.

Molecular mechanisms

Transcription

Transcription is the first stage of gene expression, in which the genetic information encoded in DNA is copied into RNA, such as messenger RNA (mRNA), by the enzyme RNA polymerase. The process is template-dependent: RNA polymerase synthesizes an RNA strand complementary to one of the DNA strands, following the base-pairing rules, except that adenine (A) in the template pairs with uracil (U) in RNA instead of thymine (T). Transcription converts the stable DNA blueprint into a transient RNA molecule that can be used for protein synthesis or other cellular functions.

In prokaryotes, such as bacteria, transcription is carried out by a single type of RNA polymerase, a multi-subunit enzyme whose core consists of five subunits (two α, one β, one β', and one ω) and catalyzes RNA synthesis. The core enzyme requires a sigma (σ) factor to form the holoenzyme, which enables specific promoter recognition. The primary σ factor, σ70 in Escherichia coli, binds to conserved promoter sequences, including the -10 box (TATAAT consensus) and the -35 box (TTGACA consensus), facilitating the initial binding of RNA polymerase to DNA. Different sigma factors allow recognition of alternative promoters, enabling responses to environmental changes.

In eukaryotes, three distinct RNA polymerases handle transcription: RNA polymerase I (Pol I) synthesizes ribosomal RNA, RNA polymerase III (Pol III) produces transfer RNA and small RNAs, and RNA polymerase II (Pol II) transcribes mRNA and some non-coding RNAs. For mRNA synthesis, Pol II, a large complex with 12 subunits, relies on general transcription factors (GTFs) for promoter recognition and assembly of the pre-initiation complex (PIC). The core promoter often includes the TATA box (TATAAA consensus, located about 25-30 base pairs upstream of the transcription start site), to which the TATA-binding protein (TBP, a subunit of TFIID) binds, bending the DNA and recruiting other GTFs such as TFIIA, TFIIB, TFIIE, TFIIF, and TFIIH. TFIIH's helicase activity unwinds the DNA to form the open complex.

The transcription process consists of three main phases: initiation, elongation, and termination. Initiation begins with promoter recognition and DNA unwinding to form the open complex, followed by synthesis of the first few RNA nucleotides. In prokaryotes, the polymerase may synthesize and release several short transcripts without clearing the promoter (abortive initiation); in eukaryotes, initiation follows formation of a stable PIC. In prokaryotes, the sigma factor dissociates shortly after initiation, allowing the core enzyme to proceed; in eukaryotes, Pol II enters a promoter-proximal paused state before full clearance, regulated by factors like NELF and DSIF.

During elongation, RNA polymerase moves along the DNA template at an average rate of approximately 40-50 nucleotides per second in prokaryotes and 20-40 nucleotides per second in eukaryotes, adding ribonucleotides to the growing 3' end of the RNA chain so that the RNA is synthesized in the 5' to 3' direction. The enzyme maintains high fidelity through kinetic proofreading and induced-fit mechanisms, achieving an error rate of about 10^{-4} to 10^{-5} errors per nucleotide incorporated, which is lower than expected from base-pairing alone. In prokaryotes, elongation is coupled with translation, as ribosomes can bind nascent mRNA while it is still being transcribed, whereas in eukaryotes, transcription occurs in the nucleus, separated from translation in the cytoplasm.

Termination signals the end of RNA synthesis and the release of the transcript and polymerase. In prokaryotes, two main mechanisms exist: rho-independent termination, where a GC-rich hairpin forms in the RNA followed by a poly-U tract that weakens RNA-DNA interactions, and rho-dependent termination, involving the Rho factor, which translocates along the RNA and disrupts the elongation complex. In eukaryotes, Pol II termination is linked to the polyadenylation signal (AAUAAA) in the pre-mRNA, which triggers cleavage and poly(A) tail addition, followed by the torpedo mechanism, in which the Rat1 exonuclease degrades the downstream RNA, leading to polymerase release.
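
Two ideas from this section, template-directed base pairing and recognition of the bacterial -10 promoter consensus, can be illustrated with a short Python sketch. It is a toy model under stated assumptions: sequences are plain strings, the template is supplied in the 3' to 5' direction so that the RNA reads 5' to 3', and the function names, the example sequences, and the one-mismatch tolerance are hypothetical choices for illustration.

```python
# Pairing of each DNA template base with the ribonucleotide added to the RNA.
TEMPLATE_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe(template_3_to_5):
    """Return the RNA (5'->3') complementary to a DNA template given 3'->5'."""
    return "".join(TEMPLATE_TO_RNA[base] for base in template_3_to_5.upper())

def find_consensus(coding_strand, consensus="TATAAT", max_mismatches=1):
    """Scan a DNA strand for windows resembling a promoter consensus.

    Returns a list of (position, window, mismatches) tuples.
    """
    hits = []
    width = len(consensus)
    for i in range(len(coding_strand) - width + 1):
        window = coding_strand[i:i + width]
        mismatches = sum(a != b for a, b in zip(window, consensus))
        if mismatches <= max_mismatches:
            hits.append((i, window, mismatches))
    return hits

rna = transcribe("TACGGATTC")
print(rna)

hits = find_consensus("GGCTATAATGCG")
print(hits)
```

Real promoter recognition depends on the sigma factor, the spacing between the -35 and -10 boxes, and the surrounding sequence, none of which this string search attempts to model.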

RNA processing and maturation

In eukaryotic cells, RNA processing and maturation occur co-transcriptionally and post-transcriptionally to convert primary transcripts, known as pre-mRNAs, into functional mature RNAs capable of export from the nucleus and subsequent use in the cytoplasm. This multifaceted process ensures the removal of non-coding sequences, the addition of protective modifications, and quality surveillance to prevent the accumulation of defective molecules. Key steps include 5' capping, 3' polyadenylation, splicing, and specific maturation pathways for non-coding RNAs, culminating in nuclear export through dedicated transport receptors.

The 5' capping of pre-mRNA involves the addition of a 7-methylguanosine (m7G) cap structure to the first nucleotide via a 5'-5' triphosphate linkage, occurring shortly after transcription initiation by RNA polymerase II. This modification is catalyzed by three enzymatic activities: RNA triphosphatase removes the gamma phosphate, guanylyltransferase adds GMP, and guanine-7-methyltransferase methylates the guanine at the N7 position. The cap enhances mRNA stability by protecting against 5' exonucleases and facilitates translation initiation by recruiting eukaryotic initiation factor 4E (eIF4E) in the cytoplasm.

Polyadenylation at the 3' end entails cleavage of the pre-mRNA downstream of a signal (typically AAUAAA) followed by the addition of a poly(A) tail consisting of 50-250 adenosine residues. This tail is synthesized by poly(A) polymerase, which adds adenosine residues from ATP without a template, in coordination with the cleavage and polyadenylation specificity factor (CPSF) and cleavage stimulation factor (CstF). The poly(A) tail promotes mRNA export from the nucleus, enhances stability by impeding 3' exonucleolytic degradation, and supports translation by interacting with poly(A)-binding protein (PABP), which circularizes the mRNA via cap-PABP bridging.

Splicing removes introns and joins exons through the action of the spliceosome, a large ribonucleoprotein complex assembled stepwise on pre-mRNA introns marked by conserved 5' and 3' splice sites, a branch point, and a polypyrimidine tract. The spliceosome, comprising the U1, U2, U4/U6, and U5 small nuclear ribonucleoproteins (snRNPs), catalyzes two transesterification reactions: the branch point adenosine attacks the 5' splice site to form a lariat intermediate, followed by 3' splice site cleavage and exon ligation. Alternative splicing, where different exon combinations are selected, generates multiple mRNA isoforms from a single gene, expanding proteomic diversity; up to 95% of human multi-exon genes undergo this process, enabling tissue-specific and developmental regulation.

Maturation of non-coding RNAs follows specialized pathways distinct from mRNA processing. Ribosomal RNA (rRNA) precursors are transcribed by RNA polymerase I and processed in the nucleolus, where small nucleolar ribonucleoproteins (snoRNPs) guide chemical modifications (box C/D snoRNPs direct 2'-O-methylation and box H/ACA snoRNPs direct pseudouridylation) while facilitating cleavage at specific sites to yield mature 18S, 5.8S, and 28S rRNAs. Transfer RNA (tRNA) maturation, occurring in both nucleus and cytoplasm, involves trimming of the 5' and 3' extensions by RNase P and other nucleases, followed by the template-independent addition of a CCA sequence to the 3' terminus by tRNA nucleotidyltransferase, which is essential for aminoacylation by aminoacyl-tRNA synthetases.

Quality control mechanisms, such as nonsense-mediated mRNA decay (NMD), degrade aberrant transcripts harboring premature termination codons (PTCs) located more than 50-55 nucleotides upstream of an exon-exon junction. NMD is triggered during pioneer rounds of translation when the ribosome encounters a PTC, recruiting up-frameshift proteins (UPF1, UPF2, UPF3) and the exon junction complex (EJC) to mark the mRNA for rapid degradation by endonucleases and exonucleases, thereby preventing the synthesis of truncated, potentially harmful proteins. This surveillance pathway targets approximately 5-10% of human transcripts under normal conditions, including those arising from splicing errors.

In eukaryotes, mature mRNAs are exported from the nucleus to the cytoplasm through nuclear pore complexes via receptor-mediated transport. The primary export receptor for most bulk mRNAs is NXF1 (TAP), which binds the mRNA via adaptor proteins like ALY/REF and interacts with nucleoporins; however, certain transcripts, such as unspliced viral mRNAs or specific cellular mRNAs, use exportins like CRM1 (exportin 1), which recognizes leucine-rich nuclear export signals in the presence of Ran-GTP to facilitate selective export. This export step is tightly coupled to prior processing events, ensuring that only properly capped, polyadenylated, and spliced RNAs are transported.
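
The positional rule for NMD stated in this section (a termination codon more than roughly 50-55 nucleotides upstream of an exon-exon junction) can be expressed as a small Python check. This is a simplified sketch: it works in spliced-mRNA coordinates, uses a 50-nucleotide cut-off as an assumed default, and ignores every other determinant of NMD; the function names and the example transcript are hypothetical.

```python
def exon_junctions(exon_lengths):
    """Positions of exon-exon junctions in spliced-mRNA coordinates."""
    junctions = []
    position = 0
    for length in exon_lengths[:-1]:  # no junction after the final exon
        position += length
        junctions.append(position)
    return junctions

def predicted_nmd_target(stop_codon_end, exon_lengths, min_distance=50):
    """Apply the positional rule: is the stop codon far enough upstream of a junction?

    Checking the last junction suffices, because it is the one farthest downstream.
    """
    junctions = exon_junctions(exon_lengths)
    if not junctions:
        return False  # single-exon transcript: no junction to trigger the rule
    return junctions[-1] - stop_codon_end > min_distance

# Hypothetical transcript with three exons of 200, 150, and 300 nucleotides.
exons = [200, 150, 300]
print(exon_junctions(exons))
print(predicted_nmd_target(250, exons))  # stop codon 100 nt upstream of the last junction
print(predicted_nmd_target(320, exons))  # stop codon only 30 nt upstream
print(predicted_nmd_target(500, exons))  # stop codon in the last exon
```

A normal stop codon usually lies in the last exon, so it has no downstream junction and escapes the rule, which is the third case above.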

Translation

Translation is the process by which the genetic information encoded in messenger RNA (mRNA) is decoded to synthesize proteins on ribosomes. This step occurs in the cytoplasm of prokaryotes and eukaryotes, using the genetic code to specify the sequence of amino acids in the polypeptide chain. The core components involved include ribosomes, transfer RNAs (tRNAs), and aminoacyl-tRNA synthetases. Ribosomes consist of two subunits: in prokaryotes, the small 30S subunit and large 50S subunit assemble into the 70S ribosome, while in eukaryotes, the 40S and 60S subunits form the 80S ribosome. tRNAs serve as adaptor molecules that carry specific amino acids to the ribosome, with their anticodon regions base-pairing to mRNA codons; there are typically 20 aminoacyl-tRNA synthetases, one for each amino acid, that catalyze the attachment of amino acids to their cognate tRNAs with high specificity.

The genetic code, elucidated through experiments using synthetic polynucleotides in cell-free systems, comprises 64 codons (triplet sequences of the four bases A, U, G, and C) that specify the 20 standard amino acids and three stop signals. The code exhibits degeneracy, meaning multiple codons can encode the same amino acid, primarily differing in the third position, which reduces the impact of certain mutations. This degeneracy is explained by the wobble hypothesis, which posits that non-standard base pairing (wobble) at the third position of the codon-anticodon interaction allows a single tRNA to recognize multiple synonymous codons. The code is nearly universal across organisms, but exceptions exist, such as in mammalian mitochondria, where AUA codes for methionine instead of isoleucine and UGA specifies tryptophan rather than acting as a stop codon.

Translation proceeds in three main stages: initiation, elongation, and termination. Initiation begins with the assembly of the ribosome on the mRNA at the start codon, AUG, which codes for methionine. In prokaryotes, the small ribosomal subunit binds to the Shine-Dalgarno sequence, a purine-rich region 4–9 nucleotides upstream of the AUG, facilitating precise positioning via complementarity to the 3' end of 16S rRNA; the initiator tRNA, charged with formyl-methionine, then binds to the P site. In eukaryotes, the 40S subunit, along with initiation factors, binds near the 5' cap of the mRNA and scans downstream to the first AUG in a favorable context defined by the Kozak consensus sequence (typically GCCRCCAUGG, where R is a purine), after which the initiator tRNA (Met-tRNAi) associates and the 60S subunit joins to form the complete 80S ribosome.

During elongation, the ribosome moves along the mRNA in the 5' to 3' direction, incorporating amino acids sequentially. Aminoacyl-tRNAs enter the A site of the ribosome, where codon-anticodon matching triggers GTP hydrolysis by elongation factor EF-Tu (in prokaryotes) or eEF1A (in eukaryotes) for proofreading; accurate matches proceed to peptide bond formation catalyzed by the peptidyl transferase center (PTC), a ribozyme activity residing in the 23S rRNA (prokaryotes) or 28S rRNA (eukaryotes) of the large subunit. The nascent peptide chain is transferred from the P-site tRNA to the amino acid in the A site, forming a new peptide bond, after which elongation factor EF-G (prokaryotes) or eEF2 (eukaryotes), powered by GTP hydrolysis, translocates the tRNAs to the P and E sites, advancing the mRNA by one codon and ejecting the deacylated tRNA from the E site. This cycle repeats at an average rate of approximately 20 amino acids per second in prokaryotes under optimal conditions.

Termination occurs when a stop codon (UAA, UAG, or UGA) enters the A site, for which there is no corresponding tRNA. In prokaryotes, release factors RF1 (recognizing UAA/UAG) or RF2 (recognizing UAA/UGA) bind, mimicking tRNA structure to trigger hydrolysis of the ester bond linking the completed polypeptide to the P-site tRNA via the PTC, releasing the protein; RF3, a GTPase, then facilitates dissociation of RF1/RF2. Ribosome recycling follows, mediated by the ribosome recycling factor (RRF) and EF-G, which split the ribosomal subunits and release the mRNA for reuse in new rounds of translation.

The fidelity of translation is maintained through multiple mechanisms, achieving an error rate of about 10^{-4} incorrect amino acids per codon, primarily via initial selection accuracy, GTPase-activated proofreading during tRNA accommodation, and accurate translocation. This low error rate ensures functional proteins despite the speed of the process. Antibiotics like puromycin target translation by mimicking aminoacyl-tRNA and prematurely terminating elongation through non-specific peptide bond formation in the PTC.
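
The decoding logic described in this section (64 triplet codons, an AUG start codon, three stop codons, and degeneracy) can be illustrated with a compact Python sketch of the standard genetic code. It is deliberately simplified: it starts at the first AUG it finds and ignores Shine-Dalgarno and Kozak context, initiation factors, and alternative codes such as the mitochondrial one. The function name and the example mRNA are hypothetical.

```python
from itertools import product

# Standard genetic code, listed with bases in the order U, C, A, G at each
# codon position; "*" marks the three stop codons.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(mrna):
    """Translate from the first AUG to the first in-frame stop codon.

    Returns the protein as one-letter amino acid codes.
    """
    mrna = mrna.upper()
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break  # stop codon: release the polypeptide
        protein.append(amino_acid)
    return "".join(protein)

protein = translate("GGAUGUUUGCAUAAGG")
print(protein)

# Degeneracy: count how many codons encode leucine (L) and how many are stops.
leucine_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "L")
stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
print(leucine_codons, stop_codons)
```

Most synonymous codons in the table differ only in their third base, which is the pattern the wobble hypothesis accounts for.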

Regulation of gene expression

Transcriptional regulation

Transcriptional regulation governs the initiation and rate of RNA synthesis from DNA templates, primarily through the coordinated action of cis-regulatory elements and trans-acting factors that assemble at promoters. In both prokaryotes and eukaryotes, this process ensures precise control of gene expression in response to cellular needs, with core promoters serving as the primary sites for RNA polymerase recruitment and distal enhancers providing additional regulatory input via long-range interactions.

Promoters consist of core elements, such as the TATA box in eukaryotes or the -10 and -35 boxes in prokaryotes, which position the basal transcription machinery, while enhancers are distal DNA sequences that boost transcription when bound by specific factors. Enhancers can loop to promoters over distances of up to megabases, facilitated by the architectural proteins CTCF and cohesin, which stabilize chromatin loops to bring enhancers into proximity with target genes. This looping mechanism enhances promoter activity by concentrating activators and co-factors at the transcription start site, as demonstrated in studies of developmental genes where CTCF-cohesin depletion disrupts enhancer-promoter contacts without fully abolishing transcription.

Transcription factors (TFs) are proteins that bind DNA to modulate RNA polymerase activity, divided into general TFs required for basal transcription and specific TFs that confer regulatory specificity. General TFs, like TBP (TATA-binding protein), recognize core promoter motifs and recruit RNA polymerase II (Pol II) in eukaryotes, forming the pre-initiation complex essential for all Pol II-dependent genes. Specific TFs, such as p53, bind to cognate DNA sequences in enhancers or promoters to activate or repress target genes in response to signals like DNA damage; p53's transactivation domains interact with co-activators to stimulate transcription, while its repression domains can inhibit transcription via interactions with components of the general machinery. These domains often mediate protein-protein contacts, enabling TFs to recruit or block the transcriptional apparatus.

The Mediator complex acts as a central hub, bridging specific TFs bound at enhancers and promoters to the core Pol II machinery at promoters. Composed of over 20 subunits, Mediator integrates signals from diverse TFs, stabilizing the pre-initiation complex and promoting phosphorylation of Pol II's C-terminal domain to drive elongation. It collaborates with co-activators, including histone acetyltransferases like p300/CBP, which modify chromatin to facilitate access while Mediator coordinates the overall response.

In prokaryotes, transcriptional regulation often occurs through operons, clusters of genes transcribed as a single mRNA under coordinated control. The lac operon exemplifies inducible regulation: in the absence of lactose, the LacI repressor binds the operator, blocking RNA polymerase access; inducer binding to LacI relieves repression, allowing transcription of the genes for lactose metabolism. The trp operon demonstrates repressible control: high tryptophan levels activate the TrpR repressor to bind the operator, halting synthesis of the tryptophan biosynthetic enzymes. Attenuation additionally fine-tunes expression via a leader sequence in the mRNA: when tryptophan is scarce, the ribosome stalls at tryptophan codons in the leader, an antiterminator structure forms, and transcription of the operon continues, whereas ample tryptophan lets the ribosome proceed, a terminator hairpin forms, and transcription stops prematurely. These mechanisms highlight how prokaryotes achieve rapid, resource-efficient gene control without the chromatin-level complexity of eukaryotes.

Eukaryotic transcriptional regulation is more intricate, relying on combinatorial control where multiple TFs integrate signals at enhancers and promoters to dictate tissue-specific expression patterns. For instance, the myogenic factor MyoD, a basic helix-loop-helix TF, binds E-box motifs in muscle-specific enhancers to activate genes like those for contractile proteins, cooperating with other factors such as MEF2 to establish the myogenic program during development and regeneration. This combinatorial logic allows a limited repertoire of TFs to generate diverse outcomes, with MyoD's activity modulated by partnerships that enhance chromatin opening and Pol II recruitment in myoblasts but not in other cell types.

Recent work indicates that many TFs drive transcriptional activation through biomolecular phase separation, forming liquid-like condensates via intrinsically disordered regions (IDRs). These IDR-driven condensates concentrate TFs, Mediator, and Pol II at super-enhancers, creating hubs that amplify signaling and enhance transcription efficiency, as shown for OCT4 and GCN4, where condensate formation correlates with activation potency. Post-2010 studies, including those on coactivator condensates at enhancers, suggest that phase separation provides a physical basis for selective activation, linking TF multivalency to the spatial organization of transcription. Epigenetic marks can influence chromatin accessibility to support these interactions.
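
The contrast between inducible and repressible operon control described in this section can be reduced to two Boolean rules. The following Python sketch is a deliberately minimal abstraction that models only repressor binding at the operator; it ignores attenuation, catabolite repression, and expression levels, and the function and variable names are hypothetical.

```python
def lac_operon_on(lactose_present: bool) -> bool:
    """Inducible control: LacI sits on the operator unless an inducer is present."""
    laci_on_operator = not lactose_present
    return not laci_on_operator

def trp_operon_on(tryptophan_high: bool) -> bool:
    """Repressible control: TrpR binds the operator only when tryptophan is abundant."""
    trpr_on_operator = tryptophan_high
    return not trpr_on_operator

for present in (False, True):
    print(f"lactose={present}: lac operon transcribed = {lac_operon_on(present)}")
for high in (False, True):
    print(f"tryptophan high={high}: trp operon transcribed = {trp_operon_on(high)}")
```

In both cases the repressor blocks transcription when bound. The difference lies in whether the small molecule removes the repressor from the operator (lac) or enables it to bind (trp).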

Epigenetic modifications

Epigenetic modifications encompass heritable changes to DNA and chromatin that do not alter the underlying nucleotide sequence but profoundly influence gene expression patterns by modulating chromatin accessibility and transcriptional activity. These modifications include DNA methylation, histone tail alterations, chromatin remodeling, and the involvement of non-coding RNAs, all of which contribute to stable, long-term regulation of gene activity during development, cell differentiation, and response to environmental cues. Unlike sequence-specific transcriptional controls, epigenetic mechanisms often establish broad, heritable states of gene repression or activation that can persist across cell divisions. DNA methylation primarily occurs at the fifth carbon of cytosine residues (5-methylcytosine, or 5mC) within CpG dinucleotides, which are symmetrically distributed in promoter-proximal CpG islands of approximately 60% of genes. This modification is catalyzed by enzymes (DNMTs), including , which maintains methylation patterns during , and de novo methyltransferases DNMT3A and DNMT3B, which establish new methylation marks. Hypermethylation of CpG islands typically leads to by inhibiting binding and recruiting repressive complexes such as methyl-CpG-binding protein 2 (MeCP2), which in turn compact . A key example is , where parent-of-origin-specific silences one of imprinted genes like IGF2, ensuring monoallelic expression critical for embryonic development. Histone modifications involve covalent attachments to the N-terminal tails of proteins, altering structure and serving as docking sites for regulatory proteins. , mediated by histone acetyltransferases (HATs) such as p300/CBP, neutralizes positive charges on residues (e.g., H3K9ac, H3K27ac), promoting open () and facilitating transcriptional by recruiting co-activators. 
In contrast, histone methylation by histone methyltransferases (HMTs) can either activate or repress genes depending on the site and degree; for instance, trimethylation of histone H3 at lysine 27 (H3K27me3), catalyzed by the EZH2 subunit of Polycomb Repressive Complex 2 (PRC2), mediates transcriptional repression by compacting chromatin and blocking activator access. Phosphorylation, often at serine or threonine residues (e.g., H3S10ph), is associated with chromosome condensation during mitosis but can also enhance transcription in interphase by disrupting chromatin interactions. The histone code hypothesis posits that these modifications form combinatorial patterns that specify distinct chromatin states and recruit specific effector proteins to regulate gene expression.

Chromatin remodeling complexes dynamically alter nucleosome positioning to control DNA accessibility for transcription. The SWI/SNF family, prototypical ATP-dependent remodelers, uses the energy from ATP hydrolysis to slide, eject, or restructure nucleosomes, thereby exposing or occluding promoter regions. In yeast and mammals, SWI/SNF complexes (e.g., BAF in humans) are recruited to enhancers and promoters, where they facilitate transcription factor binding and counteract repressive modifications to activate gene expression during development and differentiation.

Non-coding RNAs play a pivotal role in epigenetic silencing by guiding chromatin-modifying complexes to target loci. The long non-coding RNA Xist exemplifies this in X-chromosome inactivation, where it coats the inactive X chromosome in female mammals, recruiting PRC2 for H3K27me3 deposition and DNMT3A for DNA methylation, resulting in stable repression of X-linked genes to achieve dosage compensation.

DNA demethylation counteracts methylation to reactivate genes, particularly during development and cellular reprogramming. This process occurs via passive mechanisms, where failure of maintenance methylation during replication dilutes 5mC over successive divisions, or via active pathways involving ten-eleven translocation (TET) enzymes (TET1-3), which oxidize 5mC to 5-hydroxymethylcytosine (5hmC) and further to 5-formylcytosine (5fC) and 5-carboxylcytosine (5caC), facilitating their removal by base excision repair.
TET proteins are essential for pluripotency and lineage specification, as their loss impairs global demethylation waves in early embryos.

Aberrant epigenetic modifications are implicated in diseases, notably cancer, where promoter hypermethylation silences tumor suppressor genes. For example, hypermethylation of the BRCA1 promoter in ovarian and breast cancers impairs homologous recombination repair and sensitizes tumors to poly(ADP-ribose) polymerase inhibitors, with recent studies confirming its prognostic value in patient stratification. Loss-of-function mutations in TET2, leading to hypermethylation, drive myeloid malignancies, prompting therapeutic exploration of TET modulators to restore demethylation and improve patient outcomes.

Post-transcriptional regulation

Post-transcriptional regulation encompasses a suite of mechanisms that modulate gene expression after RNA transcription, primarily by influencing mRNA stability, processing, localization, and translation readiness, thereby fine-tuning protein output without altering transcription rates. These processes occur in the nucleus and cytoplasm, involving RNA-binding proteins (RBPs), non-coding RNAs, and enzymatic modifications that determine the fate of individual transcripts. By controlling mRNA half-lives, which can vary from minutes to days depending on sequence elements and cellular context, post-transcriptional regulation enables rapid and tissue-specific responses to environmental cues, such as stress or developmental signals. This range reflects a roughly 1000-fold variation in deadenylation rates, which directly impacts steady-state mRNA levels.

A key aspect of mRNA stability involves cis-regulatory elements in the 3' untranslated region (UTR), such as AU-rich elements (AREs), which promote rapid decay when bound by destabilizing factors. AREs, characterized by clusters of adenine and uracil residues, trigger deadenylation (the progressive shortening of the poly(A) tail) and subsequent exonucleolytic degradation, often limiting the lifespan of transcripts encoding cytokines or proto-oncogenes to prevent excessive inflammation or proliferation. The poly(A)-specific ribonuclease (PARN) plays a central role in this process by catalyzing deadenylation, particularly for mRNAs with short poly(A) tails or those targeted for rapid turnover, thereby integrating stability control with translational efficiency.

Alternative processing events, including alternative splicing and alternative polyadenylation, further diversify post-transcriptional outcomes by generating tissue-specific mRNA isoforms from a single pre-mRNA.
Alternative splicing assembles variable exon combinations, producing isoforms with distinct stability, localization, or function; for example, in cancer, variable splicing of the CD44 gene yields isoforms linked to tumor progression, with v6-containing variants overexpressed in breast and colorectal tumors to promote invasion. Similarly, alternative polyadenylation selects different polyadenylation sites, altering 3' UTR length and thereby modulating miRNA accessibility or RBP binding, which can shift isoform stability across tissues such as liver versus brain. These mechanisms allow a single gene to yield dozens of functional variants, contributing to cellular diversity and disease pathology.

RNA-binding proteins (RBPs) serve as versatile executors of post-transcriptional control, binding specific sequence motifs to either stabilize or destabilize mRNAs and influence their localization within the cell. The RBP HuR (human antigen R), for instance, binds AREs in the 3' UTR to protect transcripts such as those for growth factors from degradation, extending their half-life and promoting translation during proliferation or stress responses. In contrast, tristetraprolin (TTP) competes for the same AREs, recruiting deadenylation complexes to accelerate mRNA decay, as seen in the rapid turnover of pro-inflammatory cytokines like TNF-α to resolve immune responses. Beyond stability, RBPs like zipcode-binding protein 1 facilitate mRNA localization to subcellular compartments, such as dendrites in neurons, ensuring localized protein synthesis for synaptic plasticity. Dysregulation of these RBPs, such as HuR overexpression in tumors, can tip the balance toward pathological expression profiles.

MicroRNAs (miRNAs) represent a major class of post-transcriptional regulators, with over 60% of human protein-coding genes harboring conserved binding sites that fine-tune expression by repressing translation or promoting decay.
miRNA biogenesis begins with transcription of primary miRNAs (pri-miRNAs), which are processed in the nucleus by Drosha and DGCR8 into precursor hairpins (pre-miRNAs), followed by cytoplasmic cleavage by Dicer to yield mature ~22-nucleotide duplexes; one strand then loads into the Argonaute (AGO) protein within the RNA-induced silencing complex (RISC) to guide targeting. miRNAs typically bind the 3' UTR of target mRNAs via partial base-pairing, with the "seed" (nucleotides 2-8 of the miRNA) providing specificity; this interaction recruits deadenylation machinery or blocks ribosomal scanning, reducing protein levels by up to 50-70% for most targets. In development and cancer, miRNAs like miR-21 stabilize oncogenic networks by downregulating tumor suppressors, highlighting their broad regulatory scope.

RNA editing, particularly adenosine-to-inosine (A-to-I) editing by ADAR enzymes, introduces sequence changes that alter mRNA stability, splicing, or coding potential post-transcriptionally. ADAR1 and ADAR2 catalyze A-to-I conversions, with inosine read as guanosine during translation, primarily in the brain, where editing recodes ~2-3% of adenosines in neuronal transcripts; for example, editing of glutamate receptor subunits like GluA2 modulates calcium permeability, preventing excitotoxicity. Aberrant editing is linked to neurological disorders such as amyotrophic lateral sclerosis (ALS), where reduced ADAR2 activity destabilizes transcripts or generates toxic isoforms, underscoring the role of editing in neuronal health. These modifications expand transcript diversity without genomic changes, with implications for neurodevelopment and disease resilience.

Long non-coding RNAs (lncRNAs) also contribute to post-transcriptional regulation, often acting in cis to modulate nearby mRNA processing, stability, or localization through direct base-pairing or RBP recruitment. For instance, the lncRNA HOTAIR, transcribed from the HOXC locus, influences post-transcriptional events by interacting with protein complexes that affect splicing or decay of metastasis-associated genes, with elevated levels in cancer promoting isoform shifts that enhance invasiveness.
Post-2015 studies have expanded understanding of lncRNA cis-actions, revealing mechanisms like Xist-mediated silencing of X-chromosome genes via localized mRNA stabilization control, integrating lncRNAs into dynamic regulatory networks. These elements highlight lncRNAs' emerging role in fine-tuning expression beyond transcriptional control.
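The seed-pairing rule described above for miRNAs (nucleotides 2-8 of the miRNA pairing with the 3' UTR) can be illustrated with a minimal Python sketch. It assumes perfect Watson-Crick pairing only and ignores G:U wobble, site context, and conservation, all of which real target-prediction tools consider. The miR-21-5p sequence is quoted as commonly reported and should be checked against miRBase before reuse; `TOY_UTR` and the function names are hypothetical and chosen only for illustration.

```python
# Watson-Crick complements for RNA bases (G:U wobble pairs are ignored here).
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def seed_site(mirna: str) -> str:
    """Return the mRNA sequence (5'->3') that pairs with miRNA nucleotides 2-8."""
    seed = mirna[1:8]  # positions 2..8 in 1-based numbering
    # The target site is the reverse complement of the seed.
    return "".join(COMPLEMENT[base] for base in reversed(seed))


def find_seed_sites(utr: str, mirna: str) -> list[int]:
    """Return 0-based start positions of perfect seed matches in a 3' UTR."""
    site = seed_site(mirna)
    width = len(site)
    return [i for i in range(len(utr) - width + 1) if utr[i:i + width] == site]


# Assumed mature miR-21-5p sequence (verify against miRBase).
MIR21 = "UAGCUUAUCAGACUGAUGUUGA"
# Made-up UTR fragment containing two candidate sites.
TOY_UTR = "GGGAUAAGCUCCCAAAUAAGCUGG"

if __name__ == "__main__":
    print(seed_site(MIR21))                 # AUAAGCU
    print(find_seed_sites(TOY_UTR, MIR21))  # [3, 15]
```

A perfect seed match is neither necessary nor sufficient for repression, so a scan like this is best read as a first-pass filter rather than a prediction.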

Translational and post-translational regulation

Translational regulation controls the efficiency and specificity of protein synthesis from mature mRNA, primarily at the initiation stage where ribosomes assemble on the mRNA. One key mechanism involves the phosphorylation of eukaryotic initiation factor 2 (eIF2), which inhibits global translation during cellular stress such as the unfolded protein response; for instance, PERK kinase phosphorylates eIF2α to reduce ternary complex formation, thereby attenuating initiation while selectively allowing translation of stress-response genes like ATF4. Internal ribosome entry sites (IRES) provide an alternative cap-independent initiation pathway, enabling translation under conditions where cap-dependent scanning is impaired, as seen in viral mRNAs and certain cellular transcripts like those encoding HIF-1α during hypoxia. Upstream open reading frames (uORFs) in the 5' untranslated region (UTR) of mRNAs often repress translation of the main ORF by capturing scanning ribosomes before they reach it, with polymorphic uORFs contributing to inter-individual variation in protein expression levels.

Ribosome profiling, introduced in 2009, has revolutionized the study of translation by sequencing ribosome-protected mRNA fragments, revealing translation rates, pausing events, and the impact of regulatory elements at codon-level resolution across the transcriptome. This technique has shown, for example, that uORFs and IRES elements modulate translation efficiency in response to environmental cues, providing quantitative insights into how stress or nutrients alter proteome composition without changing mRNA levels. MicroRNAs (miRNAs), while primarily acting post-transcriptionally on mRNA stability, can also repress translation by interfering with initiation or elongation once ribosomes engage the mRNA.

Post-translational regulation fine-tunes protein function, localization, and degradation after synthesis, often through covalent modifications that respond to cellular signals.
Phosphorylation, catalyzed by kinases such as protein kinase A (PKA) in response to cAMP signaling, adds phosphate groups to serine, threonine, or tyrosine residues, thereby activating or inactivating enzymes in metabolic pathways. Ubiquitination involves the attachment of ubiquitin chains by E3 ligases, marking proteins for degradation via the 26S proteasome and controlling processes like cell cycle progression; for example, the E3 ligase MDM2 ubiquitinates p53, reducing its half-life to approximately 20 minutes under normal conditions to prevent excessive apoptosis. Glycosylation attaches carbohydrate moieties in the endoplasmic reticulum and Golgi, influencing protein folding, stability, and trafficking, as exemplified by N-linked glycans on antibodies that enhance immune effector functions.

Protein stability is a critical aspect of post-translational control, with the ubiquitin-proteasome pathway degrading short-lived regulatory proteins to maintain homeostasis; proteins like cyclins exhibit half-lives of minutes to hours, allowing rapid responses to signals. Feedback loops integrate these modifications with upstream signals, such as the mTOR pathway, which senses nutrients and growth factors to phosphorylate targets like 4E-BP1, thereby promoting cap-dependent translation initiation and balancing anabolic processes. SUMOylation conjugates the small ubiquitin-like modifier (SUMO) to lysine residues to regulate protein interactions and stress responses, with recent cryo-electron microscopy structures (post-2020) elucidating the mechanism by which the SUMO E1-activating enzyme conjugates SUMO under oxidative or heat stress, thereby stabilizing transcription factors like HIF-1.

Measurement and quantification

mRNA analysis techniques

mRNA analysis techniques encompass a range of methods designed to detect, quantify, and profile mRNA transcripts, providing insights into gene expression levels and patterns. These approaches have evolved from low-throughput hybridization-based assays to high-throughput sequencing technologies, enabling genome-wide analysis with increasing resolution and sensitivity. Traditional methods like Northern blotting offer specificity for individual transcripts, while modern techniques such as RNA sequencing (RNA-seq) allow for comprehensive profiling, including detection of novel and low-abundance RNAs.

Northern blotting, a classical hybridization technique, separates RNA molecules by size using denaturing gel electrophoresis, transfers them to a membrane, and detects specific mRNAs via hybridization with labeled complementary probes, such as radioactive or fluorescent DNA/RNA oligos. Developed in 1977, this method confirms transcript size, abundance, and integrity while distinguishing mature mRNAs from precursors, but it is labor-intensive, requires substantial RNA input (typically 10-30 μg), and lacks high throughput, limiting its use to validation of candidate genes rather than broad profiling.

Reverse transcription quantitative polymerase chain reaction (RT-qPCR) amplifies and quantifies specific mRNA targets after converting RNA to complementary DNA (cDNA) using reverse transcriptase enzymes. Detection relies on fluorescent dyes like SYBR Green, which binds double-stranded DNA, or probe-based systems such as TaqMan, where hydrolysis of a fluorophore-quencher-labeled probe during amplification generates a signal proportional to product accumulation. Quantification uses the cycle threshold (Ct) value, the PCR cycle at which fluorescence exceeds background, allowing relative expression to be calculated via the ΔΔCt method, normalized to stable reference genes like GAPDH to account for input variations; absolute quantification can employ standard curves.
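The ΔΔCt calculation mentioned above can be sketched in a few lines of Python. This is a minimal illustration that assumes roughly 100% amplification efficiency (product doubles each cycle) for both the target and the reference gene; when efficiencies differ, an efficiency-corrected formula is needed instead. The function name and the Ct values are hypothetical.

```python
def relative_expression(ct_target_treated: float, ct_ref_treated: float,
                        ct_target_control: float, ct_ref_control: float) -> float:
    """Fold change of a target gene (treated vs. control) by the ΔΔCt method.

    Assumes the amount of product doubles every cycle for both genes.
    """
    # Normalize the target to the reference gene within each sample.
    delta_ct_treated = ct_target_treated - ct_ref_treated
    delta_ct_control = ct_target_control - ct_ref_control
    # Compare the normalized values between conditions.
    delta_delta_ct = delta_ct_treated - delta_ct_control
    # A lower Ct means more template, hence the negative exponent.
    return 2.0 ** (-delta_delta_ct)


if __name__ == "__main__":
    # Illustrative Ct values: the target crosses threshold 3 cycles earlier
    # in the treated sample while the reference gene is unchanged.
    fold = relative_expression(22.0, 18.0, 25.0, 18.0)
    print(fold)  # 8.0
```

A difference of three cycles corresponds to a 2^3 = 8-fold change, which is why small Ct shifts can represent large changes in expression.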
RT-qPCR offers high sensitivity (detecting femtogram levels of RNA) and specificity but is limited to predefined targets and prone to biases from reverse transcription efficiency.

Microarray hybridization platforms enable parallel analysis of thousands of transcripts by immobilizing oligonucleotide probes (short DNA sequences, 25-70 nucleotides) on a solid surface, such as glass slides, where labeled cDNA or cRNA from the sample hybridizes to complementary probes, and signal intensity reflects expression levels. Pioneered in 1995, these arrays quantify gene expression through fluorescence scanning, with data normalized for background and technical variability; Affymetrix GeneChips use high-density probe pairs (perfect match and mismatch) on silicon wafers for mismatch discrimination, while Agilent arrays employ inkjet-printed long oligos on glass for higher specificity and dynamic range. Microarrays provide cost-effective genome-wide snapshots but suffer from cross-hybridization, limited dynamic range (3-4 orders of magnitude), and an inability to detect novel transcripts or isoforms.

RNA sequencing (RNA-seq) has revolutionized mRNA analysis by using next-generation sequencing platforms, such as Illumina's short-read technology, to sequence millions of cDNA fragments in parallel. The workflow involves RNA isolation, fragmentation, cDNA synthesis, adapter ligation, amplification, and sequencing, followed by read alignment to a reference genome using tools like STAR or HISAT2, and quantification of transcript abundance via metrics like fragments per kilobase of transcript per million mapped reads (FPKM) or transcripts per million (TPM), which normalize for transcript length, sequencing depth, and composition biases. Introduced in 2008 for mammalian transcriptomes, RNA-seq offers unbiased detection of all expressed transcripts, including low-abundance and novel transcripts, with a dynamic range exceeding six orders of magnitude and single-base resolution for splice junctions.
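As a rough illustration of the length and depth normalization behind TPM, the following Python sketch converts raw read counts into TPM values. It uses annotated transcript lengths directly, whereas quantification tools typically use effective lengths and handle multi-mapping reads; the function name and the input numbers are hypothetical.

```python
def tpm(counts: list[float], lengths_kb: list[float]) -> list[float]:
    """Convert read counts to transcripts per million (TPM).

    counts     -- reads assigned to each transcript
    lengths_kb -- transcript lengths in kilobases (same order as counts)
    """
    # Step 1: reads per kilobase corrects for transcript length.
    rates = [count / length for count, length in zip(counts, lengths_kb)]
    # Step 2: scaling so the values sum to one million corrects for depth.
    total = sum(rates)
    return [rate / total * 1e6 for rate in rates]


if __name__ == "__main__":
    # Three transcripts whose counts scale with their lengths, so they
    # have the same per-kilobase coverage and therefore the same TPM.
    values = tpm([100, 200, 300], [1.0, 2.0, 3.0])
    print(values)
```

Because TPM values sum to one million within each sample, they describe relative composition, and comparisons across samples with very different transcriptome composition still call for care.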
Single-cell RNA-seq (scRNA-seq) variants, such as Drop-seq developed in 2015, encapsulate individual cells in nanoliter droplets with barcoded beads to profile thousands of cells simultaneously, revealing cellular heterogeneity but introducing challenges like dropout events and sparsity in data.

For detecting and quantifying mRNA isoforms arising from alternative splicing, long-read sequencing technologies like Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) provide full-length transcript reads spanning entire molecules (up to 10-20 kb), bypassing the fragmentation issues of short-read RNA-seq. These methods sequence native or amplified RNA/cDNA directly, enabling accurate isoform assembly and quantification without reliance on computational reconstruction, as demonstrated in comprehensive studies since 2013 for PacBio Iso-Seq and 2017 for ONT native sequencing. Long-read approaches resolve complex splicing patterns and novel isoforms in 20-50% more transcripts than short-read methods, though they currently offer lower throughput and higher error rates (~0.1% for PacBio HiFi reads and ~0.5-2% for ONT with consensus calling, as of 2025), requiring error correction and hybrid short-long read strategies for optimal accuracy.

To address the lack of spatial context, spatial transcriptomics techniques map mRNA distribution within tissue sections, preserving positional information often lost in dissociated samples. Methods like 10x Genomics Visium, launched in 2019, array barcoded capture probes on slides to hybridize poly-A tails from permeabilized tissue slices, followed by reverse transcription, sequencing, and image alignment to generate spatially resolved expression maps at near-cellular resolution (55 μm spots covering 1-10 cells). Recent advancements like Visium HD, launched in 2024, achieve 2 μm subcellular resolution for single-cell-scale profiling.
Building on earlier array-based approaches from 2016, these methods enable profiling of thousands of genes across tissue architecture, revealing microenvironmental gradients, but spot-based implementations average signal over multiple cells and capture non-polyadenylated RNAs incompletely. Such data complement bulk or single-cell analyses by integrating expression with histology, aiding studies of development and disease.

Protein analysis techniques

Protein analysis techniques are essential for assessing the functional outcomes of gene expression, as they enable the detection, quantification, and characterization of translated proteins, including post-translational modifications (PTMs) that influence activity and localization. Unlike mRNA-based methods, these approaches directly measure the end products of gene expression, providing insights into protein abundance, interactions, and functionality in cellular contexts. Common techniques leverage immunological detection, chromatographic separation, or enzymatic reporting to achieve high specificity and sensitivity, and are often applied in studies of development and disease.

Western blotting is a widely used technique for detecting specific proteins in complex samples. It involves separating proteins by size using SDS-PAGE, followed by transfer to a nitrocellulose or PVDF membrane, and probing with primary antibodies specific to the target protein, visualized via secondary antibody-linked enzymes or fluorophores. Developed in 1979, it allows semi-quantitative analysis through densitometry, where band intensity correlates with protein levels, though normalization to loading controls like β-actin or GAPDH is required for accuracy. Western blotting is particularly valuable for confirming protein expression from genes of interest and detecting PTMs such as phosphorylation, with detection limits typically in the nanogram range per lane.

Enzyme-linked immunosorbent assay (ELISA) provides a sensitive method for quantifying proteins, especially secreted or soluble forms, in biological fluids. In the sandwich ELISA format, a capture antibody immobilizes the target on a microplate well, followed by detection with a second enzyme-conjugated antibody, producing a colorimetric, fluorescent, or chemiluminescent signal proportional to protein concentration. Introduced in 1971, this technique achieves sensitivities as low as ~1 pg/mL for many analytes, making it ideal for low-abundance proteins like cytokines or hormones.
ELISAs are high-throughput and quantitative, often used to measure gene expression outputs in serum or supernatants, with variations like competitive ELISA for small molecules.

Mass spectrometry (MS), particularly liquid chromatography-tandem MS (LC-MS/MS), enables comprehensive proteomics by identifying and quantifying thousands of proteins simultaneously from complex mixtures. In bottom-up proteomics, proteins are digested into peptides, separated by LC, ionized, and fragmented for identification via MS/MS, allowing proteome-wide profiling. Quantification can be label-free, relying on spectral counting or intensity, or use stable isotope labeling like SILAC (stable isotope labeling by amino acids in cell culture), where cells are grown in media with heavy isotopes to compare relative abundances with high precision (ratios accurate to <10% error). Introduced in 2002, SILAC is compatible with MS for dynamic studies of gene expression changes. LC-MS/MS excels in PTM identification, such as ubiquitination or glycosylation sites, with recent advancements achieving up to ~5,000 proteins per cell in optimized single-cell workflows, as of 2025.

Flow cytometry facilitates high-throughput analysis of protein expression at the single-cell level, including intracellular targets. Cells are fixed, permeabilized, and stained with fluorescently labeled antibodies specific to the protein of interest, then passed through a laser-interrogated flow stream to measure fluorescence intensity, enabling quantification of expression levels and heterogeneity. Multiplexing with multiple antibodies (up to 40+ colors) allows simultaneous assessment of several proteins, such as transcription factors or signaling molecules, in populations like immune cells. This technique is particularly useful for monitoring gene expression dynamics in response to stimuli, with sensitivities down to ~1,000 molecules per cell, and supports sorting of expressing cells for downstream analysis.
Reporter assays offer real-time, non-invasive monitoring of protein expression by fusing the gene of interest to a reporter like green fluorescent protein (GFP) or luciferase. In GFP fusions, the fluorescent tag allows visualization and quantification via microscopy or flow cytometry, reflecting the spatiotemporal dynamics of the target protein. Pioneered in 1994, GFP reporters are genetically encoded and require no substrates, enabling live-cell imaging of expression in organisms from bacteria to mammals. Luciferase reporters, based on firefly or Renilla enzymes, produce bioluminescent signals upon substrate addition, offering high sensitivity (~10^2-10^3 molecules) for transient or stable transfections, and are commonly used to quantify promoter activity driving gene expression.

Recent advances include proximity labeling techniques like BioID, which uses a promiscuous biotin ligase fused to a bait protein to biotinylate nearby proteins in living cells, enabling identification of interactomes and transient associations via MS. Developed in 2012, BioID captures proteins within ~10 nm, complementing traditional co-immunoprecipitation by labeling under physiological conditions. Additionally, AI-enhanced MS has emerged post-2023, with machine learning models improving peptide identification accuracy by >20% through spectral prediction, accelerating analysis in large-scale gene expression studies. These innovations enhance the resolution of protein-level insights into gene regulation.

Correlation and integration methods

Studies of gene expression have consistently shown that mRNA abundance correlates only moderately with protein levels, with Spearman coefficients typically ranging from 0.4 to 0.6 across large-scale datasets. This discrepancy arises primarily from variations in translation efficiency, influenced by factors such as codon usage and tRNA availability, as well as differences in mRNA and protein degradation rates. For instance, mRNAs with optimal codons are translated more efficiently, leading to higher protein output relative to transcript levels, while unstable proteins degrade rapidly, decoupling steady-state protein abundance from mRNA levels.

To address these discrepancies, multi-omics integration methods combine transcriptomic and proteomic data for a more comprehensive view of gene expression. Ribosome profiling (Ribo-seq), which maps ribosome-protected mRNA fragments to quantify translation, is often paired with RNA-seq to estimate translation efficiency by calculating ribosome density on transcripts. Similarly, integrating Ribo-seq with mass spectrometry-based proteomics enables the identification of translated open reading frames and improves proteome annotation through proteogenomics approaches. These methods reveal that translational control, such as alternative translation initiation, contributes significantly to the observed mRNA-protein mismatches.

Mathematical modeling provides a framework for understanding these dynamics at steady state, where the protein concentration $[P]$ is determined by the balance of synthesis and degradation rates:

$$[P] = \frac{k_s \cdot [\mathrm{mRNA}]}{k_d}$$

Here, $k_s$ represents the translation rate (synthesis rate per mRNA molecule), and $k_d$ is the protein degradation rate constant. This equation highlights how variations in $k_s$ and $k_d$ can buffer or amplify mRNA fluctuations, with empirical studies showing that degradation half-lives span orders of magnitude across proteins.
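The steady-state relation above, in which protein concentration equals the synthesis rate per mRNA times the mRNA level divided by the degradation rate constant, can be checked numerically. The following Python sketch is a minimal illustration assuming first-order protein degradation, constant mRNA, and no dilution by cell growth; all parameter values and function names are made up for the example.

```python
import math


def degradation_rate(half_life: float) -> float:
    """First-order rate constant k_d from a protein half-life (same time unit)."""
    return math.log(2) / half_life


def steady_state_protein(mrna: float, k_s: float, k_d: float) -> float:
    """Steady-state protein level: [P] = k_s * [mRNA] / k_d."""
    return k_s * mrna / k_d


def simulate_protein(mrna: float, k_s: float, k_d: float,
                     t_end: float, dt: float = 0.1) -> float:
    """Euler integration of d[P]/dt = k_s*[mRNA] - k_d*[P], starting from zero."""
    protein = 0.0
    for _ in range(int(t_end / dt)):
        protein += dt * (k_s * mrna - k_d * protein)
    return protein


if __name__ == "__main__":
    # Illustrative values: 10 mRNA copies, 2 proteins per mRNA per minute,
    # and a 20-minute protein half-life.
    k_d = degradation_rate(20.0)
    analytic = steady_state_protein(mrna=10.0, k_s=2.0, k_d=k_d)
    simulated = simulate_protein(mrna=10.0, k_s=2.0, k_d=k_d, t_end=600.0)
    print(round(analytic, 1), round(simulated, 1))
```

With the same mRNA level and translation rate, doubling the protein half-life doubles the steady-state protein level, which is one way two genes with identical transcript abundance can differ in protein abundance.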
At the single-cell level, correlations between mRNA and protein levels are even weaker due to stochastic noise in gene expression, often exacerbated by transcriptional bursting and variable translation. Techniques like single-cell RNA sequencing (scRNA-seq) integrated with mass cytometry (CyTOF) allow measurement of transcriptomes alongside dozens of protein markers, revealing cell-to-cell heterogeneity where noise from low molecule counts dominates. For example, CyTOF data show that protein levels in immune cells correlate modestly (Spearman ~0.3-0.5) with scRNA-seq-derived mRNA estimates, underscoring the role of intrinsic stochasticity in expression variability.

Buffering mechanisms further explain the imperfect correlation by stabilizing protein levels against perturbations in mRNA abundance. MicroRNAs (miRNAs) play a key role through regulatory loops, where they bind target mRNAs to repress translation and promote degradation, thereby reducing noise and constraining expression variance. This miRNA-mediated buffering is particularly evident in developmental contexts, where it maintains robust protein output despite fluctuating transcript levels.

Emerging computational approaches aim to predict these correlations by modeling regulatory impacts on expression. For instance, AlphaFold3 (2024) enables accurate prediction of protein-nucleic acid interactions, which can inform how structural features influence translation efficiency and mRNA stability. Such tools, combined with machine learning on multi-omics data, hold promise for imputing missing protein levels from transcriptomic profiles, though current models remain limited by training data sparsity.

Applications and systems

Expression systems in biotechnology

Expression systems in biotechnology are engineered platforms designed to produce specific proteins or molecules at high levels in host organisms or cell-free environments, enabling applications in research, therapeutics, and industrial production. These systems leverage promoters, regulatory elements, and vectors to control gene expression, often mimicking or enhancing natural mechanisms for precise temporal and spatial regulation. By optimizing codon usage, chaperone co-expression, and culture conditions, yields can reach grams per liter, facilitating scalable manufacturing.

Prokaryotic expression systems, particularly in Escherichia coli, are widely used due to their rapid growth, low cost, and ease of genetic manipulation. The T7 RNA polymerase-based system, developed using pET vectors, drives high-level expression from the strong T7 promoter upon induction with IPTG, achieving protein yields up to 50% of total cellular protein in optimized strains like BL21(DE3). Complementing this, the IPTG-inducible lac system allows tunable expression via the lac promoter, where the allolactose analog IPTG relieves LacI repression, enabling fine control for toxic proteins.

Eukaryotic systems provide post-translational modifications essential for mammalian proteins. In yeast like Saccharomyces cerevisiae, the GAL1 promoter is induced by galactose and repressed by glucose, supporting secreted protein production at levels of 1-10 g/L in engineered strains. For mammalian expression, the cytomegalovirus (CMV) promoter in human embryonic kidney (HEK293) cells drives constitutive high-level transcription, often yielding 100-500 mg/L of glycosylated antibodies via transient transfection.

Inducible systems offer reversible control to mitigate toxicity. The Tet-on and Tet-off systems use tetracycline or its analog doxycycline to modulate a tetracycline transactivator (tTA or rtTA), enabling activation or repression of target genes with minimal leakiness in mammalian cells.
Light-inducible systems, incorporating light-oxygen-voltage (LOV) domains from proteins like VVD, allow optogenetic control of expression through blue light-triggered dimerization, achieving fold-induction ratios over 100 in bacteria and eukaryotes.

Viral vectors facilitate gene delivery in hard-to-transfect cells and in vivo. Adeno-associated virus (AAV) vectors provide long-term episomal expression with low immunogenicity, commonly used for gene therapy at doses delivering 10^12-10^14 vector genomes per kg. Lentiviral vectors integrate transgenes for stable expression in dividing and non-dividing cells, supporting titers up to 10^8 TU/mL for applications like CAR-T cell engineering. For activation without genomic integration, CRISPR-based systems fuse catalytically dead Cas9 (dCas9) to VP64 activators, upregulating endogenous genes by 10-100 fold upon targeting.

Synthetic biology constructs enable complex circuits. Toggle switches, bistable networks built from two mutually repressing genes (e.g., lacI paired with a second repressor), maintain two stable expression states switchable by inducers, with response times under 1 hour in E. coli. Oscillators like the repressilator, a ring of three repressor genes (tetR, lacI, cI), generate rhythmic expression with periods of 2-3 hours, demonstrating predictable dynamics in vivo.

These systems underpin recombinant protein production, such as human insulin expressed in E. coli, which transformed diabetes treatment after its approval in 1982. Cell-free systems, like transcription-translation (TXTL) extracts from E. coli, support protein synthesis without cellular constraints, with post-2018 optimizations incorporating energy regeneration and chaperones boosting yields to 1-2 mg/mL for model proteins.
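The bistability of the toggle switch described above can be illustrated with a small simulation. The sketch below uses a commonly cited dimensionless two-repressor model (each repressor inhibits synthesis of the other with Hill-type kinetics) integrated with a simple Euler scheme. The parameter values (`alpha`, `n`), the step size, and the function name are assumptions chosen for illustration and are not fitted to any real circuit.

```python
def simulate_toggle(u0: float, v0: float, alpha: float = 10.0, n: float = 2.0,
                    dt: float = 0.01, steps: int = 5000) -> tuple[float, float]:
    """Integrate a dimensionless mutual-repression model with Euler steps.

    du/dt = alpha / (1 + v**n) - u
    dv/dt = alpha / (1 + u**n) - v
    u and v are the levels of the two repressors.
    """
    u, v = u0, v0
    for _ in range(steps):
        du = alpha / (1.0 + v ** n) - u  # synthesis repressed by v, minus decay
        dv = alpha / (1.0 + u ** n) - v  # synthesis repressed by u, minus decay
        u, v = u + dt * du, v + dt * dv
    return u, v


if __name__ == "__main__":
    # The same circuit settles into opposite states depending on its history.
    print(simulate_toggle(5.0, 1.0))  # u high, v low
    print(simulate_toggle(1.0, 5.0))  # u low, v high
```

In this simplified picture an inducer acts by transiently inactivating one repressor, which pushes the system across the boundary between the two basins of attraction and flips the switch.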

Gene expression in disease and development

Gene expression plays a pivotal role in embryonic development, where spatial and temporal gradients of regulatory proteins establish body axes and segment patterns. In Drosophila, the Bicoid protein forms an anterior-to-posterior concentration gradient that acts as a morphogen, activating target genes such as hunchback in a threshold-dependent manner to specify anterior structures like the head and thorax. This gradient is established by localized maternal mRNA deposition at the anterior pole, followed by translation and diffusion in the syncytial embryo. Similarly, vertebrate somitogenesis involves a segmentation clock, an oscillatory genetic network driven by cyclic expression of genes like hairy and enhancer of split (Hes) family members, which regulates the periodic formation of somites along the body axis. These oscillations, with periods of about 2 hours in mice, arise from negative feedback loops involving Notch, Wnt, and FGF signaling pathways, ensuring synchronized tissue segmentation.

Dysregulated gene expression underlies many diseases, particularly cancer, where aberrant activation of oncogenes and silencing of tumor suppressors drive hallmarks such as sustained proliferative signaling. Amplification of the MYC oncogene, observed in approximately 28% of tumors across various cancers, leads to overexpression that enhances transcription of genes promoting cell growth and proliferation. Conversely, epigenetic silencing of tumor suppressor genes like p16INK4a and MLH1 through promoter hypermethylation occurs frequently in colorectal and other cancers, inactivating pathways that normally halt uncontrolled proliferation. In neurological contexts, the transcription factor CREB mediates activity-dependent gene expression critical for learning, memory, and synaptic plasticity; phosphorylation of CREB by kinases like PKA in response to neuronal stimulation activates downstream targets such as BDNF and c-fos, strengthening synaptic connections in the hippocampus.
Beyond cancer and , altered gene expression profiles characterize other diseases, including autoimmune disorders and infectious conditions. In systemic (SLE), an (IFN) signature—marked by upregulated expression of over 100 IFN-stimulated genes in peripheral blood mononuclear cells—observed in approximately half of patients and correlates with disease activity and production. Single-cell sequencing studies of patients have revealed disease severity-specific expression changes, such as heightened IFN responses and dysregulation in severe cases, highlighting heterogeneous immune cell states that persist post-infection. Therapeutic strategies targeting these dysregulations include (HDAC) inhibitors like , which reverse epigenetic silencing of tumor suppressors in cancers such as by promoting and gene reactivation. (RNAi) therapeutics, exemplified by —an siRNA that silences (TTR) gene expression—have shown efficacy in hereditary ATTR amyloidosis by reducing toxic protein aggregates and improving neuropathy symptoms in phase III trials. Gene expression divergence also contributes to evolutionary processes, particularly , where changes in regulatory elements lead to species-specific patterns without altering protein-coding sequences. In closely related species like , cis-regulatory mutations drive differential expression of genes such as BMP4 in beak development, facilitating adaptive morphological divergence and . Such expression shifts, often involving trans-regulatory factors, accumulate over time and can result in hybrid incompatibilities, underscoring the role of regulatory evolution in generating .
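The threshold-dependent readout of the Bicoid gradient described in this section can be made concrete with a small calculation. The sketch below assumes the gradient is a simple exponential decay from the anterior pole, which is the steady-state profile of a localized source with diffusion and uniform degradation. The decay length of 100 µm, the normalized anterior concentration, and the threshold value are illustrative assumptions, as are the function names.

```python
import math


def bicoid_concentration(x_um, c0=1.0, decay_length_um=100.0):
    """Concentration at distance x_um from the anterior pole.

    Assumes an exponential profile c0 * exp(-x / lambda); c0 and the
    decay length are illustrative, not measured, values.
    """
    return c0 * math.exp(-x_um / decay_length_um)


def expression_boundary(threshold, c0=1.0, decay_length_um=100.0):
    """Position at which the gradient falls to the activation threshold."""
    return decay_length_um * math.log(c0 / threshold)


def target_is_on(x_um, threshold, c0=1.0, decay_length_um=100.0):
    """A target gene is active wherever the concentration meets its threshold."""
    return bicoid_concentration(x_um, c0, decay_length_um) >= threshold


# A hypothetical target that needs 10% of the anterior concentration.
boundary = expression_boundary(threshold=0.1)
print(f"expression boundary at about {boundary:.0f} um from the anterior pole")
print("on at 100 um:", target_is_on(100.0, 0.1))
print("on at 300 um:", target_is_on(300.0, 0.1))
```

Under these assumptions, targets with higher thresholds are switched on only closer to the anterior pole. This is how a single gradient can position several distinct expression boundaries.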

Gene regulatory networks

Gene regulatory networks (GRNs) consist of interconnected genes and their regulatory elements that collectively control the timing, location, and level of gene expression in response to internal and external signals. These networks integrate transcriptional regulators, such as transcription factors, with cis-regulatory modules to orchestrate complex cellular behaviors such as differentiation. GRNs exhibit modular architectures that enable robustness and evolvability, allowing cells to process information in a manner akin to computational circuits.

Common network motifs, or recurring subgraphs, underpin the functional logic of GRNs. Feed-forward loops (FFLs), for instance, involve a regulator that controls both a direct target and an intermediary regulator of the same target, enabling rapid signal propagation or a delayed response. In Escherichia coli, FFLs are overrepresented and function in noise filtering and response acceleration. Feedback loops provide stability or amplification: negative feedback dampens fluctuations to maintain steady states, while positive feedback reinforces commitments, such as in cell fate decisions. The lac operon exemplifies a classic motif, in which the Lac repressor participates in a feedback loop with lactose-induced activation, ensuring efficient lactose metabolism in response to environmental sugars.

Reconstructing GRNs from high-throughput data, such as gene expression profiles, reveals these interactions. The ARACNe (Algorithm for the Reconstruction of Accurate Cellular Networks) algorithm infers direct regulatory links by estimating mutual information between genes and pruning indirect connections via the data processing inequality, scaling to mammalian genome-wide networks. Boolean networks model GRN dynamics by assigning binary states (on/off) to genes and defining logical rules for activation, capturing attractors that represent stable cell states. These discrete models have been applied to simulate developmental transitions and predict perturbation outcomes.

GRNs often display scale-free topologies, where a few highly connected hub genes regulate many targets, following a power-law degree distribution. The tumor suppressor TP53 exemplifies a hub, integrating stress signals to activate hundreds of downstream genes involved in apoptosis and cell-cycle arrest. This scale-free property confers robustness against random node failures, because most randomly chosen nodes are sparsely connected and the hubs that carry core functionality are rarely hit.

In development, GRNs coordinate spatial and temporal gene expression patterns. The sea urchin (Strongylocentrotus purpuratus) endomesoderm GRN, modeled by Davidson and colleagues, illustrates a hierarchical circuit where upstream inputs like β-catenin activate territorial transcription factors, leading to progressive specification of endoderm and mesoderm lineages through repressive and activating interactions. This model highlights how GRN kernels, small subcircuits within the network, drive irreversible cell fate decisions.

Dysregulated GRNs also contribute to disease, particularly cancer. The Wnt/β-catenin pathway forms a core GRN module in colorectal cancer, where APC mutations stabilize β-catenin, driving aberrant activation of MYC and CCND1 to promote proliferation. Post-2022 single-cell atlas projects, such as the Human Cell Atlas, have mapped heterogeneous cell states but reveal incomplete GRN coverage due to challenges in inferring context-specific regulation from sparse single-cell data.

GRN dynamics often produce oscillatory expression patterns essential for periodic processes. The mammalian circadian clock GRN features interlocking feedback loops: CLOCK-BMAL1 activates PER and CRY transcription, whose protein products form repressive complexes that inhibit CLOCK-BMAL1, generating ~24-hour rhythms in gene expression across tissues. This oscillatory architecture keeps physiological processes synchronized, and its disruption has been linked to metabolic disorders.
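The Boolean network formalism mentioned in this section can be sketched with a toy example. The three-gene network below is hypothetical (genes A and B repress each other, and C simply follows A) and uses synchronous updates, which is one of several possible update schemes. The helper `find_attractors` enumerates every start state and records the cycle each trajectory ends in.

```python
from itertools import product


def step(state):
    """Synchronous update rule for a hypothetical three-gene network."""
    a, b, c = state
    next_a = int(not b)  # A is on unless repressed by B
    next_b = int(not a)  # B is on unless repressed by A
    next_c = a           # C is activated by A
    return (next_a, next_b, next_c)


def find_attractors(update, n_genes):
    """Follow every start state until it revisits a state; collect the cycles."""
    attractors = set()
    for start in product((0, 1), repeat=n_genes):
        seen = []
        state = start
        while state not in seen:
            seen.append(state)
            state = update(state)
        cycle = seen[seen.index(state):]  # states from the first repeat onward
        attractors.add(frozenset(cycle))
    return attractors


attractors = find_attractors(step, 3)
fixed_points = sorted(next(iter(a)) for a in attractors if len(a) == 1)
print("fixed points:", fixed_points)
print("number of attractors:", len(attractors))
```

The two fixed points correspond to the two stable "cell states" of the mutual-repression pair. The remaining attractor is a two-state cycle that arises because both genes are updated at the same instant, and it typically disappears under asynchronous updating. This is one reason the choice of update scheme matters when interpreting such models.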

Techniques and resources

Experimental tools

Experimental tools for studying gene expression encompass a range of techniques designed to visualize, perturb, and analyze regulatory mechanisms at the molecular level. These methods enable researchers to monitor promoter activity, disrupt gene function, map protein-DNA interactions, and observe dynamic processes in living cells, providing insights into transcriptional control and regulatory networks. Unlike quantification-focused approaches, these tools emphasize functional interrogation and spatial-temporal visualization.

Reporter genes serve as versatile tools for visualizing and quantifying gene expression patterns. The lacZ gene, encoding β-galactosidase from Escherichia coli, is a classic reporter that produces a blue precipitate upon reaction with the X-gal substrate, allowing histological detection of expression in transgenic models such as mice. This system has been widely adopted for mapping developmental expression profiles due to its stability and ease of detection. Green fluorescent protein (GFP), derived from the jellyfish Aequorea victoria, enables non-invasive, real-time visualization of gene expression through its intrinsic fluorescence, without requiring substrates or cofactors. Introduced as a marker in 1994, GFP and its variants have revolutionized live-cell imaging by facilitating the tracking of protein localization and expression dynamics in organisms from bacteria to mammals. Dual-luciferase assays enhance reporter precision by co-transfecting firefly luciferase (driven by the promoter of interest) with Renilla luciferase (as an internal control for transfection efficiency), allowing normalized measurement of transcriptional activity through sequential detection. This method, developed in the mid-1990s, minimizes variability from cell number or viability, making it well suited to assays of regulatory elements.

Perturbation techniques are essential for dissecting causal relationships in gene expression by selectively inhibiting or knocking down target genes. RNA interference (RNAi) utilizes small interfering RNAs (siRNAs) or short hairpin RNAs (shRNAs) to trigger sequence-specific mRNA degradation, effectively silencing gene expression. The discovery of RNAi in 1998 demonstrated its potency in Caenorhabditis elegans, and shRNA expression vectors extended this to stable knockdown in mammalian cells by mimicking pri-miRNA processing. CRISPR interference (CRISPRi) employs a catalytically dead Cas9 (dCas9) protein guided by single-guide RNAs (sgRNAs) to sterically block transcription initiation or elongation without altering the genome. Introduced in 2013, CRISPRi achieves tunable repression levels and multiplexed targeting, offering reversibility and minimal off-target effects compared to traditional knockouts.

Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is a key method for mapping transcription factor (TF) binding sites and epigenetic modifications genome-wide. The technique involves crosslinking proteins to DNA, immunoprecipitating with antibodies specific to the TF or histone mark, and sequencing the enriched fragments to identify binding peaks. Pioneered in 2007, ChIP-seq provides high-resolution profiles of regulatory landscapes, revealing how TFs and chromatin modifiers such as histone acetyltransferases influence expression. Peak-calling algorithms then distinguish significant enrichment from background, enabling the annotation of enhancers and promoters associated with active transcription.

The electrophoretic mobility shift assay (EMSA) detects direct protein-DNA interactions by observing the retarded migration of labeled DNA probes bound by proteins from nuclear extracts during native gel electrophoresis. Developed in 1981, EMSA confirms TF binding to specific motifs, such as NF-κB binding to κB sites, and the complex can be supershifted with antibodies to confirm specificity. This low-throughput assay remains valuable for validating interactions identified by high-throughput methods like ChIP-seq.

Live-cell imaging techniques capture the spatiotemporal dynamics of gene expression. Förster resonance energy transfer (FRET) uses pairs of fluorescent proteins, such as CFP and YFP fused to interacting partners, to report conformational changes or complex formation through energy transfer from the donor to the acceptor. In gene expression studies, FRET-based reporters monitor promoter activation or TF dimerization in real time, providing kinetic data on regulatory events. Optogenetics extends this by employing light-sensitive proteins such as cryptochromes to control gene expression with high precision. Post-2015 applications in neural systems have used optogenetic tools to modulate transcription in neurons, such as light-inducible systems for doxycycline-independent control, aiding studies of circuit-specific expression in development and plasticity.

Safety and ethical considerations are paramount when employing these tools, particularly with recombinant expression systems. Biosafety levels (BSL) classify laboratory practices based on risk: BSL-1 covers well-characterized agents like non-pathogenic E. coli used in reporter assays, while BSL-2 covers moderate-risk materials such as viral vectors or human-derived cells in RNAi and CRISPR experiments. The CDC's Biosafety in Microbiological and Biomedical Laboratories guidelines mandate containment and decontamination protocols to prevent accidental release, while the NIH guidelines for research involving recombinant or synthetic nucleic acid molecules provide oversight for gene perturbation studies.
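The normalization behind the dual-luciferase assay described in this section is simple arithmetic, sketched below. The luminescence readings are made-up numbers in arbitrary units, and the function names are illustrative rather than part of any instrument software.

```python
def normalized_activity(firefly, renilla):
    """Firefly signal divided by the Renilla internal-control signal."""
    if renilla <= 0:
        raise ValueError("Renilla control signal must be positive")
    return firefly / renilla


def fold_change(sample, control):
    """Normalized activity of a sample relative to a control condition.

    Each argument is a (firefly, renilla) pair of luminescence readings.
    """
    return normalized_activity(*sample) / normalized_activity(*control)


# Hypothetical readings (arbitrary luminescence units).
control_well = (2000.0, 1000.0)  # reporter without stimulus
treated_well = (9000.0, 1500.0)  # reporter with stimulus; more cells transfected

result = fold_change(treated_well, control_well)
print(f"fold induction: {result:.1f}")
```

In this example the raw firefly signal rises 4.5-fold, but part of that increase reflects higher transfection efficiency in the treated well. Dividing by the Renilla control reduces the estimate to a 3-fold induction.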

Computational and database resources

Several major public databases serve as central repositories for gene expression data, enabling researchers to access, share, and analyze large-scale datasets. The Gene Expression Omnibus (GEO), maintained by the National Center for Biotechnology Information (NCBI), is a primary archive for functional genomics data, including microarray and high-throughput sequencing experiments on mRNA and protein expression across many species. As of 2025, GEO hosts over 8 million samples from more than 260,000 studies, facilitating meta-analyses and validation of expression patterns. The Encyclopedia of DNA Elements (ENCODE) project provides comprehensive data on gene expression and epigenetic regulation in human cells, integrating RNA-seq, ChIP-seq, and other assays to map regulatory elements and their impact on transcription. ENCODE's datasets, spanning thousands of experiments, emphasize functional annotation of the non-coding genome and are accessible via its data portal for querying expression in specific cell types or conditions. Complementing these, the Genotype-Tissue Expression (GTEx) project offers tissue-specific gene expression profiles from 946 postmortem donors across 54 human tissues, linking genetic variants to expression quantitative trait loci (eQTLs) to study regulatory mechanisms. GTEx data, version 8 (released 2019), supports investigations into heritability and disease-associated expression variation.

Analysis tools and algorithms are essential for processing and interpreting gene expression data from these repositories. DESeq2, an R-based package, is widely used for differential expression analysis of count data from RNA-seq experiments, employing a negative binomial model to estimate variance and detect significant changes between conditions while controlling the false discovery rate. Introduced in 2014, it has been cited over 20,000 times and remains a standard for differential expression analysis of RNA-seq data. For exploring functional relationships, the STRING database integrates protein-protein interaction (PPI) networks with gene expression data, combining experimental, computational, and literature-derived evidence to predict co-expression and pathway involvement. STRING's latest version (12.5, 2025) covers over 12,000 organisms and includes tools for network visualization and enrichment analysis, aiding in the contextualization of expression changes within biological pathways.

Prediction models leveraging machine learning have advanced the forecasting and annotation of gene expression patterns. DeepSEA, a deep convolutional neural network, predicts transcription factor (TF) binding sites and chromatin accessibility from DNA sequences, enabling the interpretation of non-coding variants' effects on expression regulation. Developed in 2015, DeepSEA was trained on large-scale chromatin profiling data and achieves high accuracy in variant effect scoring, with applications in prioritizing disease-associated mutations. For expression forecasting, models based on graph neural networks or time-series analysis predict dynamic changes in gene expression under perturbations, such as in developmental trajectories or drug responses, using historical data from GEO or GTEx. In single-cell contexts, scGPT (2023), a transformer-based model pretrained on millions of cells, generates and analyzes expression profiles for tasks like cell-type annotation and perturbation simulation.

Integration platforms facilitate the visualization and cross-referencing of gene expression data with genomic annotations. Genome browsers, such as the UCSC Genome Browser, provide an interactive interface for viewing expression tracks alongside reference genomes, allowing users to overlay data from ENCODE or GTEx with epigenetic marks and variants. Support for custom uploads and API access makes them valuable for data exploration and hypothesis generation.

Data standards like MIAME (Minimum Information About a Microarray Experiment), extended to sequencing data, ensure that deposited datasets include sufficient metadata for reproducibility, such as experimental design and processing details. However, challenges in reproducibility persist, including batch effects, incomplete metadata, and variability in analysis pipelines, which can lead to inconsistent findings across studies despite standardized submissions to GEO. Regarding accessibility, most resources, like DESeq2, are open-source, promoting widespread adoption, whereas proprietary platforms such as Illumina BaseSpace offer integrated workflows for sequencing with commercial hardware compatibility, though they may limit customization. This dichotomy highlights ongoing efforts to balance innovation with equitable access in gene expression research.
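Differential expression tools such as DESeq2 test thousands of genes at once, so per-gene p-values are adjusted to control the false discovery rate. The sketch below shows the widely used Benjamini-Hochberg adjustment in plain Python. It is a generic illustration with made-up p-values, not a reproduction of DESeq2's R code, and the function name is an assumption.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so each adjusted value is the
    # minimum of p * m / rank over all ranks at or above its own.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw = [0.001, 0.02, 0.03, 0.5]
adjusted = benjamini_hochberg(raw)
significant = [q <= 0.05 for q in adjusted]
print("adjusted:", [round(q, 4) for q in adjusted])
print("significant at FDR 0.05:", significant)
```

Genes whose adjusted value falls at or below the chosen cutoff are reported as differentially expressed. With a cutoff of 0.05, about 5% of the reported genes are expected to be false positives.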
