Protein structure
from Wikipedia

Figure: Diagram of the four levels of protein structure (primary, secondary, tertiary, quaternary), using PCNA as an example. (PDB: 1AXC)

Protein structure is the three-dimensional arrangement of atoms in an amino acid chain molecule. Proteins are polymers, specifically polypeptides, formed from sequences of amino acids, which are the monomers of the polymer. A single amino acid monomer may also be called a residue, which indicates a repeating unit of a polymer. Proteins form by amino acids undergoing condensation reactions, in which the amino acids lose one water molecule per reaction in order to attach to one another with a peptide bond. By convention, a chain under 30 amino acids is often identified as a peptide, rather than a protein.[1] To be able to perform their biological function, proteins fold into one or more specific spatial conformations driven by a number of non-covalent interactions, such as hydrogen bonding, ionic interactions, van der Waals forces, and hydrophobic packing. To understand the functions of proteins at a molecular level, it is often necessary to determine their three-dimensional structure. This is the topic of the scientific field of structural biology, which employs techniques such as X-ray crystallography, NMR spectroscopy, cryo-electron microscopy (cryo-EM) and dual polarisation interferometry to determine the structure of proteins.

Protein structures range in size from tens to several thousand amino acids.[2] By physical size, proteins are classified as nanoparticles, between 1–100 nm. Very large protein complexes can be formed from protein subunits. For example, many thousands of actin molecules assemble into a microfilament.

A protein usually undergoes reversible structural changes in performing its biological function. The alternative structures of the same protein are referred to as different conformations, and transitions between them are called conformational changes.

Levels of protein structure

There are four distinct levels of protein structure.

Four levels of protein structure

Primary structure

The primary structure of a protein refers to the sequence of amino acids in the polypeptide chain. The primary structure is held together by peptide bonds that are made during the process of protein biosynthesis. The two ends of the polypeptide chain are referred to as the carboxyl terminus (C-terminus) and the amino terminus (N-terminus) based on the nature of the free group on each extremity. Counting of residues always starts at the N-terminal end (NH2-group), which is the end where the amino group is not involved in a peptide bond. The primary structure of a protein is determined by the gene corresponding to the protein. A specific sequence of nucleotides in DNA is transcribed into mRNA, which is read by the ribosome in a process called translation. The sequence of amino acids in insulin was discovered by Frederick Sanger, establishing that proteins have defining amino acid sequences.[3][4] The sequence of a protein is unique to that protein, and defines the structure and function of the protein. The sequence of a protein can be determined by methods such as Edman degradation or tandem mass spectrometry. Often, however, it is read directly from the sequence of the gene using the genetic code. It is strictly recommended to use the words "amino acid residues" when discussing proteins because when a peptide bond is formed, a water molecule is lost, and therefore proteins are made up of amino acid residues. Post-translational modifications such as phosphorylations and glycosylations are usually also considered a part of the primary structure, and cannot be read from the gene. For example, insulin is composed of 51 amino acids in 2 chains. One chain has 21 amino acids, and the other has 30 amino acids.
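
The water-loss bookkeeping behind the term "residue" can be illustrated numerically. The sketch below is a minimal illustration, not a reference implementation: it estimates the average mass of a linear peptide by summing free amino acid masses and subtracting one water per peptide bond. The function name `peptide_mass` and the small mass table (approximate average masses, illustrative subset only) are assumptions introduced for this example.

```python
# Approximate average masses of free amino acids in daltons.
# Illustrative subset only; one-letter codes G, A, S, V.
FREE_AMINO_ACID_MASS = {"G": 75.07, "A": 89.09, "S": 105.09, "V": 117.15}
WATER_MASS = 18.015  # one water molecule is lost per peptide bond formed


def peptide_mass(sequence: str) -> float:
    """Mass of a linear peptide: free amino acid masses minus (n - 1) waters."""
    if not sequence:
        return 0.0
    total = sum(FREE_AMINO_ACID_MASS[aa] for aa in sequence)
    peptide_bonds = len(sequence) - 1  # a chain of n residues has n - 1 bonds
    return total - peptide_bonds * WATER_MASS


if __name__ == "__main__":
    # Gly-Ala dipeptide: two free amino acids minus one water.
    print(peptide_mass("GA"))
```

A chain of n residues therefore weighs (n - 1) waters less than its free amino acids combined, which is why the units in the chain are called residues rather than amino acids.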

Secondary structure

An α-helix with hydrogen bonds (yellow dots)

Secondary structure refers to highly regular local sub-structures on the actual polypeptide backbone chain. Two main types of secondary structure, the α-helix and the β-strand or β-sheets, were suggested in 1951 by Linus Pauling.[5] These secondary structures are defined by patterns of hydrogen bonds between the main-chain peptide groups. They have a regular geometry, being constrained to specific values of the dihedral angles ψ and φ on the Ramachandran plot. Both the α-helix and the β-sheet represent a way of saturating all the hydrogen bond donors and acceptors in the peptide backbone. Some parts of the protein are ordered but do not form any regular structures. They should not be confused with random coil, an unfolded polypeptide chain lacking any fixed three-dimensional structure. Several sequential secondary structures may form a "supersecondary unit".[6]

Tertiary structure

Tertiary structure refers to the three-dimensional structure created by a single protein molecule (a single polypeptide chain). It may include one or several domains. The α-helices and β-pleated-sheets are folded into a compact globular structure. The folding is driven by the non-specific hydrophobic interactions, the burial of hydrophobic residues from water, but the structure is stable only when the parts of a protein domain are locked into place by specific tertiary interactions, such as salt bridges, hydrogen bonds, and the tight packing of side chains and disulfide bonds. The disulfide bonds are extremely rare in cytosolic proteins, since the cytosol (intracellular fluid) is generally a reducing environment.

Quaternary structure

Quaternary structure is the three-dimensional structure consisting of the aggregation of two or more individual polypeptide chains (subunits) that operate as a single functional unit (multimer). The resulting multimer is stabilized by the same non-covalent interactions and disulfide bonds as in tertiary structure. There are many possible quaternary structure organisations.[7] A multimer is called a dimer if it contains two subunits, a trimer if it contains three, a tetramer if it contains four, a pentamer if it contains five, and so forth. The subunits are frequently related to one another by symmetry operations, such as a 2-fold axis in a dimer. Multimers made up of identical subunits are referred to with the prefix "homo-" and those made up of different subunits with the prefix "hetero-"; an example is hemoglobin, a heterotetramer of two alpha and two beta chains.
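
The naming convention can be written as a small rule. The following sketch assumes subunits are given as a list of chain labels; the function name `name_multimer` and the fallback form "N-mer" for counts above five are choices made for this illustration, not established nomenclature.

```python
# Names for the subunit counts listed in the text (two to five subunits).
COUNT_NAMES = {2: "dimer", 3: "trimer", 4: "tetramer", 5: "pentamer"}


def name_multimer(subunits: list[str]) -> str:
    """Name a complex from its subunit labels, e.g. ['alpha', 'alpha']."""
    n = len(subunits)
    if n < 2:
        return "monomer"
    # Fallback for larger assemblies is an assumption of this sketch.
    base = COUNT_NAMES.get(n, f"{n}-mer")
    # Identical subunits give "homo-", any difference gives "hetero-".
    prefix = "homo" if len(set(subunits)) == 1 else "hetero"
    return prefix + base


if __name__ == "__main__":
    # Hemoglobin: two alpha and two beta chains.
    print(name_multimer(["alpha", "alpha", "beta", "beta"]))
```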

Homomers

An assemblage of multiple copies of a particular polypeptide chain can be described as a homomer, multimer or oligomer. Hundreds of proteins have been identified as being assembled into homomers in human cells. Homomer formation may be driven by interaction between nascent polypeptide chains as they are translated from mRNA by nearby adjacent ribosomes.[8]

Domains, motifs, and folds in protein structure

Protein domains. The two shown protein structures share a common domain (maroon), the PH domain, which is involved in phosphatidylinositol (3,4,5)-trisphosphate binding.

Proteins are frequently described as consisting of several structural units. These units include domains, motifs, and folds. Despite the fact that there are about 100,000 different proteins expressed in eukaryotic systems, there are many fewer different domains, structural motifs and folds.

Structural domain

A structural domain is an element of the protein's overall structure that is self-stabilizing and often folds independently of the rest of the protein chain. Many domains are not unique to the protein products of one gene or one gene family but instead appear in a variety of proteins. Domains often are named and singled out because they figure prominently in the biological function of the protein they belong to; for example, the "calcium-binding domain of calmodulin". Because they are independently stable, domains can be "swapped" by genetic engineering between one protein and another to make chimera proteins. A conservative combination of several domains that occur in different proteins, such as protein tyrosine phosphatase domain and C2 domain pair, was called "a superdomain" that may evolve as a single unit.[9]

Structural and sequence motifs

Structural and sequence motifs are short segments of protein three-dimensional structure or amino acid sequence that are found in a large number of different proteins.

Supersecondary structure

Tertiary protein structures can have multiple secondary elements on the same polypeptide chain. The supersecondary structure refers to a specific combination of secondary structure elements, such as β-α-β units or a helix-turn-helix motif. Some of them may be also referred to as structural motifs.

Protein fold

A protein fold refers to the general protein architecture, like a helix bundle, β-barrel, Rossmann fold or different "folds" provided in the Structural Classification of Proteins database.[10] A related concept is protein topology.

Protein dynamics and conformational ensembles

Proteins are not static objects, but rather populate ensembles of conformational states. Transitions between these states typically occur on nanoscales, and have been linked to functionally relevant phenomena such as allosteric signaling[11] and enzyme catalysis.[12] Protein dynamics and conformational changes allow proteins to function as nanoscale biological machines within cells, often in the form of multi-protein complexes.[13] Examples include motor proteins, such as myosin, which is responsible for muscle contraction; kinesin, which moves cargo inside cells away from the nucleus along microtubules; and dynein, which moves cargo inside cells towards the nucleus and produces the axonemal beating of motile cilia and flagella. "[I]n effect, the [motile cilium] is a nanomachine composed of perhaps over 600 proteins in molecular complexes, many of which also function independently as nanomachines... Flexible linkers allow the mobile protein domains connected by them to recruit their binding partners and induce long-range allostery via protein domain dynamics."[14]

Schematic view of the two main ensemble modeling approaches[15]

Proteins are often thought of as relatively stable tertiary structures that experience conformational changes after being affected by interactions with other proteins or as a part of enzymatic activity. However, proteins may have varying degrees of stability, and some of the less stable variants are intrinsically disordered proteins. These proteins exist and function in a relatively 'disordered' state lacking a stable tertiary structure. As a result, they are difficult to describe by a single fixed tertiary structure. Conformational ensembles have been devised as a way to provide a more accurate and 'dynamic' representation of the conformational state of intrinsically disordered proteins.[16][15]

Protein ensemble files are a representation of a protein that can be considered to have a flexible structure. Creating these files requires determining which of the various theoretically possible protein conformations actually exist. One approach is to apply computational algorithms to the protein data in order to try to determine the most likely set of conformations for an ensemble file. There are multiple methods for preparing data for the Protein Ensemble Database that fall into two general methodologies – pool and molecular dynamics (MD) approaches (diagrammed in the figure). The pool based approach uses the protein's amino acid sequence to create a massive pool of random conformations. This pool is then subjected to more computational processing that creates a set of theoretical parameters for each conformation based on the structure. Conformational subsets from this pool whose average theoretical parameters closely match known experimental data for this protein are selected. The alternative molecular dynamics approach takes multiple random conformations at a time and subjects all of them to experimental data. Here the experimental data is serving as limitations to be placed on the conformations (e.g. known distances between atoms). Only conformations that manage to remain within the limits set by the experimental data are accepted. This approach often applies large amounts of experimental data to the conformations which is a very computationally demanding task.[15]
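
The selection step of the pool-based approach can be sketched in a few lines. This is a toy illustration under strong assumptions: each conformation is reduced to a single back-calculated parameter (the numbers below are hypothetical, not real data), the experimental observable is a single value, and subsets are searched exhaustively, which only works for very small pools. Generating the conformations and computing their parameters are omitted entirely. The name `select_ensemble` is introduced for this example.

```python
import itertools


def select_ensemble(pool, target, size):
    """Return the subset of `size` pool values whose mean is closest to `target`."""
    best = min(
        itertools.combinations(pool, size),
        key=lambda subset: abs(sum(subset) / size - target),
    )
    return list(best)


# Hypothetical per-conformation parameters (one number per conformer).
pool = [35.3, 32.7, 22.6, 17.8, 25.3, 22.1, 33.5, 19.1, 24.3, 27.5, 37.2, 25.1]
experimental_value = 22.0  # hypothetical measured ensemble average

ensemble = select_ensemble(pool, experimental_value, size=4)
ensemble_mean = sum(ensemble) / len(ensemble)

if __name__ == "__main__":
    print(ensemble, ensemble_mean)
```

Real ensemble-selection tools fit many observables at once and use stochastic or optimization-based searches, since the number of subsets grows combinatorially with pool size.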

Conformational ensembles have been generated for a number of highly dynamic and partially unfolded proteins, such as Sic1/Cdc4,[17] p15 PAF,[18] MKK7,[19] Beta-synuclein[20] and P27.[21]

Protein folding

As it is translated, a polypeptide exits the ribosome mostly as a random coil and folds into its native state.[22][23] The final structure of the protein chain is generally assumed to be determined by its amino acid sequence (Anfinsen's dogma).[24]

Protein stability

Thermodynamic stability of proteins represents the free energy difference between the folded and unfolded protein states. This free energy difference is very sensitive to temperature, hence a change in temperature may result in unfolding or denaturation. Protein denaturation may result in loss of function and loss of the native state. The free energy of stabilization of soluble globular proteins typically does not exceed 50 kJ/mol.[citation needed] Taking into consideration the large number of hydrogen bonds that take place for the stabilization of secondary structures, and the stabilization of the inner core through hydrophobic interactions, the free energy of stabilization emerges as a small difference between large numbers.[25]
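
To see what a stabilization free energy of this size means in practice, the sketch below converts an unfolding free energy into the equilibrium fraction of folded molecules. It assumes a simple two-state model (only folded and unfolded states), which the text does not state and which many proteins do not follow; the function name `fraction_folded` and the default temperature are likewise assumptions of this example.

```python
import math

R = 8.314462618e-3  # gas constant in kJ/(mol*K)


def fraction_folded(delta_g_unfolding_kj: float, temperature_k: float = 298.15) -> float:
    """Two-state folded fraction from the unfolding free energy in kJ/mol.

    A positive value means the folded state is more stable.
    """
    # K_unfold = [unfolded]/[folded] = exp(-dG_unfold / RT)
    k_unfold = math.exp(-delta_g_unfolding_kj / (R * temperature_k))
    # Folded fraction = [folded] / ([folded] + [unfolded])
    return 1.0 / (1.0 + k_unfold)


if __name__ == "__main__":
    for dg in (0.0, 10.0, 20.0, 50.0):
        print(dg, fraction_folded(dg))
```

Because RT is only about 2.5 kJ/mol at room temperature, even a modest net stabilization keeps nearly all molecules folded, while a comparably small shift in the balance is enough to tip the population toward the unfolded state.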

Protein structure determination

Examples of protein structures from the PDB
Rate of protein structure determination by method and year

Around 90% of the protein structures available in the Protein Data Bank have been determined by X-ray crystallography.[26] This method allows one to measure the three-dimensional (3-D) density distribution of electrons in the protein, in the crystallized state, and thereby infer the 3-D coordinates of all the atoms to be determined to a certain resolution. Roughly 7% of the known protein structures have been obtained by nuclear magnetic resonance (NMR) techniques.[27] For larger protein complexes, cryo-electron microscopy can determine protein structures. The resolution is typically lower than that of X-ray crystallography or NMR, but the maximum resolution is steadily increasing. This technique is particularly valuable for very large protein complexes such as virus coat proteins and amyloid fibers.

General secondary structure composition can be determined via circular dichroism. Vibrational spectroscopy can also be used to characterize the conformation of peptides, polypeptides, and proteins.[28] Two-dimensional infrared spectroscopy has become a valuable method to investigate the structures of flexible peptides and proteins that cannot be studied with other methods.[29][30] A more qualitative picture of protein structure is often obtained by proteolysis, which is also useful to screen for more crystallizable protein samples. Novel implementations of this approach, including fast parallel proteolysis (FASTpp), can probe the structured fraction and its stability without the need for purification.[31] Once a protein's structure has been experimentally determined, further detailed studies can be done computationally, using molecular dynamics simulations of that structure.[32]

Protein structure databases

A protein structure database is a database that is modeled around the various experimentally determined protein structures. The aim of most protein structure databases is to organize and annotate the protein structures, providing the biological community access to the experimental data in a useful way. Data included in protein structure databases often includes 3D coordinates as well as experimental information, such as unit cell dimensions and angles for structures determined by X-ray crystallography. Most entries, whether proteins or specific structure determinations of a protein, also contain sequence information, and some databases even provide means for performing sequence-based queries. Nevertheless, the primary attribute of a structure database is structural information, whereas sequence databases focus on sequence information and contain no structural information for the majority of entries. Protein structure databases are critical for many efforts in computational biology such as structure-based drug design, both in developing the computational methods used and in providing a large experimental dataset used by some methods to provide insights about the function of a protein.[33]

Structural classifications of proteins

Protein structures can be grouped based on their structural similarity, topological class or a common evolutionary origin. The Structural Classification of Proteins database[34] and CATH database[35] provide two different structural classifications of proteins. When the structural similarity is large the two proteins have possibly diverged from a common ancestor,[36] and shared structure between proteins is considered evidence of homology. Structure similarity can then be used to group proteins together into protein superfamilies.[37] If shared structure is significant but the fraction shared is small, the fragment shared may be the consequence of a more dramatic evolutionary event such as horizontal gene transfer, and joining proteins sharing these fragments into protein superfamilies is no longer justified.[36] Topology of a protein can be used to classify proteins as well. Knot theory and circuit topology are two topology frameworks developed for classification of protein folds based on chain crossing and intrachain contacts respectively.

Computational prediction of protein structure

Determining a protein's sequence is much easier than determining its structure. However, the structure of a protein gives much more insight into the function of the protein than its sequence. Therefore, a number of methods for the computational prediction of protein structure from its sequence have been developed.[38] Ab initio prediction methods use just the sequence of the protein. Threading and homology modeling methods can build a 3-D model for a protein of unknown structure from experimental structures of evolutionarily related proteins, called a protein family.

Predictive machine learning-based approaches tackle the structure problem at multiple levels. At the 1D level, secondary structure and solvent accessibility are predicted. The 2D level works on distances and points of contact along the protein chain; these predictions are orientation independent. At the 3D level, the coordinates of all the atoms in the protein are estimated; this level is the primary goal of most prediction efforts. Finally, the 4D level predicts complexes of multiple proteins. Progress at these levels is assessed every two years at the Critical Assessment of Structure Prediction (CASP) event.[39] The results from structure studies can be fed into machine learning techniques deployed to understand protein-protein interactions.[40]
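
The orientation independence of the 2D level can be checked directly: a contact map depends only on pairwise distances, so rotating the whole structure leaves it unchanged. The sketch below is a minimal illustration using made-up coordinates (one point per residue, in ångströms). The 8 Å cutoff is a commonly used threshold but is an assumption here, as are the helper names `contact_map` and `rotate_z`.

```python
import math


def contact_map(coords, cutoff=8.0):
    """Boolean matrix: True where two residues are closer than `cutoff`."""
    n = len(coords)
    return [
        [math.dist(coords[i], coords[j]) < cutoff for j in range(n)]
        for i in range(n)
    ]


def rotate_z(coords, angle_deg):
    """Rotate every point about the z axis by `angle_deg` degrees."""
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in coords]


# Hypothetical coordinates for five consecutive residues, 3.8 apart.
coords = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (7.6, 0.0, 0.0),
    (11.4, 0.0, 0.0),
    (11.4, 3.8, 0.0),
]

original = contact_map(coords)
rotated = contact_map(rotate_z(coords, 37.0))

if __name__ == "__main__":
    print(original == rotated)
```

The 3D coordinates change under rotation while the map does not, which is why distance and contact predictions can be compared without first superimposing structures.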

from Grokipedia
Protein structure refers to the three-dimensional arrangement of atoms within a protein molecule, which dictates its shape, stability, and biological function. Proteins are linear polymers composed of 20 standard amino acids linked by peptide bonds to form one or more polypeptide chains, with typical lengths ranging from 50 to 2,000 residues. This arrangement enables proteins to perform diverse roles, including enzymatic catalysis, structural support, transport, and signaling, as their three-dimensional conformation allows precise interactions with other molecules. The four levels of protein structure (primary, secondary, tertiary, and quaternary) build upon one another, stabilized by noncovalent and covalent interactions such as hydrogen bonds, hydrophobic effects, electrostatic forces, van der Waals interactions, and disulfide bridges.

The primary structure is the simplest level, defined as the specific linear sequence of amino acids in a polypeptide chain, determined by the encoding gene and held together by covalent peptide bonds between the carboxyl group of one amino acid and the amino group of the next. Variations in this sequence, such as single amino acid substitutions, can profoundly alter protein function, as seen in diseases like sickle cell anemia, where a glutamate-to-valine change disrupts hemoglobin stability. This sequence serves as the blueprint for higher-order folding, with the chemical properties of side chains (R groups), ranging from hydrophobic to charged, driving subsequent structural organization.

At the secondary structure level, the polypeptide backbone folds locally into repeating patterns, primarily α-helices and β-sheets, stabilized by hydrogen bonds between the carbonyl oxygen and amide hydrogen of the backbone. These motifs, first elucidated in the mid-20th century through studies of fibrous proteins like keratin and silk fibroin, contribute to the protein's overall rigidity and flexibility, with α-helices forming coiled rods and β-sheets creating pleated structures that can be parallel or antiparallel. Secondary elements often cluster to form compact domains, modular units of 40–350 residues that can function independently or combine for complex activities.

The tertiary structure encompasses the global three-dimensional folding of a single polypeptide chain, where secondary elements and side chains pack together to form a compact, globular (or elongated fibrous) shape, driven by the hydrophobic burial of nonpolar residues in the core and exposure of polar ones on the surface. This level is crucial for active sites in enzymes or binding interfaces, with folding often assisted by molecular chaperones to prevent aggregation and ensure correct conformation under physiological conditions. Misfolding at this stage can lead to pathological states, such as the formation of amyloid fibrils.

Many proteins exhibit quaternary structure, in which multiple polypeptide subunits (homomers or heteromers) assemble into a functional complex, further stabilized by the same noncovalent interactions as tertiary structure, plus potential interchain disulfide bonds. Examples include hemoglobin, a tetramer that enables cooperative oxygen binding, illustrating how quaternary assembly enhances efficiency and regulation. Only about 2,000 distinct folds have been identified among known structures, underscoring the efficiency of these principles in generating functional diversity from a limited set of building blocks.

Levels of protein structure

Primary structure

The primary structure of a protein refers to the linear sequence of amino acids in a polypeptide chain, connected covalently by peptide bonds between the carboxyl group of one amino acid and the amino group of the next. This sequence forms the foundational backbone of the protein and is uniquely determined for each protein type. The polypeptide chain has directionality, with the N-terminus (amino terminus) featuring a free amino group (-NH₂) at one end and the C-terminus (carboxyl terminus) bearing a free carboxyl group (-COOH) at the other; by convention, the primary structure is described from the N- to the C-terminus. The N-terminus often serves as the initiation site for protein synthesis, while the C-terminus can influence stability and interactions.

The primary structure arises from the process of translation, in which messenger RNA (mRNA) sequences are decoded into amino acid chains according to the genetic code. During translation, ribosomes read mRNA in triplets called codons, each specifying one of the 20 standard amino acids via transfer RNAs (tRNAs) that match codons to their corresponding amino acids. This code, nearly universal across organisms, ensures that the nucleotide sequence in mRNA directly dictates the precise order of amino acids in the protein. The 20 standard amino acids are distinguished by their side chains (R groups), which vary in size, shape, charge, and polarity. They range from nonpolar hydrophobic groups to polar uncharged ones (as in serine and threonine), acidic ones (in aspartate and glutamate), and basic ones, imparting specific chemical properties that influence the protein's overall behavior.

The primary structure is critical for a protein's identity, folding, and biological function, as even minor alterations can disrupt these processes. For instance, in sickle cell anemia, a point mutation in the β-globin gene substitutes valine (a hydrophobic amino acid) for glutamate (hydrophilic) at the sixth position of the β-chain, leading to abnormal protein aggregation and red blood cell deformation. Such mutations highlight how the exact sequence governs protein stability and activity.
Analytical techniques for determining primary structure include Edman degradation, which uses phenylisothiocyanate to selectively cleave and identify the N-terminal residue in successive cycles, allowing sequencing of up to 50–60 residues. Mass spectrometry complements this by fragmenting peptides and measuring mass-to-charge ratios to infer the full sequence, often via tandem MS/MS for de novo sequencing of complex proteins. The primary structure provides the template that influences the formation of higher-order structures.
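
The codon-by-codon decoding described above, and the effect of the sickle cell substitution, can be sketched in code. The example below builds the standard genetic code table and translates a short mRNA fragment modeled on the start of the β-globin coding sequence. The fragment should be treated as illustrative rather than as an authoritative reference sequence, and the names `CODON_TABLE` and `translate` are introduced here for the example. Note that the encoded chain begins with the initiator methionine, so the "sixth position" of the mature β-chain is the seventh codon.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate an mRNA string codon by codon until a stop codon or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Illustrative fragment: AUG GUG CAU CUG ACU CCU GAG GAG AAG
normal = "AUGGUGCAUCUGACUCCUGAGGAGAAG"
# Single-base change in the seventh codon: GAG (Glu) becomes GUG (Val).
sickle = normal[:19] + "U" + normal[20:]

if __name__ == "__main__":
    print(translate(normal))
    print(translate(sickle))
```

A single nucleotide change alters exactly one residue of the translated chain, which is the sense in which the primary structure is read directly from the gene.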

Secondary structure

Secondary structure describes the local, regular conformations of the polypeptide backbone in a protein, primarily stabilized by hydrogen bonds between the backbone carbonyl oxygen (C=O) of one residue and the amide hydrogen (N-H) of another residue. These interactions occur within the backbone atoms, independent of side-chain effects, and give rise to repetitive structural motifs that form the building blocks of higher-order protein architecture. Unlike the linear primary sequence, secondary structures impose spatial constraints on the chain, influencing flexibility and overall folding propensity.

The predominant secondary structural elements are alpha-helices and beta-sheets. An alpha-helix is a right-handed coiled structure in which the backbone forms a cylindrical spiral, with approximately 3.6 residues per helical turn and a pitch of 5.4 Å. The stabilization arises from intra-chain hydrogen bonds between the carbonyl oxygen of residue i and the amide hydrogen of residue i+4, aligning parallel to the helix axis and spaced about 2.8 Å apart. This configuration was first proposed by Linus Pauling and Robert Corey based on stereochemical modeling of polypeptide chains. Alpha-helices are common in proteins, comprising about 30% of residues in globular proteins, and often cluster to form hydrophobic cores.

Beta-sheets consist of extended polypeptide strands, typically 5–10 residues long, that align either in a parallel (strands running in the same N-to-C direction) or antiparallel (opposite directions) fashion to form a pleated sheet-like array. Hydrogen bonds form between the carbonyl oxygen and amide hydrogen of adjacent strands, creating a network of bonds perpendicular to the strand direction and resulting in a twisted, pleated appearance due to the tetrahedral geometry of the alpha-carbon. Antiparallel beta-sheets are more stable than parallel ones because their hydrogen bonds are more linear.
Beta-sheets often form the core of beta-rich proteins like silk fibroin and contribute to the rigidity of structures such as immunoglobulin domains. This motif was also theoretically derived by Pauling and Corey shortly after the alpha-helix model.

Other secondary structural elements include beta-turns, loops, and less common variants like pi-helices. Beta-turns are tight, four-residue reversals in the polypeptide chain that connect successive beta-strands or other elements, allowing the chain to fold back on itself; they are classified into types I, II, and III based on dihedral angles, with type I being the most frequent (about 50% of turns). Loops are irregular, non-repetitive segments lacking consistent hydrogen bonding patterns, often exposed to solvent and varying in length from a few to tens of residues; they provide flexibility and can harbor functional sites. Pi-helices, a wider variant of the alpha-helix with 4.4 residues per turn and hydrogen bonds between residues i and i+5, occur infrequently (less than 1% of helical residues) and are typically found at the ends of alpha-helices or in distorted regions.

The conformational possibilities of the polypeptide backbone are constrained by steric hindrance and visualized in the Ramachandran plot, which maps the dihedral angles phi (φ, rotation around the N-Cα bond) and psi (ψ, rotation around the Cα-C bond) for each residue. Allowed regions correspond to sterically favorable conformations: the alpha-helix occupies φ ≈ -60°, ψ ≈ -45°; beta-sheets cluster around φ ≈ -120°, ψ ≈ +120°; and beta-turns appear in broader areas. Glycine, lacking a beta-carbon, populates more regions due to reduced steric clash, while proline is restricted by its ring structure. Disallowed areas represent high-energy steric clashes, while the peptide bond itself stays nearly planar (ω ≈ 180°). This plot, derived from model-building and energy calculations, underscores the limited flexibility of the backbone despite 20 possible side chains.
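
The two ingredients of a Ramachandran analysis, computing a dihedral angle from four atom positions and assigning a (φ, ψ) pair to a region, can be sketched as follows. This is a simplified illustration: the region centers are the approximate values quoted above, the ±40° tolerance is an arbitrary assumption, angle wraparound at ±180° is ignored, and the names `dihedral` and `classify` are introduced for this example. For φ the four atoms would be C(i-1), N, Cα, C; for ψ they would be N, Cα, C, N(i+1).

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 bond."""
    b0, b1, b2 = _sub(p0, p1), _sub(p2, p1), _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)  # unit vector along the central bond
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = tuple(x - _dot(b0, b1) * y for x, y in zip(b0, b1))
    w = tuple(x - _dot(b2, b1) * y for x, y in zip(b2, b1))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))


# Approximate region centers from the text; the tolerance is an assumption.
REGIONS = {"alpha-helix": (-60.0, -45.0), "beta-sheet": (-120.0, 120.0)}


def classify(phi, psi, tolerance=40.0):
    """Assign (phi, psi) to a named region, or 'other' if none matches."""
    for name, (phi0, psi0) in REGIONS.items():
        if abs(phi - phi0) <= tolerance and abs(psi - psi0) <= tolerance:
            return name
    return "other"


if __name__ == "__main__":
    print(dihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 0, 1)))
    print(classify(-57.0, -47.0), classify(-130.0, 135.0), classify(60.0, 40.0))
```

Real assignments use empirically derived density contours rather than rectangular boxes, so this classifier is only a rough first pass.
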
Secondary structures are predicted from amino acid sequences using empirical methods like the Chou-Fasman algorithm, which assigns propensity values (Pα, Pβ) to each residue based on their observed frequencies in known alpha-helices and beta-sheets from known protein structures. For example, alanine has high Pα (1.42) favoring helices, while valine favors beta-sheets (Pβ = 1.70). The algorithm scans the sequence for nucleating segments where four of six residues have P > 1.00, then extends until broken by helix-breakers like proline. Though accuracy is around 50–60%, it provided early insights into sequence-structure relationships before more advanced approaches.

Functionally, secondary elements contribute to active sites and stability; for instance, in myoglobin, eight alpha-helices (labeled A–H) pack around the heme group, forming a hydrophobic pocket that positions the iron for reversible oxygen binding and protects it from oxidation. This helical arrangement enables myoglobin's role as an oxygen-storage protein in muscle tissue. These local motifs ultimately pack via hydrophobic interactions and side-chain packing to form the tertiary structure.
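
The nucleation scan of the Chou-Fasman approach can be sketched as a sliding window. The example below covers only the helix nucleation rule stated above (at least four of six residues with Pα > 1.00) and omits the extension and beta-sheet steps. The propensity values are recalled from commonly reproduced Chou-Fasman tables and should be treated as approximate and checked against a published source before any real use; the function name `helix_nucleation_sites` is introduced for this example.

```python
# Helix propensities (P-alpha) by one-letter code. Approximate values as
# commonly tabulated for Chou-Fasman; verify before relying on them.
P_ALPHA = {
    "E": 1.51, "M": 1.45, "A": 1.42, "L": 1.21, "K": 1.16,
    "F": 1.13, "Q": 1.11, "W": 1.08, "I": 1.08, "V": 1.06,
    "D": 1.01, "H": 1.00, "R": 0.98, "T": 0.83, "S": 0.77,
    "C": 0.70, "Y": 0.69, "N": 0.67, "P": 0.57, "G": 0.57,
}


def helix_nucleation_sites(sequence, window=6, min_formers=4):
    """Start indices of windows with enough helix-favoring residues."""
    sites = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        # Count residues whose helix propensity exceeds 1.00.
        formers = sum(1 for aa in segment if P_ALPHA[aa] > 1.00)
        if formers >= min_formers:
            sites.append(start)
    return sites


if __name__ == "__main__":
    # Helix-favoring stretch followed by glycine- and proline-rich residues.
    print(helix_nucleation_sites("AEELLKGPGS"))
```

In the full method, each nucleation site would then be extended in both directions until the average propensity of a short window drops below a threshold.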

Tertiary structure

Tertiary structure refers to the overall three-dimensional arrangement of a single polypeptide chain, resulting from the folding of its secondary structural elements into a compact, functional conformation. This level of organization encompasses the spatial positioning of all atoms in the backbone and side chains (R groups), enabling the protein to perform its biological function. Unlike secondary structure, which involves local hydrogen bonding along the backbone, tertiary structure is stabilized primarily by interactions between distant parts of the chain or between side chains.

The stability of tertiary structure arises from a combination of non-covalent and covalent interactions. Hydrophobic interactions drive nonpolar side chains to cluster away from the aqueous environment, forming a hydrophobic core that minimizes contact with water. Ionic bonds, or salt bridges, form between oppositely charged side chains, such as those of aspartate and a basic residue, contributing to structural rigidity. Hydrogen bonds occur between polar side chains or between side chains and the backbone, beyond those in secondary structures. Van der Waals forces provide weak attractions between closely packed atoms in the core, while disulfide bridges, covalent bonds between cysteine residues, offer additional stabilization, particularly in extracellular proteins.

In the folded state, the hydrophobic core consists of buried nonpolar residues, shielded from water, while polar and charged residues are typically exposed on the surface, facilitating solubility and interactions with other molecules. This core formation is a key energetic driver of folding, as the burial of hydrophobic surfaces reduces the unfavorable entropy loss in surrounding water molecules. Side-chain interactions, including salt bridges and hydrogen bonds, further fine-tune the packing, ensuring precise alignment of functional groups.

Globular proteins, such as the monomeric subunit of hemoglobin, exemplify compact tertiary structures optimized for enzymatic or transport functions, with a hydrophobic interior and hydrophilic exterior.
In contrast, fibrous proteins like α-keratin display elongated tertiary folds, often dominated by extended secondary elements, providing mechanical strength in tissues. These examples highlight how tertiary architecture adapts to diverse roles, from catalysis in cellular environments to structural support.

Denaturation disrupts tertiary structure through agents such as heat, urea, or pH changes, causing the polypeptide to expand and lose its native conformation; in many cases this unfolding is reversible. Classic experiments on ribonuclease A demonstrated this reversibility: upon removal of denaturants and reoxidation of disulfide bonds, the protein refolds spontaneously into its functional form, underscoring that tertiary structure is thermodynamically determined by the amino acid sequence.

Tertiary folds exhibit greater evolutionary conservation than primary sequences, allowing proteins with low sequence identity to maintain similar three-dimensional architectures across species. This structural persistence permits functional divergence while preserving the overall fold, as seen in homologous proteins where mutations accumulate in surface loops but spare the buried core. In multi-subunit proteins, the tertiary fold of individual chains serves as the foundation for subsequent assembly.

Quaternary structure

Quaternary structure describes the non-covalent associations, and occasionally covalent linkages, of multiple polypeptide subunits to form a functional protein complex. These interactions typically occur between the folded tertiary structures of individual subunits, enabling the assembly of larger, often symmetric architectures essential for function.

Protein complexes exhibit various types, including homodimers composed of two identical subunits, heterodimers with two different subunits, and higher-order oligomers such as tetramers or larger assemblies. A prominent example is hemoglobin, a heterotetramer (α₂β₂) that facilitates oxygen transport in blood, whose subunits assemble via hydrophobic and electrostatic interactions at their interfaces. Homomers, formed by identical subunits, predominate among proteins with known quaternary structures, comprising 50–70% of such cases across diverse proteomes, which underscores their evolutionary prevalence and their role in simplifying assembly.

Subunit interfaces in quaternary structures involve specific contacts, such as hydrogen bonds, salt bridges, and van der Waals interactions, that stabilize the complex and often mediate allosteric effects, where binding at one site influences activity at another. These interfaces confer functional advantages, including enhanced stability against denaturation, regulatory control through cooperativity, and active sites that span multiple subunits. In aspartate transcarbamoylase, for example, the catalytic sites lie at the boundaries between catalytic subunits, where they support substrate binding and catalysis.

Quaternary assemblies can also dissociate in response to environmental cues, such as pH shifts or ligand binding; for instance, hemoglobin tetramers reversibly break into αβ dimers at low pH or upon ligand release, modulating oxygen affinity and preventing aggregation.

Structural domains, motifs, and folds

Structural domains

Structural domains are compact, semi-independent folding units within a protein that typically range from 50 to 200 amino acid residues in length and are often connected by flexible linker regions, allowing them to function autonomously while contributing to the overall protein architecture. These domains comprise arrangements of secondary structural elements, such as alpha helices and beta sheets, organized into a stable tertiary fold. Multi-domain proteins, which incorporate two or more such domains, are highly prevalent in nature, comprising more than 80% of proteins in eukaryotic organisms, compared to about 67% in prokaryotes. In some cases, domains participate in domain swapping, a process in which identical or similar protein monomers exchange structural elements to form dimers or higher-order oligomers, thereby modulating protein function and stability.

Protein domains serve diverse roles, including catalysis, binding, and regulation of protein activity. For instance, catalytic domains house the active sites for enzymatic reactions, while binding domains mediate interactions with substrates, cofactors, or other molecules; regulatory domains, such as the Src homology 2 (SH2) domain, bind phosphorylated tyrosine residues to propagate signaling cascades in cellular pathways such as receptor tyrosine kinase signaling. These functional specializations enable multi-domain proteins to integrate multiple processes within a single polypeptide chain.

From an evolutionary perspective, structural domains have expanded the functional diversity of proteomes through mechanisms such as domain duplication, where a segment of the gene encoding a domain is copied internally, and domain shuffling via genetic recombination, which rearranges domains between different proteins to create novel architectures. These processes, often facilitated by exon shuffling in eukaryotes, have driven the growing complexity of multi-domain proteins over evolutionary time.
Structural domains can be identified through experimental structure determination, such as X-ray crystallography or cryo-electron microscopy, which reveals compact folding units, or computationally via sequence analysis using databases like Pfam, which catalogs domain families based on hidden Markov models derived from multiple sequence alignments. A prominent example of structural domains is found in antibodies, where immunoglobulin domains adopt a characteristic beta-sandwich fold consisting of two antiparallel beta sheets stabilized by a disulfide bond, enabling antigen recognition and modulation of the immune response.

Sequence and structural motifs

Sequence and structural motifs are short, conserved patterns in protein sequences or three-dimensional structures that often confer specific biochemical functions, such as binding or catalysis. These motifs typically span fewer than 50 residues and are frequently embedded within larger protein domains, which distinguishes them from independently folding structural domains. Unlike domains, which are modular, self-contained units capable of folding autonomously, motifs serve as functional signatures that can occur in diverse structural contexts.

Sequence motifs consist of linear patterns of amino acids that are conserved across evolutionarily related proteins because of their functional importance. A prominent example is the pair of Walker A and Walker B motifs identified in ATP-binding proteins. The Walker A motif follows the consensus GxxxxGK[T/S] and forms a phosphate-binding loop (P-loop) that interacts with the phosphate groups of ATP, while the Walker B motif (hhhhDE, where h is a hydrophobic residue, D is aspartate, and E is glutamate) coordinates a magnesium ion essential for ATP hydrolysis. These motifs were first recognized in nucleotide-binding enzymes such as ATP synthase subunits and kinases. Sequence motifs are detected using regular expressions or pattern-matching algorithms in databases like PROSITE, which compiles biologically significant patterns for functional annotation of uncharacterized proteins.

Such motifs often define binding sites or catalytic residues critical for enzymatic activity. For instance, the catalytic triad in serine proteases, comprising serine, histidine, and aspartate residues arranged to facilitate nucleophilic attack on peptide bonds, enables efficient proteolysis and is conserved across families like chymotrypsin and subtilisin, despite low overall sequence similarity.

Structural motifs, in contrast, are recurring three-dimensional arrangements that may not be evident from sequence alone but are crucial for function.
The zinc finger motif, a compact ββα fold stabilized by a zinc ion coordinated to cysteine and histidine residues, enables DNA binding in transcription factors like TFIIIA, where tandem repeats recognize specific sequences. Similarly, the leucine zipper motif features two α-helices with leucine residues at every seventh position forming a coiled-coil dimer interface, facilitating protein-protein interactions in transcription factors such as C/EBP. Another example is the EF-hand motif, a helix-loop-helix structure with a 12-residue loop that binds calcium ions via oxygen-containing side chains, as seen in calmodulin, where calcium binding triggers conformational changes that enable target binding.

Detection of structural motifs relies on geometric searches in protein structure databases like the Protein Data Bank (PDB), using algorithms that match spatial arrangements of secondary elements or atom coordinates, often integrated into tools associated with the CATH or SCOP classifications. These motifs underpin diverse functions, including metal ion coordination, dimerization, and signal transduction, and their conservation highlights evolutionary pressure for functional specificity.
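The regular-expression approach to sequence motifs described above can be illustrated with a minimal Python sketch. It translates the Walker A consensus GxxxxGK[T/S] into a regular expression and scans a sequence for hits. The function name `find_walker_a` and the test sequence are illustrative assumptions, not taken from any database or tool, and a real annotation pipeline would use curated patterns or profiles rather than this bare consensus.

```python
import re

# Walker A (P-loop) consensus GxxxxGK[T/S] written as a regular expression.
# The lookahead wrapper lets overlapping candidate sites be reported too.
WALKER_A = re.compile(r"(?=(G.{4}GK[TS]))")

def find_walker_a(sequence: str) -> list[tuple[int, str]]:
    """Return (1-based start position, matched segment) for each hit."""
    return [(m.start() + 1, m.group(1)) for m in WALKER_A.finditer(sequence.upper())]

# Hypothetical 22-residue fragment used only to exercise the pattern.
toy_sequence = "MTEYKLVVVGAGGVGKSALTIQ"
for position, segment in find_walker_a(toy_sequence):
    print(position, segment)
```

A match only flags a candidate site: short consensus patterns also occur by chance, so hits are normally confirmed against structural or functional evidence.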

Supersecondary structures

Supersecondary structures, also known as motifs, are recurring combinations of two or more secondary structural elements, such as α-helices and β-strands, connected by short loops or turns. They form compact and stable spatial units that serve as intermediate building blocks between the secondary and tertiary levels of protein organization. These structures are characterized by specific geometric arrangements stabilized by hydrogen bonds, hydrophobic interactions, and van der Waals forces, and they are often more rigid than isolated secondary elements.

Common examples include the helix-turn-helix motif, consisting of two α-helices linked by a short turn, which provides a stable scaffold frequently observed in regulatory proteins. Another prevalent motif is the β-α-β unit, where two parallel β-strands are connected by an intervening α-helix in a right-handed crossover, contributing to the core of many enzymatic domains. The β-hairpin, formed by two antiparallel β-strands joined by a tight loop of 2–5 residues, is a simple yet versatile structure that can fold independently and is classified into types based on loop conformation and hydrogen bonding patterns. A more extended example is the Rossmann fold, comprising multiple tandem β-α-β motifs arranged around a central β-sheet, which is widely distributed in proteins binding dinucleotides like NAD⁺ and was first systematically described in comparative analyses of dehydrogenases.

Supersecondary structures play a crucial role in protein architecture by acting as modular building blocks that assemble into larger structural domains and folds, facilitating efficient packing and functional organization.
Their stability arises from the close packing of adjacent secondary elements, which minimizes solvent exposure and maximizes non-covalent interactions; for instance, spectroscopic studies have shown that isolated β-hairpins and α-α-corners can maintain native-like conformations in peptide fragments. Evolutionarily, supersecondary structures are highly conserved across diverse protein families, even in non-homologous sequences, suggesting that they emerged as ancient folding nuclei that have been reused and diversified throughout protein evolution, as evidenced by phylogenetic analyses of motifs like the Rossmann fold in cofactor-binding enzymes.

Prediction of these structures relies on sequence-based methods that exploit propensities for secondary element formation and loop flexibility, ranging from early statistical approaches that recognize patterns in amino acid sequences to modern machine learning models trained on structural databases like the Protein Data Bank. Such predictions integrate supersecondary units into models of tertiary folds to guide overall structure determination.

Protein folds

Protein folds are the distinctive three-dimensional topologies formed by the backbone of polypeptide chains, encompassing the overall arrangement of secondary structural elements without regard to the specific sequence. These topologies are recurrent patterns observed across diverse proteins, such as the TIM barrel, which features a central cylinder of eight parallel β-strands encircled by eight α-helices and supports enzymatic activity in numerous metabolic pathways. Another prominent example is the immunoglobulin fold, a β-sandwich structure composed of two Greek key β-sheets packed against each other, commonly found in immune recognition proteins.

Despite the immense diversity of protein sequences (20^100, or roughly 10^130, possible 100-residue polypeptides), the structural fold space is remarkably constrained, with approximately 2,000 distinct folds cataloged in major databases as of 2024. Recent AI-driven predictions, such as those from AlphaFold, have expanded the catalog, revealing nearly 200 new folds in 2024 and further illuminating fold space. This limitation arises from biophysical constraints on stable, functional architectures, allowing unrelated sequences to arrive independently at the same fold through convergent evolution, where selective pressures favor similar structural solutions for analogous roles. In contrast, divergent evolution preserves folds within homologous protein families descended from a common ancestor. Such convergence is evident in cases like the Rossmann fold, a β-α-β motif repeated to form a nucleotide-binding domain, which has been adopted by dehydrogenases and other enzymes handling NAD(P)-dependent reactions across distant lineages.

Protein folds are systematically classified in hierarchical databases like SCOP (Structural Classification of Proteins) and CATH (Class, Architecture, Topology, and Homologous superfamily), which delineate folds based on topological connectivity and geometric similarity of secondary elements.
SCOP organizes structures into classes, folds, superfamilies, and families, emphasizing evolutionary relationships, while CATH focuses on architectural and topological descriptors to group domains. These resources reveal that specific folds often constrain functional possibilities; for example, the Rossmann fold predominantly supports coenzyme-binding roles in oxidoreductases, limiting the range of reactions it can accommodate.

Beyond sequence-based methods, fold comparison enables the detection of distant homologs by identifying shared topologies obscured by low sequence identity, thus inferring evolutionary and functional connections between proteins that diverged over billions of years. Tools based on structural alignment, such as those associated with SCOP and CATH, facilitate this by quantifying similarities in backbone geometry, aiding functional annotation of uncharacterized proteins. This approach underscores how the exploration of fold space bridges gaps in understanding protein evolution and multifunctionality.

Protein dynamics and conformational changes

Protein dynamics

Protein dynamics refers to the time-dependent fluctuations and movements within protein structures that occur even in their native states and that influence biological function. These motions range from small-scale atomic vibrations to large-scale domain rearrangements, allowing proteins to adapt to environmental changes and interact with other molecules. Understanding these dynamics is essential, as they underpin processes such as enzymatic activity and molecular recognition.

Key types of motion include side-chain rotations, in which side chains change conformation by rotating about their chi angles; loop flexibility, where flexible loops bend or twist; and hinge bending, characterized by rigid-body rotations between structural domains connected by flexible hinges. Side-chain rotations enable local adjustments for substrate positioning, loop flexibility controls access to active sites, and hinge bending produces overall shape changes in multi-domain proteins.

These motions span a broad range of timescales, from femtoseconds for bond vibrations to milliseconds for domain movements and hinge bending. Vibrational motions in the femtosecond range involve stretching and bending of covalent bonds, picosecond to nanosecond timescales capture loop and side-chain dynamics, and microsecond to millisecond events correspond to larger conformational shifts.

Molecular dynamics (MD) simulations and time-resolved spectroscopy are primary techniques for probing protein dynamics. MD simulations model atomic trajectories over time using force fields to predict motions from femtoseconds to microseconds, providing insight into timescales that are difficult to access experimentally. Time-resolved spectroscopy, such as NMR spin relaxation or fluorescence methods, monitors spectroscopic signals, in some experiments after a perturbation, to reveal dynamics on picosecond to millisecond scales.
Protein dynamics play critical functional roles, including enabling catalysis by positioning substrates in enzyme active sites, facilitating ligand binding through transient openings, and mediating allostery, where motions in one region propagate signals to distant sites. For instance, breathing motions (collective expansions and contractions of the protein core) allow enzymes to accommodate substrates and release products, enhancing catalytic efficiency.

A significant aspect of protein dynamics is intrinsic disorder, where certain regions lack a fixed three-dimensional structure and instead exist as dynamic ensembles under physiological conditions. These intrinsically disordered regions (IDRs) are prevalent: by some estimates they occur in over 70% of proteins, and they are particularly common in signaling proteins, about 66% of which contain long disordered segments that enable flexible interactions with multiple partners. IDRs contribute to dynamics by allowing the rapid conformational sampling essential for regulatory functions. These dynamic processes contribute to the broader conformational ensembles that proteins sample, linking microscopic motions to functional versatility.

Conformational ensembles

Proteins in their native environments exist not as rigid, single structures but as dynamic ensembles of multiple three-dimensional conformations that interconvert under physiological conditions. This view challenges the traditional static model derived from early X-ray crystallography, emphasizing instead the inherent flexibility essential for biological function. The conformational ensemble is the Boltzmann-weighted population of states accessible to the protein, where the occupancy of each conformation is determined by its free energy relative to the others.

The distribution of conformations within an ensemble follows the Boltzmann distribution:

$$P_i = \frac{e^{-\Delta G_i / RT}}{\sum_j e^{-\Delta G_j / RT}}$$

where $P_i$ is the probability of conformation $i$, $\Delta G_i$ is its free energy difference from the reference state, $R$ is the gas constant, and $T$ is the absolute temperature. Lower-energy conformations dominate the ensemble, while higher-energy, low-population states can still contribute to function if transiently stabilized.

Experimental sampling of these ensembles relies on techniques that probe structural heterogeneity and populations. Nuclear magnetic resonance (NMR) relaxation measurements, such as $^{15}$N spin relaxation rates, report on conformational exchange on microsecond-to-millisecond timescales and yield order parameters and correlation times that reflect motional amplitudes across the ensemble. Similarly, single-molecule Förster resonance energy transfer (smFRET) tracks real-time distance fluctuations between fluorophore-labeled sites, enabling direct observation of conformational subpopulations and their interconversion kinetics without ensemble averaging.

Conformational ensembles play a critical role in protein-ligand interactions, where two mechanisms contribute to specificity and efficiency: conformational selection, in which the ligand binds a rare, pre-existing state and thereby shifts the equilibrium, and induced fit, in which binding actively drives a conformational change.
These processes are not mutually exclusive; many systems exhibit hybrid behavior, with selection dominating low-affinity initial encounters and induced fit stabilizing the bound state. A representative example is adenylate kinase, an enzyme that catalyzes phosphate transfer and alternates between an open, substrate-accessible conformation (predominant in the apo form) and a closed, catalytically active state upon binding ATP and AMP, with the ensemble allowing the rapid transitions essential for its kinetic cycle. These ensembles arise from underlying dynamic motions that sample the energy landscape, though the focus here is on the equilibrium distribution rather than transient kinetics.

Advances in structural biology have enhanced ensemble characterization, particularly through cryo-electron microscopy (cryo-EM), which captures snapshots of heterogeneous states in near-native conditions and, when combined with computational modeling, resolves multiple conformers and their relative populations from vitrified samples. This integration overcomes limitations of traditional methods by accommodating larger, more complex systems and providing atomic-level insight into lowly populated states.
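As a worked illustration of the Boltzmann weighting described in this section, the following Python sketch converts a list of relative free energies into equilibrium populations. The three energy values, the function name `boltzmann_populations`, and the choice of kcal/mol units are assumptions made for the example; they do not describe any particular protein.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)

def boltzmann_populations(delta_g_kcal, temperature_k=298.15):
    """Return P_i = exp(-dG_i/RT) / sum_j exp(-dG_j/RT) for each state."""
    rt = R_KCAL * temperature_k
    weights = [math.exp(-dg / rt) for dg in delta_g_kcal]
    partition = sum(weights)  # normalizing sum over all states
    return [w / partition for w in weights]

# Hypothetical three-state ensemble: ground state plus two excited states
# lying 1 and 3 kcal/mol above it.
populations = boltzmann_populations([0.0, 1.0, 3.0])
for dg, p in zip([0.0, 1.0, 3.0], populations):
    print(f"dG = {dg:.1f} kcal/mol -> population {p:.3f}")
```

Because RT is only about 0.6 kcal/mol at room temperature, a state a few kcal/mol above the ground state is populated at well under one percent, which is why such states are hard to observe yet can still matter for conformational selection.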

Protein folding

Folding mechanisms

Protein folding mechanisms describe the physical processes by which polypeptide chains transition from disordered, unfolded states to their functional native conformations, navigating an immense conformational space on biologically feasible timescales. The Levinthal paradox highlights the challenge: for a typical protein of 100 residues, each able to adopt approximately 3 conformations, the total number of possible structures is 3^100, which exceeds 10^47. Yet proteins fold in milliseconds to seconds, whereas random sampling of that many states would take far longer than the age of the universe. The paradox shows that folding cannot proceed by exhaustive random search but must follow directed pathways biased by the protein's energy landscape.

The folding funnel model resolves the Levinthal paradox by picturing the protein's free energy landscape as a funnel-shaped surface, where the unfolded ensemble occupies the broad, high-energy rim and progressively loses conformational entropy while decreasing in free energy toward the native state at the bottom. In this statistical mechanical framework, evolutionarily optimized sequences minimize energetic frustration, creating smooth funnels that guide folding without deep kinetic traps and enable rapid convergence to the native structure. The funnel's ruggedness reflects local minima, but the overall bias toward the native state ensures efficient folding for minimally frustrated proteins.

Proteins exhibit either two-state or multi-state folding kinetics, depending on their size and topology. In two-state folding, the transition from the unfolded (U) to the native (N) state is cooperative, with no detectably populated intermediates, as seen in small, single-domain proteins like chymotrypsin inhibitor 2 (CI2), where the folding rate is limited by a single high-energy transition state.
Multi-state folding involves obligatory on-pathway intermediates and is common in larger proteins, where partial structures form sequentially before the native state is reached, giving more complex energy landscapes with multiple barriers. The distinction arises from the protein's foldon units (cooperative substructures that nucleate folding) and is probed by comparing equilibrium and kinetic unfolding measurements.

A key mechanism in both two-state and multi-state folding is nucleation-condensation, where an initial nucleus of ordered secondary structure forms, followed by rapid condensation of the remaining chain around it to stabilize tertiary interactions. In CI2, for example, a diffuse nucleus involving the α-helix and part of the β-sheet initiates folding, with the transition state featuring partial native-like interactions that propagate structure formation. This hybrid mechanism combines elements of the framework model (secondary structure first) and the hydrophobic collapse model, optimizing folding rates by coupling local and nonlocal contacts early. It predominates in small globular proteins, ensuring cooperative transitions without stable off-pathway species.

Despite directed pathways, folding landscapes contain off-pathway traps where misfolded conformations form kinetic dead-ends, leading to aggregates or amyloid fibrils. These traps arise from sequence-specific frustration, such as improper hydrophobic burial or β-strand mispairing, which slows productive folding. A prominent example is the prion protein, whose cellular form PrP^C (largely α-helical) can misfold into the β-sheet-rich PrP^Sc isoform, seeding self-propagating aggregates that cause transmissible spongiform encephalopathies. Such off-pathway events highlight the role of kinetic partitioning in folding efficiency, with misfolding rates increasing under cellular stress.

Experimental probes elucidate these mechanisms through kinetic and structural analyses.
Phi (Φ)-value analysis characterizes transition-state structure by measuring how mutations change folding and unfolding rates relative to their effect on stability: Φ ≈ 1 indicates that the mutated site has native-like interactions in the transition state, while Φ ≈ 0 indicates that it is as unstructured as in the unfolded state. In barnase, Φ values revealed a polarized transition state with a structured core. Stopped-flow kinetics, which uses rapid mixing to initiate refolding and monitors fluorescence or absorbance changes, resolves millisecond-scale transitions and distinguishes two-state behavior (chevron plots with linear arms) from multi-state behavior (curved arms indicative of intermediates). In vivo, these intrinsic mechanisms are supported by chaperones that prevent aggregation, but the core pathways remain sequence-determined.
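The arithmetic behind the Levinthal paradox discussed in this section can be checked with a short Python sketch. The sampling rate of 10^13 conformations per second is an assumed round figure chosen for illustration (roughly the timescale of bond vibrations), and the function name and constants are likewise hypothetical helpers rather than values from the text.

```python
SECONDS_PER_YEAR = 3.156e7        # approximate length of a year in seconds
AGE_OF_UNIVERSE_YEARS = 1.38e10   # approximate age of the universe

def levinthal_search_time_years(n_residues=100, states_per_residue=3,
                                samples_per_second=1e13):
    """Time to enumerate every conformation once, in years."""
    conformations = states_per_residue ** n_residues  # exact integer, 3^100 by default
    seconds = conformations / samples_per_second      # assumed sampling rate
    return seconds / SECONDS_PER_YEAR

years = levinthal_search_time_years()
print(f"conformations: {3 ** 100:.2e}")
print(f"exhaustive search: {years:.2e} years")
print(f"ratio to age of universe: {years / AGE_OF_UNIVERSE_YEARS:.2e}")
```

Even with this generous sampling rate the exhaustive search exceeds the age of the universe by many orders of magnitude, which is the quantitative core of the argument that folding must be a biased, funnel-like search.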

Chaperones and folding assistants

Molecular chaperones, particularly the heat shock protein (Hsp) families, play essential roles in assisting protein folding within the crowded cellular environment by preventing misfolding and aggregation of nascent or stress-damaged polypeptides. These proteins do not impart a specific folded structure but instead facilitate correct folding through transient interactions that shield hydrophobic regions exposed in unfolded states. Among the major classes, Hsp70 and Hsp60 (chaperonins) are key types that operate via distinct but complementary mechanisms to promote productive folding pathways.

Hsp70 chaperones, such as the bacterial DnaK and the eukaryotic Hsc70 or inducible Hsp70, bind unfolded polypeptide chains and stabilize them in a conformation competent for folding. Binding occurs through an ATP-dependent cycle: in the ATP-bound state, the substrate-binding domain (SBD) adopts an open conformation with low affinity for substrates; ATP hydrolysis, stimulated by co-chaperones, switches the SBD to a closed, high-affinity state that clamps onto hydrophobic segments of the unfolded chain, isolating them from aggregation-prone interactions. Nucleotide exchange factors then promote ADP release and ATP rebinding, releasing the substrate so that it can attempt to fold. This iterative cycle prevents premature aggregation and enables repeated binding and release events, increasing the likelihood of reaching the native state.

In contrast, Hsp60 chaperonins, exemplified by the bacterial GroEL, function by encapsulating substrates within a protected cavity to isolate them during folding. GroEL forms a double-ring structure with 14 identical subunits, each containing apical, intermediate, and equatorial domains; the equatorial domain binds ATP, while the apical domain captures unfolded proteins via hydrophobic grooves.
Upon ATP binding, the co-chaperonin GroES caps one ring, enlarging the central cavity and converting its lining to a largely hydrophilic surface, which releases the bound substrate into the enclosed chamber where it can fold. This encapsulation mechanism sequesters a single substrate protein per cycle, preventing the intermolecular associations that lead to aggregates, and the process repeats (iterative annealing) until the native fold is achieved.

Co-chaperones regulate these cycles for specificity and efficiency. Hsp40 proteins (DnaJ homologs), which contain a J-domain, target unfolded chains to Hsp70 by first binding substrates themselves and then stimulating Hsp70's ATPase activity up to 1000-fold through interaction with its nucleotide-binding domain. This enhances substrate delivery and locks the complex in the high-affinity state. Hop (STIP1), another co-chaperone, bridges Hsp70 and Hsp90 by binding their C-terminal motifs via TPR domains, facilitating the transfer of partially folded clients to Hsp90 for further maturation in a coordinated chaperone network. These regulators ensure timely progression through folding stages and help prevent kinetic traps.

Chaperones are vital for de novo folding of newly synthesized proteins emerging from ribosomes, where Hsp70 systems capture nascent chains co-translationally to avert aggregation in the cytosol. Under stress conditions, such as heat shock, they also mediate refolding of denatured proteins; for instance, Hsp70 solubilizes aggregates in cooperation with disaggregases, allowing recapture and iterative folding attempts. Chaperonins like GroEL similarly assist refolding by providing an isolated compartment, as demonstrated with substrates like rhodanese, where encapsulation yields up to 90% recovery of native activity after denaturation.

In eukaryotes, the TRiC (or CCT) chaperonin serves as a functional analog of GroEL, folding approximately 10% of the cytosolic proteome, including actin and tubulin.
Composed of eight distinct subunits per ring that form a hetero-oligomeric double ring, TRiC uses a built-in lid mechanism rather than a separate co-chaperonin like GroES; ATP hydrolysis drives asymmetric conformational changes that sequentially close the chamber, creating a polarized environment that guides substrate folding. It often cooperates with prefoldin for delivery of obligate substrates, highlighting its specialized role in complex eukaryotic folding.

Deficiencies in chaperone function contribute to neurodegenerative diseases characterized by protein aggregation. In conditions like Alzheimer's and Parkinson's disease, impaired Hsp70 activity leads to accumulation of misfolded tau or α-synuclein, exacerbating neuronal toxicity through failed refolding and clearance. Similarly, reduced TRiC efficiency disrupts cytoskeletal protein folding and promotes amyloid formation and synaptic loss in Huntington's disease models. These chaperone deficits underscore their protective role against proteotoxic stress in the aging brain.

Protein stability

Thermodynamic principles

The native conformation of a protein represents the thermodynamically most stable state under physiological conditions, corresponding to the global minimum of the free energy landscape. The stability of this native state relative to the unfolded ensemble is quantified by the standard free energy change for unfolding,

$$\Delta G^\circ = G_U - G_N = \Delta H - T\Delta S$$

where $\Delta H$ is the enthalpy change, $T$ is the absolute temperature, and $\Delta S$ is the entropy change; a positive $\Delta G^\circ$ ensures that the native state predominates at equilibrium. This thermodynamic framework underpins Anfinsen's dogma, which posits that the amino acid sequence of a protein encodes the information necessary for it to reach its thermodynamically favored native structure spontaneously, as demonstrated by refolding experiments on ribonuclease A.

Proteins exhibit marginal thermodynamic stability, with the native state typically only 5–15 kcal/mol more stable than the unfolded state under ambient conditions, allowing functional flexibility while preventing aggregation. This narrow energy margin arises from a delicate balance of enthalpic and entropic contributions. Unfolding exposes hydrophobic residues to water, leading to a characteristic positive heat capacity change, $\Delta C_p > 0$, typically on the order of 1–3 kcal/(mol·K) for small proteins. The $\Delta C_p$ term governs the temperature dependence of $\Delta G^\circ$ via the Gibbs-Helmholtz relation,

$$\Delta G(T) = \Delta H(T_0) + \int_{T_0}^{T} \Delta C_p \, dT' - T \left[ \Delta S(T_0) + \int_{T_0}^{T} \frac{\Delta C_p}{T'} \, dT' \right]$$

which yields curved, approximately parabolic stability curves that peak at a temperature of maximum stability and cross zero on both sides of it, accounting for both heat and cold denaturation.

For many globular proteins, unfolding follows a two-state model, approximating an all-or-nothing transition between native (N) and unfolded (U) states without stable intermediates: $N \rightleftharpoons U$.
The equilibrium constant is $K = \frac{[U]}{[N]} = e^{-\Delta G^\circ / RT}$, where $R$ is the gas constant, allowing stability parameters to be extrapolated from denaturation experiments that use chemical denaturants or temperature. Differential scanning calorimetry (DSC) directly measures the heat capacity as a function of temperature, yielding unfolding endotherms from which $\Delta H$, $T_m$ (the midpoint temperature), and $\Delta C_p$ are derived to construct comprehensive stability profiles.
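To make the temperature dependence concrete, the sketch below evaluates the Gibbs-Helmholtz relation under two simplifying assumptions: $\Delta C_p$ is treated as temperature-independent, and the reference temperature $T_0$ is taken to be $T_m$, where $\Delta G = 0$ and therefore $\Delta S(T_m) = \Delta H(T_m)/T_m$. With those assumptions the integrals reduce to $\Delta G(T) = \Delta H_m (1 - T/T_m) + \Delta C_p \left[ (T - T_m) - T \ln(T/T_m) \right]$. The parameter values ($T_m$ = 333.15 K, $\Delta H_m$ = 100 kcal/mol, $\Delta C_p$ = 1.5 kcal/(mol·K)) are hypothetical and only meant to be of a plausible magnitude for a small protein.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)

def delta_g_unfold(t, tm=333.15, dh_m=100.0, dcp=1.5):
    """Unfolding free energy (kcal/mol) at temperature t (K), constant dCp."""
    return dh_m * (1.0 - t / tm) + dcp * ((t - tm) - t * math.log(t / tm))

def fraction_unfolded(t, **params):
    """Two-state model: K = [U]/[N] = exp(-dG/RT), f_U = K / (1 + K)."""
    k = math.exp(-delta_g_unfold(t, **params) / (R_KCAL * t))
    return k / (1.0 + k)

for temp in (210.0, 273.15, 298.15, 333.15, 350.0):
    print(f"T = {temp:6.2f} K  dG = {delta_g_unfold(temp):7.2f} kcal/mol  "
          f"f_U = {fraction_unfolded(temp):.3g}")
```

With these illustrative parameters the stability near room temperature falls inside the 5–15 kcal/mol range quoted above, and $\Delta G$ becomes negative again at low temperature, which is the model's version of cold denaturation (in practice often masked by freezing).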

Factors influencing stability

Protein stability is modulated by a variety of environmental and molecular factors that shift the balance between the folded and unfolded states, primarily by altering the free energy landscape described under thermodynamic principles. These factors can either enhance or disrupt stabilizing interactions such as hydrogen bonds, hydrophobic effects, and electrostatic forces within the protein structure.

The pH of the surrounding environment significantly affects protein stability by protonating or deprotonating ionizable residues, which in turn alters electrostatic interactions such as salt bridges and charge repulsion. At extreme pH values, such as highly acidic or basic conditions, the net charge on the protein can increase, leading to repulsion between like-charged residues and subsequent unfolding. For instance, many proteins exhibit optimal stability near their isoelectric point, where the net charge is minimized and electrostatic repulsion is reduced. Ionic strength, determined by salt concentration, modulates these electrostatic effects by screening charges (Debye–Hückel screening); low ionic strength preserves the charge–charge attractions that stabilize salt bridges, while high ionic strength can weaken them, potentially destabilizing the structure. In monoclonal antibodies, for example, increasing ionic strength from low to moderate levels often stabilizes the folded state by reducing unfavorable repulsions.

Temperature exerts a profound influence on protein stability: elevated temperatures promote thermal denaturation by increasing molecular motion and disrupting weak non-covalent interactions, leading to unfolding above a characteristic melting temperature (Tm). Conversely, cold denaturation occurs at low temperatures, where the hydrophobic effect weakens because the entropy gain upon burial of nonpolar residues is reduced, destabilizing the core.
Pressure affects stability through volumetric changes; high hydrostatic pressure favors the unfolded state by compressing voids in the protein structure and promoting water penetration, as seen in pressure-induced denaturation of globular proteins. This is particularly relevant for deep-sea organisms, which can experience pressures exceeding 100 MPa, yet whose adapted proteins maintain integrity via compact folding.

Ligands and cofactors play crucial roles in stabilizing proteins by binding to specific sites, often rigidifying the structure and shifting the equilibrium toward the folded state. Small-molecule ligands can form additional hydrogen bonds or hydrophobic contacts, enhancing overall stability, as demonstrated in screening methods that identify stabilizing additives for therapeutic proteins. Metal ions, such as zinc or calcium, serve as cofactors that coordinate with residues in the active site or core, bridging distant parts of the polypeptide chain and preventing unfolding; for example, in zinc-finger proteins, metal binding can increase thermal stability by as much as 20–30 °C. These bound states are essential for some enzymes, in which loss of the cofactor leads to rapid degradation.

Mutations alter protein stability by modifying intramolecular interactions, with effects ranging from stabilizing to destabilizing depending on their location and nature. Core mutations that improve packing density, such as replacing a smaller residue with a bulkier one, can enhance hydrophobic interactions and increase stability, as observed in engineered variants of T4 lysozyme. Conversely, surface mutations that introduce charge mismatches or disrupt hydrogen bonds often destabilize the structure, contributing to diseases such as cystic fibrosis via misfolding of the CFTR protein. The stability effects of single-point mutations are roughly Gaussian-distributed, with most causing modest destabilization because wild-type proteins are only marginally stable.
Post-translational modifications, particularly glycosylation, contribute to stability by adding carbohydrate moieties that shield hydrophobic regions, promote proper folding, and resist proteolytic degradation. N-linked glycosylation, for instance, stabilizes glycoproteins such as immunoglobulins by increasing solubility and reducing aggregation propensity through steric hindrance. In therapeutic monoclonal antibodies, glycosylation at specific sites enhances thermal stability by modulating surface charge and hydrogen-bonding networks. This modification is critical in eukaryotic proteins, where its absence often leads to endoplasmic reticulum stress and degradation.

In extremophiles, adaptations enhance protein stability under harsh conditions. Thermophilic proteins often feature additional disulfide bonds, which covalently link distant cysteines to rigidify the structure and resist unfolding. These proteins also carry more charged residues on their surfaces, which strengthens salt bridges, and have more compact cores with optimized hydrophobic packing, allowing function at temperatures above 80 °C. Such adaptations, evolved through selection for thermostability, also include features that limit chain flexibility, as seen in hyperthermophilic archaeal enzymes.
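The pH dependence of net charge discussed in this section can be illustrated with a short Henderson–Hasselbalch calculation. This is a rough sketch only: the pKa values are approximate textbook numbers for isolated side chains (real values shift with the local environment in a folded protein), the peptide sequence is hypothetical, and the function names are illustrative.

```python
# Approximate side-chain and terminal pKa values (assumed, environment-independent).
POSITIVE_PKA = {"K": 10.5, "R": 12.5, "H": 6.0}
NEGATIVE_PKA = {"D": 3.9, "E": 4.3, "C": 8.3, "Y": 10.1}
N_TERM_PKA = 9.0
C_TERM_PKA = 2.0


def net_charge(sequence, ph):
    """Estimate the net charge of a peptide at a given pH."""
    positive = [N_TERM_PKA] + [POSITIVE_PKA[aa] for aa in sequence if aa in POSITIVE_PKA]
    negative = [C_TERM_PKA] + [NEGATIVE_PKA[aa] for aa in sequence if aa in NEGATIVE_PKA]
    # Fraction protonated (charged) for basic groups.
    charge = sum(1.0 / (1.0 + 10.0 ** (ph - pka)) for pka in positive)
    # Fraction deprotonated (charged) for acidic groups.
    charge -= sum(1.0 / (1.0 + 10.0 ** (pka - ph)) for pka in negative)
    return charge


def isoelectric_point(sequence, lo=0.0, hi=14.0, tol=1e-4):
    """Find the pH of zero net charge by bisection (charge falls as pH rises)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if net_charge(sequence, mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


PEPTIDE = "MKDEHRKAYC"  # hypothetical sequence used only for illustration

if __name__ == "__main__":
    for ph in (2.0, 7.0, 12.0):
        print(f"pH {ph:4.1f}: net charge {net_charge(PEPTIDE, ph):+.2f}")
    print(f"estimated pI: {isoelectric_point(PEPTIDE):.2f}")
```

The net charge is positive at low pH, negative at high pH, and passes through zero at the estimated isoelectric point, which is where the text above places the minimum in electrostatic repulsion.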

Experimental determination of protein structures

Biophysical techniques

Biophysical techniques for determining protein structures rely on physical principles that probe atomic arrangements through the interaction of radiation or magnetic fields with the sample, enabling the reconstruction of three-dimensional models from experimental data. These methods exploit phenomena such as diffraction, scattering, and magnetic resonance to generate signals that, when analyzed, yield electron density distributions or positional coordinates of atoms within proteins. The goal is to achieve sufficient resolution to distinguish atomic features, measured in angstroms (Å): high-resolution data allow precise placement of individual atoms, while lower-resolution data provide overall shapes and secondary structure elements.

Resolution in protein structure determination refers to the smallest distance between features that can be reliably distinguished. Resolutions better than about 3 Å are generally considered near-atomic and enable the visualization of side-chain orientations and hydrogen-bonding patterns, whereas resolutions worse than 4 Å are considered low and reveal only the protein's gross architecture, such as domain arrangements. For instance, structures at 1–2 Å allow unambiguous atom tracing, akin to seeing individual beads on a string, while low-resolution maps at 5–10 Å resemble fuzzy outlines.

Sample state is a critical aspect that varies by technique: crystals are required for X-ray diffraction, which relies on an ordered lattice for wave interference; solution samples are used for nuclear magnetic resonance (NMR), which preserves native dynamics in a liquid environment; and frozen-hydrated samples are used for cryo-electron microscopy, which preserves biomolecules in near-native conditions without crystals.

The core principles of diffraction and scattering underpin many techniques: incident waves (e.g., X-rays or electrons) interact with the atoms of the protein, producing interference patterns that encode spatial information. These patterns are converted by Fourier transforms into electron density maps, which show regions of high electron concentration corresponding to atomic positions; building a model into these maps is guided by the protein's known amino acid sequence.
In scattering approaches such as small-angle X-ray scattering, the overall shape is inferred from low-angle deflections without needing atomic detail. Limitations persist, including the phase problem in X-ray crystallography, where diffraction intensities are measured but phase information is lost and must be recovered by indirect methods such as isomorphous replacement, and size constraints in NMR, which is typically limited to proteins under 50 kDa because slower tumbling in larger molecules broadens the signals.

To overcome the shortcomings of individual techniques, hybrid approaches integrate data from multiple sources into more complete models, for example by combining a low-resolution envelope from one method with high-resolution fragments from another to assemble the full structure of a large complex. These integrative methods use computational frameworks to fit and validate components against complementary datasets, enhancing accuracy for dynamic or heterogeneous systems.

A pivotal historical milestone was the determination of the first protein structure, myoglobin, at 6 Å resolution in 1958 by John Kendrew and colleagues using X-ray crystallography, marking the advent of three-dimensional structural insight into globular proteins and earning Kendrew a share of the 1962 Nobel Prize in Chemistry. Subsequent refinement to 2 Å in 1960 confirmed the alpha-helical fold, revolutionizing structural biology.
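The role of the Fourier transform and the consequence of losing phases can be seen in a toy one-dimensional example. The sketch below is an illustration under strong simplifications, not a crystallographic calculation: it places two point "atoms" in a unit cell of length 1, computes structure factors directly from their positions, and then reconstructs the density once with the true phases and once with amplitudes only. The atom positions, weights, and function names are all made up for the example.

```python
import cmath
import math


def structure_factors(atoms, h_max):
    """F(h) = sum_j f_j * exp(2*pi*i*h*x_j) for point atoms at fractional x_j."""
    return {
        h: sum(f * cmath.exp(2j * math.pi * h * x) for x, f in atoms)
        for h in range(-h_max, h_max + 1)
    }


def density(factors, n_points):
    """Fourier synthesis rho(x) = sum_h F(h) * exp(-2*pi*i*h*x) on a regular grid."""
    grid = [k / n_points for k in range(n_points)]
    return [
        sum(f_h * cmath.exp(-2j * math.pi * h * x) for h, f_h in factors.items()).real
        for x in grid
    ]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical 1D "crystal": (fractional position, scattering weight).
ATOMS = [(0.20, 6.0), (0.55, 8.0)]

factors = structure_factors(ATOMS, h_max=10)
rho_with_phases = density(factors, n_points=100)

# What a diffraction experiment records: amplitudes |F(h)| with the phases lost.
amplitudes_only = {h: abs(f_h) for h, f_h in factors.items()}
rho_without_phases = density(amplitudes_only, n_points=100)

if __name__ == "__main__":
    print("peak with phases at x =", argmax(rho_with_phases) / 100)
    print("peak without phases at x =", argmax(rho_without_phases) / 100)
```

With the phases, the strongest peak falls on the heavier atom; with amplitudes alone, the synthesis piles up at the origin and the atom positions are no longer recovered directly, which is why phases must be obtained by indirect methods.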

Key experimental methods

X-ray crystallography remains the most widely used technique for determining high-resolution protein structures, accounting for over 80% of entries in structural databases as of 2023. The process begins with the challenging task of growing well-ordered protein crystals, which often requires extensive optimization of conditions such as pH, temperature, and precipitant concentration. Once crystals are obtained, they are exposed to a beam of X-rays, which scatter off the atoms to produce diffraction patterns; these patterns are analyzed using Fourier transforms to reconstruct electron density maps. Atomic models are then built into these maps and refined iteratively, often reaching resolutions better than 2 Å for small to medium-sized proteins, as exemplified by the structure of myoglobin solved at 2 Å resolution in 1960.

Nuclear magnetic resonance (NMR) spectroscopy complements X-ray crystallography by providing structures of proteins in solution, which more closely mimics physiological conditions. It relies on measuring nuclear Overhauser effects (NOEs) to identify spatial proximities between atoms, typically within 5 Å, along with restraints from scalar coupling constants and chemical shift data that help define secondary structure. For proteins up to about 50 kDa, multidimensional NMR experiments, such as 3D or 4D heteronuclear methods, enable resonance assignment and structure calculation using restrained molecular dynamics simulations, yielding ensembles that capture conformational flexibility; NMR measurements also provide insights into dynamics over a wide range of timescales.

Cryo-electron microscopy (cryo-EM) has undergone a "resolution revolution" since the 2010s, driven by advances in direct electron detectors, phase plates, and computational image processing, enabling routine determination of structures at near-atomic resolution (better than 3 Å); as of 2025, resolutions better than 2 Å are increasingly common.
In single-particle cryo-EM, purified proteins are flash-frozen in vitreous ice to preserve native states and imaged at cryogenic temperatures to minimize beam damage; many thousands of particle projections are then aligned and averaged with software such as RELION or cryoSPARC to reconstruct 3D density maps. This method excels for large macromolecular complexes, such as ribosomes or viral particles exceeding 500 kDa, and it has also resolved membrane protein complexes, as in the 3.4 Å map of the human γ-secretase complex in 2015.

Small-angle X-ray scattering (SAXS) offers a lower-resolution (typically 10–50 Å) but versatile approach for probing overall protein shapes, flexibility, and assemblies in solution, particularly for disordered or heterogeneous systems unsuitable for high-resolution methods. SAXS measures the scattering of X-rays at small angles to derive parameters such as the radius of gyration (R_g) and maximum dimension (D_max), which describe the global architecture; for example, it has been used to model the elongated shape of intrinsically disordered proteins such as α-synuclein. Data analysis often involves ab initio modeling or ensemble optimization to fit scattering profiles, providing information complementary to high-resolution techniques.

Each method has distinct strengths: X-ray crystallography delivers the highest precision for rigid, crystallizable proteins but requires crystals that may trap non-native conformations; NMR uniquely captures solution dynamics and is ideal for small, flexible proteins but struggles with sizes above 50 kDa; cryo-EM is transformative for large, dynamic complexes in near-native states without crystallization, though it demands high sample purity and can suffer from preferred orientations. These techniques are often integrated, for instance by using NMR or SAXS to validate cryo-EM models, to provide a more complete structural picture.

Recent advances in time-resolved methods have enabled visualization of transient protein folding intermediates, bridging static structures with dynamics.
Time-resolved serial femtosecond crystallography (TR-SFX) at X-ray free-electron lasers captures snapshots along a reaction pathway after a trigger such as a temperature jump, as demonstrated in resolving intermediates of a photoreceptor at sub-microsecond timescales. Similarly, time-resolved cryo-EM, using microfluidic mixing devices, has imaged proteins on millisecond timescales, revealing compaction and secondary structure formation. These developments, which have accelerated since 2020, draw in part on AI-assisted analysis to study folding mechanisms in real time.
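The two SAXS shape parameters mentioned above, R_g and D_max, are easy to compute for a known set of coordinates, which is how models are often compared against scattering data. The sketch below is a simplified illustration: it treats every point as having equal weight (real SAXS analysis weights by electron density and includes solvent effects), and both coordinate sets are synthetic arrangements of eight points spaced 3.8 Å apart, not real proteins.

```python
import math


def radius_of_gyration(coords):
    """Root-mean-square distance of points from their centroid (equal weights)."""
    n = len(coords)
    centroid = tuple(sum(p[i] for p in coords) / n for i in range(3))
    return math.sqrt(sum(math.dist(p, centroid) ** 2 for p in coords) / n)


def max_dimension(coords):
    """Largest pairwise distance, a stand-in for D_max."""
    return max(math.dist(a, b) for a in coords for b in coords)


SPACING = 3.8  # angstroms, roughly the distance between consecutive C-alpha atoms

# Synthetic "compact" shape: the 8 corners of a cube.
COMPACT = [(x * SPACING, y * SPACING, z * SPACING)
           for x in (0, 1) for y in (0, 1) for z in (0, 1)]

# Synthetic "extended" shape: 8 points along a straight line.
EXTENDED = [(i * SPACING, 0.0, 0.0) for i in range(8)]

if __name__ == "__main__":
    for name, coords in (("compact", COMPACT), ("extended", EXTENDED)):
        print(f"{name:>8}: Rg = {radius_of_gyration(coords):.2f} A, "
              f"Dmax = {max_dimension(coords):.2f} A")
```

For the same number of points, the extended arrangement has a much larger R_g and D_max than the compact one, which is the kind of contrast SAXS uses to distinguish folded from elongated or disordered states.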

Protein structure resources

Databases

The Protein Data Bank (PDB) serves as the primary global repository for experimentally determined three-dimensional structures of proteins, nucleic acids, and complex assemblies. Established in 1971 at Brookhaven National Laboratory under the leadership of Walter Hamilton, it began with just seven structures and has since grown into a foundational resource for structural biology. The PDB uses the macromolecular Crystallographic Information File (mmCIF) as its standard format, which supports detailed annotations for atomic coordinates, experimental metadata, and validation reports, enabling interoperability with various software tools. As of 2025, the archive contains over 244,000 entries, reflecting annual releases of around 12,000 to 14,000 structures in recent years.

Complementing the PDB are specialized databases that archive data from specific experimental techniques. The Electron Microscopy Data Bank (EMDB), established in 2002, stores three-dimensional density maps derived from electron microscopy reconstructions, including high-resolution cryo-EM volumes of macromolecular complexes and subcellular structures. Similarly, the Biological Magnetic Resonance Bank (BMRB) collects, annotates, and disseminates spectral and quantitative data from nuclear magnetic resonance (NMR) spectroscopy of biological macromolecules, such as chemical shifts and relaxation parameters for proteins and nucleic acids.

To ensure data reliability, deposited structures undergo rigorous validation using specialized tools. MolProbity, for instance, performs all-atom contact analysis to identify steric clashes, Ramachandran outliers, and side-chain rotamer errors, providing clashscores and percentile rankings for quality assessment. WHAT IF offers comprehensive checks on geometry, hydrogen bonding, and packing density, aiding the refinement of models before deposition. These tools are integrated into the deposition pipelines of the PDB and its partners, promoting high standards across the archive.
Access to these databases is facilitated through user-friendly interfaces, programmatic APIs, and visualization software. The RCSB PDB provides RESTful web services and APIs for querying entries by criteria such as experimental method, enabling automated data retrieval for large-scale analyses. Popular visualization tools include PyMOL, an open-source system for rendering atomic models with ray-tracing capabilities, and UCSF Chimera and its successor ChimeraX, which support interactive analysis of structures alongside density maps and trajectories.

The growth of the PDB has accelerated with advances in experimental techniques, and computational predictions such as those from AlphaFold have influenced deposition trends by providing hypotheses that guide and validate new structures without supplanting experimental efforts. The AlphaFold Protein Structure Database (AFDB), hosted by EMBL-EBI, complements experimental resources by offering predicted structures for over 200 million proteins from various organisms, aiding hypothesis generation and filling gaps in experimental data. Despite this expansion, challenges persist, including incomplete coverage of certain protein classes; for example, membrane proteins remain underrepresented because stable samples are difficult to produce, and they comprise less than 5% of PDB entries.
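Programmatic access of the kind described above can be sketched with the Python standard library alone. The example below is a minimal sketch that assumes the RCSB Data API entry endpoint and the RCSB file download host have the URL shapes shown in the constants; it also assumes classic four-character PDB identifiers. The helper names are illustrative, and the structure of the returned JSON is not described here because it should be checked against the current API documentation.

```python
import json
import re
import urllib.request

# Assumed URL patterns for the RCSB Data API and file download service.
ENTRY_API = "https://data.rcsb.org/rest/v1/core/entry/{pdb_id}"
MMCIF_FILE = "https://files.rcsb.org/download/{pdb_id}.cif"

# Classic PDB IDs: one digit followed by three alphanumeric characters.
PDB_ID_PATTERN = re.compile(r"^[0-9][A-Za-z0-9]{3}$")


def normalize_pdb_id(pdb_id):
    """Validate a four-character PDB ID and return it in upper case."""
    if not PDB_ID_PATTERN.match(pdb_id):
        raise ValueError(f"not a four-character PDB ID: {pdb_id!r}")
    return pdb_id.upper()


def entry_url(pdb_id):
    """URL of the entry-level metadata record."""
    return ENTRY_API.format(pdb_id=normalize_pdb_id(pdb_id))


def mmcif_url(pdb_id):
    """URL of the mmCIF coordinate file."""
    return MMCIF_FILE.format(pdb_id=normalize_pdb_id(pdb_id))


def fetch_entry(pdb_id, timeout=30):
    """Download entry metadata as a dict (performs a network request)."""
    with urllib.request.urlopen(entry_url(pdb_id), timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # 1AXC is the PCNA entry used in the lead diagram of this article.
    print(entry_url("1axc"))
    print(mmcif_url("1axc"))
    metadata = fetch_entry("1axc")
    print(sorted(metadata.keys()))
```

Keeping URL construction separate from the network call makes the helpers easy to test offline and to reuse in batch downloads.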

Structural classifications

Structural classifications of proteins organize known three-dimensional structures into hierarchical schemes based on similarities in folding patterns and evolutionary relationships, facilitating the understanding of protein architecture across diverse biological contexts. These systems, such as SCOP and CATH, provide frameworks for grouping protein domains or entire proteins, enabling researchers to identify common structural motifs that often correlate with shared functions or ancestry. By categorizing structures at multiple levels, from broad secondary structure composition to specific evolutionary lineages, these classifications reveal patterns in protein evolution and aid in annotating uncharacterized proteins.

The Structural Classification of Proteins (SCOP) database employs a manually curated hierarchy to classify protein domains according to their structural and evolutionary relationships. At the highest level, proteins are divided into classes based on secondary structure content, such as all-alpha proteins (dominated by alpha-helices, exemplified by globins), all-beta proteins (composed mainly of beta-sheets, as seen in immunoglobulin domains), alpha/beta proteins (alternating alpha-helices and beta-strands, like the Rossmann fold in dehydrogenases), and alpha+beta proteins (segregated alpha and beta regions). Subsequent levels include fold (describing the overall topology without implying homology), superfamily (groups sharing a common evolutionary origin with low sequence similarity but structural conservation), and family (closely related proteins with high sequence identity). This four-tiered structure (class, fold, superfamily, family) extends to protein and species levels for finer granularity, encompassing over 100,000 domains in recent releases. SCOP's manual curation, involving expert visual inspection of structures alongside sequence and functional data, ensures high accuracy in delineating evolutionary links.
In contrast, the Class, Architecture, Topology, and Homologous superfamily (CATH) database focuses exclusively on protein domains and uses a semi-automated approach to generate its hierarchy. The class level mirrors SCOP's, grouping by secondary structure predominance (e.g., mainly alpha, mainly beta, alpha-beta), while architecture describes the gross orientation of secondary structure elements without connectivity details, such as the barrel or sandwich arrangements in beta proteins. Topology (or fold) specifies the connectivity and packing of these elements, and the homologous superfamily level clusters domains with evidence of shared ancestry, often supported by sequence or structural alignments. CATH classifies hundreds of thousands of domains, emphasizing domain-level granularity over whole proteins. Unlike SCOP's predominantly manual process, CATH integrates automated clustering algorithms with human oversight, allowing for scalable updates and reducing subjectivity in topology assignments.

Key differences between SCOP and CATH arise from their methodologies and scopes: SCOP prioritizes evolutionary inference through manual integration of structural, sequence, and functional evidence, resulting in a more conservative classification, whereas CATH's domain-centric, semi-automated pipeline enables broader coverage and faster incorporation of new structures, though it may introduce minor discrepancies in superfamily assignments due to algorithmic thresholds. Both systems serve complementary roles, with SCOP favored for detailed evolutionary studies and CATH for high-throughput domain analysis.

These classifications underpin applications in evolutionary inference, where superfamily groupings highlight descent from common ancestors despite sequence divergence, as seen in folds shared by enzymes from prokaryotes to eukaryotes.
They also enable function prediction by leveraging structural similarity; for instance, assigning a domain to a known superfamily can suggest catalytic roles based on conserved active sites, improving accuracy in annotation projects. Since 2020, CATH has expanded significantly by incorporating AI-predicted structures from AlphaFold, adding over 150 million domains from 21 model organisms to enhance coverage of understudied superfamilies and support variant interpretation in disease research. SCOP updates, through its extended version SCOPe, have similarly increased structural coverage to nearly all superfamilies, though with less emphasis on predicted models in order to maintain reliance on experimental data.
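The four-level CATH hierarchy lends itself to a simple data structure, since each domain is assigned a dotted identifier with one number per level. The sketch below is an illustration only: it assumes identifiers of the form `C.A.T.H`, maps just the first number to a class name, and measures how many leading levels two domains share. The identifiers in the usage section are example inputs in that dotted format, and the function names are hypothetical.

```python
from typing import NamedTuple

# Class names for the first level of a CATH identifier.
CATH_CLASSES = {
    1: "Mainly Alpha",
    2: "Mainly Beta",
    3: "Alpha Beta",
    4: "Few Secondary Structures",
}

LEVEL_NAMES = ("class", "architecture", "topology", "homologous superfamily")


class CathNode(NamedTuple):
    cath_class: int
    architecture: int
    topology: int
    homologous_superfamily: int


def parse_cath_id(cath_id):
    """Split a dotted C.A.T.H identifier into its four integer levels."""
    parts = cath_id.split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected four dot-separated integers: {cath_id!r}")
    return CathNode(*(int(p) for p in parts))


def shared_depth(id_a, id_b):
    """Number of leading hierarchy levels two identifiers have in common (0 to 4)."""
    depth = 0
    for a, b in zip(parse_cath_id(id_a), parse_cath_id(id_b)):
        if a != b:
            break
        depth += 1
    return depth


def describe_relationship(id_a, id_b):
    """Name the deepest level shared by two domains."""
    depth = shared_depth(id_a, id_b)
    return "no shared level" if depth == 0 else f"same {LEVEL_NAMES[depth - 1]}"


if __name__ == "__main__":
    node = parse_cath_id("3.40.50.720")
    print(node, "->", CATH_CLASSES[node.cath_class])
    print(describe_relationship("3.40.50.720", "3.40.50.300"))
    print(describe_relationship("3.40.50.720", "1.10.490.10"))
```

Two domains that agree through the third level share a topology but belong to different homologous superfamilies, which is the point at which structural similarity stops implying common ancestry.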

Computational prediction of protein structure

Template-based methods

Template-based methods, also known as comparative or homology modeling, predict the three-dimensional structure of a target protein by leveraging structural templates from evolutionarily related proteins with known structures. This approach assumes that homologous proteins share similar folds, allowing structural information to be transferred from templates to the target sequence. The method is particularly effective when the target shares significant sequence similarity with existing structures in databases such as the Protein Data Bank (PDB).

The homology modeling pipeline typically begins with template selection, in which the target sequence is searched against structural databases to identify suitable templates. Tools such as PSI-BLAST, which uses position-specific scoring matrices to detect distant homologs, or HHpred, which compares profile hidden Markov models (HMMs) for sensitive homology detection, are commonly used for this step. PSI-BLAST iteratively refines searches to capture weak similarities, while HHpred excels at aligning query and template profiles to identify remote homologs with low sequence identity. Once templates are selected, sequence alignment is performed to map the target residues onto the template backbone, often using tools such as Clustal Omega or the alignment modules in modeling software. Model building follows: atomic coordinates are derived by copying conserved regions from the template and modeling variable loops and side chains, typically by satisfying spatial restraints derived from the alignment and from statistical potentials. Refinement then optimizes the model through energy minimization or molecular dynamics to resolve clashes and improve stereochemistry.

For cases with low sequence similarity (<30%), threading methods extend homology modeling by aligning the target sequence to fold templates without relying on high sequence identity.
Threading evaluates the compatibility of the target sequence with template structures using energy-based potentials that consider burial, secondary structure propensity, and pairwise interactions, often ranking alignments by a threading score. Seminal work demonstrated that threading can successfully recognize protein folds by optimizing sequence-structure fitness, even for proteins with sequence identities as low as 10–20%.

The accuracy of template-based models correlates strongly with the sequence identity between target and template; models with >30% identity typically achieve backbone root-mean-square deviation (RMSD) values of roughly 1–2 Å from the native structure, enabling reliable prediction of core folds. Below 30% identity, accuracy declines, with RMSD often exceeding 3 Å due to alignment errors and loop inaccuracies, as established in analyses of homologous protein pairs.

Widely adopted tools for homology modeling include MODELLER, which implements restraint-based modeling to generate and refine structures from alignments, and SWISS-MODEL, a fully automated server that integrates template search, alignment, and quality assessment for high-throughput predictions. These tools have been benchmarked in community experiments such as CASP, where they perform well for targets with detectable templates.

A key limitation of template-based methods is their dependence on available templates; they perform poorly for proteins with novel folds not represented in structural databases, where no suitable homologs exist. In contrast to de novo methods, which build structures from physicochemical principles without templates, homology modeling requires evolutionary relatedness for success. Models are validated by comparing predicted structures to native ones (if available) using RMSD on Cα atoms, where values <2 Å indicate high fidelity, and by stereochemical checks such as Ramachandran plot analysis to ensure backbone dihedral angles fall within allowed regions.
Additional metrics like global distance test (GDT) scores and energy profiles further assess overall quality.
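The two validation metrics named above behave differently, which a small calculation makes concrete. The sketch below is a simplified illustration: it assumes the model and native Cα coordinates are already optimally superposed and paired residue by residue, and the GDT-style score uses that single fixed superposition, whereas the real GDT_TS procedure searches over many superpositions. The coordinates are synthetic and the function names are illustrative.

```python
import math


def ca_rmsd(model, native):
    """RMSD over paired C-alpha coordinates that are already superposed."""
    if len(model) != len(native) or not model:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(model, native))
    return math.sqrt(squared / len(model))


def gdt_ts_fixed(model, native, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """GDT_TS-style score (0 to 100) for one fixed superposition.

    Averages, over the cutoffs, the fraction of residues whose C-alpha
    deviation is within each cutoff (in angstroms).
    """
    if len(model) != len(native) or not model:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    deviations = [math.dist(a, b) for a, b in zip(model, native)]
    fractions = [sum(d <= c for d in deviations) / len(deviations) for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


# Synthetic four-residue example: the model deviates by 0.5, 1.5, 3.0 and 9.0 angstroms.
NATIVE = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
MODEL = [(0.0, 0.5, 0.0), (3.8, 1.5, 0.0), (7.6, 3.0, 0.0), (11.4, 9.0, 0.0)]

if __name__ == "__main__":
    print(f"RMSD   = {ca_rmsd(MODEL, NATIVE):.2f} A")
    print(f"GDT_TS = {gdt_ts_fixed(MODEL, NATIVE):.2f}")
```

In this example a single badly placed residue pushes the RMSD well above 2 Å even though most residues are close to the native positions, while the GDT-style score still credits the well-modeled part. That tolerance to local outliers is one reason GDT scores are reported alongside RMSD.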

De novo and AI-driven prediction

De novo protein structure prediction, also known as ab initio prediction, aims to determine a protein's three-dimensional structure solely from its amino acid sequence without relying on known homologs or templates. These methods typically assemble short structural fragments derived from sequence patterns and refine them through energy minimization to identify low-energy conformations. A prominent example is the Rosetta protocol, which uses fragment assembly followed by Monte Carlo sampling and energy-based optimization to generate plausible folds. Rosetta employs empirical potential functions, such as Lennard-Jones terms for van der Waals interactions and statistical potentials derived from known structures, to score and minimize the energy of assembled models.

Physics-based approaches complement fragment assembly by simulating the folding process with molecular dynamics (MD), which models atomic interactions using classical force fields to trace folding trajectories. These simulations capture thermodynamic effects such as entropy-driven collapse and stabilization by hydrogen bonding, providing insights into folding pathways for small proteins. For instance, all-atom MD has successfully folded peptides and miniproteins in microsecond-scale simulations, revealing funnel-like energy landscapes that guide chains toward the native state. However, full-scale MD for larger proteins remains computationally intensive because of the timescales involved, often requiring enhanced sampling techniques such as replica-exchange MD.

The advent of artificial intelligence has revolutionized de novo prediction, with deep learning models leveraging multiple sequence alignments (MSAs) to infer evolutionary constraints and structural propensities. AlphaFold 2, developed by DeepMind, marked a breakthrough in the 2020 CASP14 competition, achieving unprecedented accuracy by using attention-based neural networks to predict residue-residue distances and angles directly from sequence data.
This end-to-end approach bypasses traditional intermediate steps such as fragment threading, instead training on large structural databases to output atomic models with per-residue confidence scores (pLDDT). Similarly, RoseTTAFold from the Baker lab introduced a three-track neural network architecture that processes sequence, 2D distance maps, and 3D coordinates in parallel, enabling rapid predictions comparable to AlphaFold 2 for single chains and complexes. Building on this, DeepMind released AlphaFold 3 in May 2024, which employs a diffusion-based architecture to predict joint structures of biomolecular complexes, including interactions with DNA, RNA, ligands, and modifications, substantially advancing applications in drug discovery and biology. The AlphaFold work earned Demis Hassabis and John Jumper a share of the 2024 Nobel Prize in Chemistry for protein structure prediction.

Accuracy benchmarks highlight the impact of these AI methods; in CASP14, AlphaFold 2 attained a median Global Distance Test-Total Score (GDT-TS) of 92.4, surpassing human expert levels for many targets and enabling near-atomic accuracy (RMSD < 1 Å) for proteins up to 400 residues. These tools have expanded coverage of the "dark proteome" (regions lacking experimental structures), with AlphaFold predicting high-confidence models (pLDDT > 90) for about 37% of residues in structurally uncharacterized domains. In CASP16 (2024), top AI-driven methods, including variants of AlphaFold 3, achieved median GDT-TS scores of around 85–95 for monomers and multimers and showed further gains, indicating that many protein structure prediction challenges are close to solved. Hybrid approaches occasionally incorporate sparse template information from databases to refine novel folds, but AI-driven methods dominate for orphan proteins. The AlphaFold Protein Structure Database was updated in 2024 to include predictions for over 200 million proteins across eukaryotes, bacteria, and archaea, enhancing global accessibility.
Looking ahead, integrating AI predictions with dynamics simulations promises more realistic models that account for conformational flexibility beyond static structures. Emerging frameworks use AlphaFold-like outputs as starting points for MD simulations to generate Boltzmann-distributed ensembles, aiding the study of flexible targets such as enzymes. This synergy could address limitations in capturing transient states, with ongoing efforts focusing on generative AI for sampling diverse conformations efficiently.
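The pLDDT confidence scores mentioned in this section can be summarized per model with a few lines of code. AlphaFold models distributed in PDB format store pLDDT in the B-factor column of each atom record, so the sketch below reads that column for Cα atoms and reports the fraction of residues in each confidence band. It is a minimal illustration: the band boundaries (90, 70, 50) follow the commonly used AlphaFold convention, the input lines are synthetic stand-ins generated by a helper rather than a downloaded file, and all function names are hypothetical.

```python
def make_ca_line(serial, resname, resseq, plddt, x=0.0, y=0.0, z=0.0):
    """Build a synthetic fixed-column PDB ATOM record for a C-alpha atom."""
    occupancy = 1.0
    return (f"ATOM  {serial:5d}  CA  {resname:>3s} A{resseq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{occupancy:6.2f}{plddt:6.2f}")


def ca_plddt(pdb_lines):
    """Read per-residue pLDDT from the B-factor column (columns 61-66) of CA atoms."""
    scores = []
    for line in pdb_lines:
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    return scores


def band_fractions(scores):
    """Fraction of residues in each pLDDT confidence band."""
    if not scores:
        raise ValueError("no pLDDT scores found")
    counts = {"very_high": 0, "confident": 0, "low": 0, "very_low": 0}
    for score in scores:
        if score > 90:
            counts["very_high"] += 1
        elif score > 70:
            counts["confident"] += 1
        elif score > 50:
            counts["low"] += 1
        else:
            counts["very_low"] += 1
    return {band: count / len(scores) for band, count in counts.items()}


# Synthetic four-residue model with made-up confidence values.
EXAMPLE_LINES = [
    make_ca_line(1, "MET", 1, 95.2),
    make_ca_line(2, "LYS", 2, 91.0),
    make_ca_line(3, "GLY", 3, 72.5),
    make_ca_line(4, "SER", 4, 48.3),
]

if __name__ == "__main__":
    scores = ca_plddt(EXAMPLE_LINES)
    print(scores)
    print(band_fractions(scores))
```

For a real model, the same two functions could be applied to the lines of a downloaded PDB-format file; low-confidence stretches often correspond to flexible or disordered regions and are usually trimmed before a prediction is used as a starting structure for simulation.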
