Protein design
from Wikipedia

Protein design is the rational design of new protein molecules to engineer novel activity, behavior, or purpose, and to advance basic understanding of protein function.[1] Proteins can be designed from scratch (de novo design) or by making calculated variants of a known protein structure and its sequence (termed protein redesign). Rational protein design approaches predict protein sequences that will fold to specific structures. These predicted sequences can then be validated experimentally through methods such as peptide synthesis, site-directed mutagenesis, or artificial gene synthesis.

Rational protein design dates back to the mid-1970s.[2] In recent decades, however, there have been numerous examples of successful rational design of water-soluble and even transmembrane peptides and proteins, owing in part to a better understanding of the factors contributing to protein structure stability and to the development of better computational methods.

Overview and history


The goal in rational protein design is to predict amino acid sequences that will fold to a specific protein structure. Although the number of possible protein sequences is vast, growing exponentially with the size of the protein chain, only a subset of them will fold reliably and quickly to one native state. Protein design involves identifying novel sequences within this subset. The native state of a protein is the conformational free energy minimum for the chain. Thus, protein design is the search for sequences that have the chosen structure as a free energy minimum. In a sense, it is the reverse of protein structure prediction. In design, a tertiary structure is specified, and a sequence that will fold to it is identified. Hence, it is also termed inverse folding. Protein design is then an optimization problem: using some scoring criteria, an optimized sequence that will fold to the desired structure is chosen.

When the first proteins were rationally designed during the 1970s and 1980s, their sequences were optimized manually based on analyses of other known proteins, sequence composition, amino acid charges, and the geometry of the desired structure.[2] The first designed proteins are attributed to Bernd Gutte, who designed a reduced version of a known catalyst, bovine ribonuclease, as well as tertiary structures consisting of beta-sheets and alpha-helices, including a binder of DDT. Urry and colleagues later designed elastin-like fibrous peptides based on rules of sequence composition. Richardson and coworkers designed a 79-residue protein with no sequence homology to any known protein.[2] In the 1990s, the advent of powerful computers, libraries of amino acid conformations, and force fields developed mainly for molecular dynamics simulations enabled the development of structure-based computational protein design tools. Following the development of these computational tools, great success has been achieved over the last 30 years in protein design. The first protein designed completely de novo was made by Stephen Mayo and coworkers in 1997,[3] and, shortly after, in 1999 Peter S. Kim and coworkers designed dimers, trimers, and tetramers of unnatural right-handed coiled coils.[4][5] In 2003, David Baker's laboratory designed a full protein to a fold never seen before in nature.[6] Later, in 2008, Baker's group computationally designed enzymes for two different reactions.[7] In 2010, one of the most powerful broadly neutralizing antibodies was isolated from patient serum using a computationally designed protein probe.[8] In 2024, Baker received one half of the Nobel Prize in Chemistry for his advancement of computational protein design, with the other half shared by Demis Hassabis and John Jumper of DeepMind for protein structure prediction.[9] Due to these and other successes (e.g., see examples below), protein design has become one of the most important tools available for protein engineering. There is great hope that the design of new proteins, small and large, will have uses in biomedicine and bioengineering.

Underlying models of protein structure and function


Protein design programs use computer models of the molecular forces that drive proteins in their in vivo environments. To make the problem tractable, these forces are simplified by protein design models. Although protein design programs vary greatly, they all have to address four main modeling questions: what is the target structure of the design, what flexibility is allowed on the target structure, which sequences are included in the search, and which force field will be used to score sequences and structures.

Target structure

The Top7 protein was one of the first proteins designed for a fold that had never been seen before in nature[6]

Protein function is heavily dependent on protein structure, and rational protein design uses this relationship to design function by designing proteins that have a target structure or fold. Thus, by definition, in rational protein design the target structure or ensemble of structures must be known beforehand. This contrasts with other forms of protein engineering, such as directed evolution, where a variety of methods are used to find proteins that achieve a specific function, and with protein structure prediction where the sequence is known, but the structure is unknown.

Most often, the target structure is based on a known structure of another protein. However, novel folds not seen in nature have become increasingly possible to design. Peter S. Kim and coworkers designed trimers and tetramers of unnatural coiled coils, which had not been seen before in nature.[4][5] The protein Top7, developed in David Baker's lab, was designed entirely by protein design algorithms, to a completely novel fold.[6] More recently, Baker and coworkers developed a series of principles to design ideal globular-protein structures based on protein folding funnels that bridge between secondary structure prediction and tertiary structures. These principles, which build on both protein structure prediction and protein design, were used to design five different novel protein topologies.[10]

Sequence space

FSD-1 (shown in blue, PDB id: 1FSV) was the first de novo computational design of a full protein.[3] The target fold was that of the zinc finger in residues 33–60 of the structure of protein Zif268 (shown in red, PDB id: 1ZAA). The designed sequence had very little sequence identity with any known protein sequence.

In rational protein design, proteins can be redesigned from the sequence and structure of a known protein, or designed completely from scratch in de novo protein design. In protein redesign, most of the residues in the sequence are kept as their wild-type amino acids while a few are allowed to mutate. In de novo design, the entire sequence is designed anew, based on no prior sequence.

Both de novo designs and protein redesigns can establish rules on the sequence space: the specific amino acids that are allowed at each mutable residue position. For example, the composition of the surface of the RSC3 probe to select HIV broadly neutralizing antibodies was restricted based on evolutionary data and charge balancing. Many of the earliest attempts at protein design were heavily based on empirical rules on the sequence space.[2] Moreover, the design of fibrous proteins usually follows strict rules on the sequence space. Collagen-based designed proteins, for example, are often composed of Gly-Pro-X repeating patterns.[2] The advent of computational techniques allows designing proteins with no human intervention in sequence selection.[3]

Structural flexibility

Common protein design programs use rotamer libraries to simplify the conformational space of protein side chains. This animation loops through all the rotamers of the isoleucine amino acid based on the Penultimate Rotamer Library (total of 7 rotamers).[11]

In protein design, the target structure (or structures) of the protein are known. However, a rational protein design approach must model some flexibility on the target structure in order to increase the number of sequences that can be designed for that structure and to minimize the chance of a sequence folding to a different structure. For example, in a protein redesign of one small amino acid (such as alanine) in the tightly packed core of a protein, very few mutants would be predicted by a rational design approach to fold to the target structure, if the surrounding side-chains are not allowed to be repacked.

Thus, an essential parameter of any design process is the amount of flexibility allowed for both the side-chains and the backbone. In the simplest models, the protein backbone is kept rigid while some of the protein side-chains are allowed to change conformations. However, side-chains can have many degrees of freedom in their bond lengths, bond angles, and χ dihedral angles. To simplify this space, protein design methods use rotamer libraries that assume ideal values for bond lengths and bond angles, while restricting χ dihedral angles to a few frequently observed low-energy conformations termed rotamers.

Rotamer libraries are derived from the statistical analysis of many protein structures. Backbone-independent rotamer libraries describe all rotamers regardless of backbone context.[11] Backbone-dependent rotamer libraries, in contrast, describe how likely each rotamer is to appear depending on the protein backbone arrangement around the side chain.[12] Most protein design programs use a single conformation (e.g., the modal value of the rotamer dihedrals) or several points in the region described by the rotamer; the OSPREY protein design program, in contrast, models the entire continuous region.[13]

Although rational protein design must preserve the general backbone fold of a protein, allowing some backbone flexibility can significantly increase the number of sequences that fold to the target structure while maintaining its general fold.[14] Backbone flexibility is especially important in protein redesign because sequence mutations often result in small changes to the backbone structure. Moreover, backbone flexibility can be essential for more advanced applications of protein design, such as binding prediction and enzyme design. Some models of backbone flexibility in protein design include small and continuous global backbone movements, discrete backbone samples around the target fold, backrub motions, and protein loop flexibility.[14][15]

Energy function

Comparison of various potential energy functions. The most accurate energy functions are those that use quantum mechanical calculations, but these are too slow for protein design. At the other extreme, heuristic energy functions are based on statistical terms and are very fast. In the middle are molecular mechanics energy functions, which are physically based but not as computationally expensive as quantum mechanical simulations.[16]

Rational protein design techniques must be able to discriminate sequences that will be stable under the target fold from those that would prefer other low-energy competing states. Thus, protein design requires accurate energy functions that can rank and score sequences by how well they fold to the target structure. At the same time, however, these energy functions must consider the computational challenges behind protein design. One of the most challenging requirements for successful design is an energy function that is both accurate and simple for computational calculations.

The most accurate energy functions are those based on quantum mechanical simulations. However, such simulations are too slow and typically impractical for protein design. Instead, many protein design algorithms use either physics-based energy functions adapted from molecular mechanics simulation programs, knowledge-based energy functions, or a hybrid mix of both. The trend has been toward using more physics-based potential energy functions.[16]

Physics-based energy functions, such as AMBER and CHARMM, are typically derived from quantum mechanical simulations and from experimental data from thermodynamics, crystallography, and spectroscopy.[17] These energy functions typically simplify physical energy functions and make them pairwise decomposable, meaning that the total energy of a protein conformation can be calculated by adding the pairwise energy between each atom pair; this makes them attractive for optimization algorithms. Physics-based energy functions typically model an attractive-repulsive Lennard-Jones term between atoms and a pairwise Coulombic electrostatics term[18] between non-bonded atoms.

Water-mediated hydrogen bonds play a key role in protein–protein binding. One such interaction is shown between residues D457, S365 in the heavy chain of the HIV-broadly-neutralizing antibody VRC01 (green) and residues N58 and Y59 in the HIV envelope protein GP120 (purple).[19]

Statistical potentials, in contrast to physics-based potentials, have the advantage of being fast to compute, of implicitly accounting for complex effects, and of being less sensitive to small changes in the protein structure.[20] These energy functions are based on deriving energy values from the frequency of appearance in a structural database.

Protein design, however, has requirements that molecular mechanics force fields do not always meet. Molecular mechanics force fields, which have been used mostly in molecular dynamics simulations, are optimized for the simulation of single sequences, but protein design searches through many conformations of many sequences. Thus, molecular mechanics force fields must be tailored for protein design. In practice, protein design energy functions often incorporate both statistical terms and physics-based terms. For example, the Rosetta energy function, one of the most widely used energy functions, incorporates physics-based energy terms originating in the CHARMM energy function, as well as statistical energy terms, such as rotamer probability and knowledge-based electrostatics. Typically, energy functions are highly customized between laboratories and specifically tailored for every design.[17]

Challenges for effective design energy functions


Water makes up most of the molecules surrounding proteins and is the main driver of protein structure. Thus, modeling the interaction between water and protein is vital in protein design. The number of water molecules that interact with a protein at any given time is huge, and each one has a large number of degrees of freedom and interaction partners. Instead of modeling each molecule explicitly, protein design programs model most such water molecules as a continuum, capturing both the hydrophobic effect and solvation polarization.[17]

Individual water molecules can sometimes have a crucial structural role in the core of proteins, and in protein–protein or protein–ligand interactions. Failing to model such waters can result in mispredictions of the optimal sequence of a protein–protein interface. As an alternative, water molecules can be added to rotamers.[17]


As an optimization problem

This animation illustrates the complexity of a protein design search, which typically compares all the rotamer conformations from all possible mutations at all residues. In this example, the residues Phe36 and His106 are allowed to mutate to, respectively, the amino acids Tyr and Asn. Phe and Tyr have 4 rotamers each in the rotamer library, while Asn and His have 7 and 8 rotamers, respectively (from Richardson's penultimate rotamer library[11]). The animation loops through all (4 + 4) × (7 + 8) = 120 possibilities. The structure shown is that of myoglobin, PDB id: 1mbn.

The goal of protein design is to find a protein sequence that will fold to a target structure. A protein design algorithm must, thus, search all the conformations of each sequence, with respect to the target fold, and rank sequences according to the lowest-energy conformation of each one, as determined by the protein design energy function. Thus, a typical input to the protein design algorithm is the target fold, the sequence space, the structural flexibility, and the energy function, while the output is one or more sequences that are predicted to fold stably to the target structure.

The number of candidate protein sequences, however, grows exponentially with the number of protein residues; for example, there are 20^100 protein sequences of length 100. Furthermore, even if amino acid side-chain conformations are limited to a few rotamers (see Structural flexibility), this results in an exponential number of conformations for each sequence. Thus, for our 100-residue protein, and assuming that each amino acid has exactly 10 rotamers, a search algorithm that searches this space will have to search over 200^100 protein conformations.
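These counts are easy to reproduce. The short sketch below uses the text's simplifying assumption of exactly 10 rotamers per amino acid and prints the order of magnitude of both spaces:

```python
# Scale of the protein design search space for a 100-residue chain.
n_residues = 100
n_amino_acids = 20
n_rotamers_per_aa = 10  # simplifying assumption from the text

n_sequences = n_amino_acids ** n_residues                            # 20^100
n_conformations = (n_amino_acids * n_rotamers_per_aa) ** n_residues  # 200^100

# Both numbers are astronomical: roughly 10^130 and 10^230 respectively.
print(len(str(n_sequences)) - 1, len(str(n_conformations)) - 1)
```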

The most common energy functions can be decomposed into pairwise terms between rotamers and amino acid types, which casts the problem as a combinatorial one, and powerful optimization algorithms can be used to solve it. In those cases, the total energy of each conformation belonging to each sequence can be formulated as a sum of individual and pairwise terms between residue positions. If a designer is interested only in the best sequence, the protein design algorithm only requires the lowest-energy conformation of the lowest-energy sequence. In these cases, the amino acid identity of each rotamer can be ignored and all rotamers belonging to different amino acids can be treated the same. Let ri be a rotamer at residue position i in the protein chain, and E(ri) the potential energy between the internal atoms of the rotamer. Let E(ri, rj) be the potential energy between ri and rotamer rj at residue position j. Then, we define the optimization problem as one of finding the conformation of minimum energy (ET):

  E_T = \min_{r_1, \ldots, r_n} \Big( \sum_i E(r_i) + \sum_i \sum_{j > i} E(r_i, r_j) \Big)    (1)

The problem of minimizing ET is an NP-hard problem.[15][21][22] Even though the class of problems is NP-hard, in practice many instances of protein design can be solved exactly or optimized satisfactorily through heuristic methods.
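On toy instances, the pairwise decomposition can be solved by exhaustive enumeration. The sketch below uses hypothetical energy values for a three-residue problem and enumerates every rotamer assignment to find the minimum-energy conformation (the GMEC); real instances are far too large for this, which is why the algorithms below exist:

```python
import itertools

# Hypothetical singleton energies E(r_i): one entry per rotamer at each position.
E_single = [
    [1.0, 0.5],        # residue 0: two rotamers
    [0.2, 0.8, 0.3],   # residue 1: three rotamers
    [0.6, 0.1],        # residue 2: two rotamers
]

# Hypothetical pairwise energies E(r_i, r_j), keyed by residue pair (i < j).
E_pair = {
    (0, 1): [[0.0, 0.4, 0.1], [0.3, 0.0, 0.2]],
    (0, 2): [[0.2, 0.0], [0.0, 0.5]],
    (1, 2): [[0.1, 0.0], [0.0, 0.3], [0.2, 0.1]],
}

def total_energy(conf):
    """E_T = sum_i E(r_i) + sum_{i<j} E(r_i, r_j)."""
    e = sum(E_single[i][r] for i, r in enumerate(conf))
    for (i, j), table in E_pair.items():
        e += table[conf[i]][conf[j]]
    return e

# Enumerate every rotamer assignment and keep the minimum-energy one (the GMEC).
choices = [range(len(rotamers)) for rotamers in E_single]
gmec = min(itertools.product(*choices), key=total_energy)
print(gmec, round(total_energy(gmec), 3))
```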

Algorithms


Several algorithms have been developed specifically for the protein design problem. These algorithms can be divided into two broad classes: exact algorithms, such as dead-end elimination, that lack runtime guarantees but guarantee the quality of the solution; and heuristic algorithms, such as Monte Carlo, that are faster than exact algorithms but have no guarantees on the optimality of the results. Exact algorithms guarantee that the optimization process produces the optimal solution according to the protein design model. Thus, if the predictions of exact algorithms fail when they are experimentally validated, the source of error can be attributed to the energy function, the allowed flexibility, the sequence space, or the target structure (e.g., if it cannot be designed for).[23]

Some protein design algorithms are listed below. Although these algorithms address only the most basic formulation of the protein design problem, Equation (1), many of the extensions that improve protein design modeling, such as greater structural flexibility (e.g., protein backbone flexibility) or sophisticated energy terms, are built atop these algorithms, changing the optimization goal in the process. For example, Rosetta Design incorporates sophisticated energy terms and backbone flexibility using Monte Carlo as the underlying optimization algorithm. OSPREY's algorithms build on the dead-end elimination algorithm and A* to incorporate continuous backbone and side-chain movements. Thus, these algorithms provide a good perspective on the different kinds of algorithms available for protein design.

In 2020, scientists reported the development of an AI-based process that uses genome databases for evolution-based design of novel proteins. They used deep learning to identify design rules.[24][25] In 2022, a study reported deep learning software that can design proteins containing pre-specified functional sites.[26][27]

With mathematical guarantees


Dead-end elimination


The dead-end elimination (DEE) algorithm reduces the search space of the problem iteratively by removing rotamers that can be provably shown not to be part of the global minimum energy conformation (GMEC). On each iteration, the dead-end elimination algorithm compares all possible pairs of rotamers at each residue position, and removes each rotamer r′i that can be shown to always be of higher energy than another rotamer ri and is thus not part of the GMEC:

  E(r'_i) + \sum_{j \neq i} \min_{r_j} E(r'_i, r_j) \; > \; E(r_i) + \sum_{j \neq i} \max_{r_j} E(r_i, r_j)

Other powerful extensions to the dead-end elimination algorithm include the pairs elimination criterion, and the generalized dead-end elimination criterion. This algorithm has also been extended to handle continuous rotamers with provable guarantees.

Although the dead-end elimination algorithm runs in polynomial time on each iteration, it cannot guarantee convergence. If, after a certain number of iterations, the dead-end elimination algorithm does not prune any more rotamers, then either rotamers have to be merged or another search algorithm must be used to search the remaining search space. In such cases, dead-end elimination acts as a pre-filtering algorithm to reduce the search space, while other algorithms, such as A*, Monte Carlo, linear programming, or FASTER, are used to search the remaining search space.[15]
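A minimal sketch of the classic singles elimination criterion follows; the energy values are hypothetical, and real implementations also apply the pairs and generalized criteria. A rotamer is pruned when its best case is still worse than a competitor's worst case:

```python
def pair_table(E_pair, i, j):
    """Pairwise energies indexed as [rotamer_i][rotamer_j], whichever way the
    pair is keyed in the dictionary."""
    if (i, j) in E_pair:
        return E_pair[(i, j)]
    t = E_pair[(j, i)]  # stored transposed: flip it
    return [[t[rj][ri] for rj in range(len(t))] for ri in range(len(t[0]))]

def dee_prune(E_single, E_pair):
    """Iterate the singles DEE criterion: prune rotamer r' at position i if a
    competitor r satisfies
        E(r') + sum_j min_rj E(r', rj) > E(r) + sum_j max_rj E(r, rj),
    which proves r' cannot be in the GMEC. Returns surviving rotamers."""
    n_pos = len(E_single)
    alive = {i: set(range(len(E_single[i]))) for i in range(n_pos)}
    changed = True
    while changed:
        changed = False
        for i in range(n_pos):
            for rp in list(alive[i]):
                for r in alive[i]:
                    if r == rp:
                        continue
                    lhs, rhs = E_single[i][rp], E_single[i][r]
                    for j in range(n_pos):
                        if j == i:
                            continue
                        table = pair_table(E_pair, i, j)
                        lhs += min(table[rp][rj] for rj in alive[j])
                        rhs += max(table[r][rj] for rj in alive[j])
                    if lhs > rhs:
                        alive[i].discard(rp)  # rp is a dead end
                        changed = True
                        break
    return alive

# Hypothetical two-residue example: rotamer 1 at each position is provably
# too high in energy to be part of the GMEC, so DEE prunes both.
E_single = [[0.0, 5.0], [0.0, 0.2]]
E_pair = {(0, 1): [[0.0, 0.1], [0.3, 0.2]]}
surviving = dee_prune(E_single, E_pair)
print(surviving)
```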

Branch and bound


The protein design conformational space can be represented as a tree, where the protein residues are ordered in an arbitrary way, and the tree branches at each of the rotamers in a residue. Branch and bound algorithms use this representation to efficiently explore the conformation tree: At each branching, branch and bound algorithms bound the conformation space and explore only the promising branches.[15][28][29]

A popular search algorithm for protein design is the A* search algorithm.[15][29] A* computes a lower-bound score on each partial tree path that lower bounds (with guarantees) the energy of each of the expanded rotamers. Each partial conformation is added to a priority queue, and at each iteration the partial path with the lowest lower bound is popped from the queue and expanded. The algorithm stops once a full conformation has been enumerated, and guarantees that this conformation is optimal.

The A* score f in protein design consists of two parts, f = g + h. g is the exact energy of the rotamers that have already been assigned in the partial conformation. h is a lower bound on the energy of the rotamers that have not yet been assigned. Each is defined as follows, where d is the index of the last assigned residue in the partial conformation:

  g = \sum_{i=1}^{d} \Big( E(r_i) + \sum_{j=1}^{i-1} E(r_i, r_j) \Big)

  h = \sum_{i=d+1}^{n} \min_{r_i} \Big( E(r_i) + \sum_{j=1}^{d} E(r_i, r_j) + \sum_{j=d+1}^{i-1} \min_{r_j} E(r_i, r_j) \Big)
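This scheme can be sketched directly with a priority queue. The example below expands residues in chain order over a small hypothetical instance; g is the exact energy of the assigned prefix and h is the admissible bound on the rest, so the first complete conformation popped is the GMEC:

```python
import heapq

def astar_gmec(E_single, E_pair):
    """A* over the conformation tree with f = g + h."""
    n_pos = len(E_single)

    def g(partial):
        # Exact energy of the assigned rotamers.
        d = len(partial)
        return (sum(E_single[i][partial[i]] for i in range(d))
                + sum(E_pair[(i, j)][partial[i]][partial[j]]
                      for i in range(d) for j in range(i + 1, d)))

    def h(partial):
        # Admissible lower bound on the energy of the unassigned residues.
        d = len(partial)
        total = 0.0
        for i in range(d, n_pos):
            total += min(
                E_single[i][ri]
                + sum(E_pair[(j, i)][partial[j]][ri] for j in range(d))
                + sum(min(E_pair[(j, i)][rj][ri]
                          for rj in range(len(E_single[j])))
                      for j in range(d, i))
                for ri in range(len(E_single[i])))
        return total

    queue = [(h(()), ())]
    while queue:
        f, partial = heapq.heappop(queue)
        if len(partial) == n_pos:
            return partial, f  # first full conformation popped is optimal
        i = len(partial)
        for ri in range(len(E_single[i])):
            child = partial + (ri,)
            heapq.heappush(queue, (g(child) + h(child), child))

# Hypothetical three-residue instance, pairs keyed (i, j) with i < j.
E_single = [[1.0, 0.5], [0.2, 0.8, 0.3], [0.6, 0.1]]
E_pair = {
    (0, 1): [[0.0, 0.4, 0.1], [0.3, 0.0, 0.2]],
    (0, 2): [[0.2, 0.0], [0.0, 0.5]],
    (1, 2): [[0.1, 0.0], [0.0, 0.3], [0.2, 0.1]],
}
conf, energy = astar_gmec(E_single, E_pair)
print(conf, round(energy, 3))
```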

Integer linear programming


The problem of optimizing ET (Equation (1)) can be easily formulated as an integer linear program (ILP).[30] One of the most powerful formulations uses binary variables to represent the presence of a rotamer and edges in the final solution, and constrains the solution to have exactly one rotamer for each residue and one pairwise interaction for each pair of residues:

  \min \sum_{i} \sum_{r_i} E(r_i)\, q_{r_i} + \sum_{i} \sum_{j > i} \sum_{r_i} \sum_{r_j} E(r_i, r_j)\, q_{r_i r_j}

  s.t.  \sum_{r_i} q_{r_i} = 1 \quad \forall i

        \sum_{r_j} q_{r_i r_j} = q_{r_i} \quad \forall i, \; \forall j \neq i, \; \forall r_i

        q_{r_i}, q_{r_i r_j} \in \{0, 1\}

ILP solvers, such as CPLEX, can compute the exact optimal solution for large instances of protein design problems. These solvers use a linear programming relaxation of the problem, where the rotamer variables q_{r_i} and edge variables q_{r_i r_j} are allowed to take continuous values, in combination with a branch and cut algorithm to search only a small portion of the conformation space for the optimal solution. ILP solvers have been shown to solve many instances of the side-chain placement problem.[30]

Message-passing based approximations to the linear programming dual


ILP solvers depend on linear programming (LP) algorithms, such as the Simplex or barrier-based methods, to perform the LP relaxation at each branch. These LP algorithms were developed as general-purpose optimization methods and are not optimized for the protein design problem (Equation (1)). In consequence, the LP relaxation becomes the bottleneck of ILP solvers when the problem size is large.[31] Recently, several alternatives based on message-passing algorithms have been designed specifically for the optimization of the LP relaxation of the protein design problem. These algorithms can approximate either the dual or the primal instances of the integer program, but in order to maintain guarantees on optimality they are most useful when used to approximate the dual of the protein design problem, because approximating the dual guarantees that no solutions are missed. Message-passing based approximations include the tree reweighted max-product message passing algorithm,[32][33] and the message passing linear programming algorithm.[34]

Optimization algorithms without guarantees


Monte Carlo and simulated annealing


Monte Carlo is one of the most widely used algorithms for protein design. In its simplest form, a Monte Carlo algorithm selects a residue at random, and in that residue a randomly chosen rotamer (of any amino acid) is evaluated.[22] The new energy of the protein, Enew, is compared against the old energy Eold, and the new rotamer is accepted with a probability of:

  P_{accept} = \min\Big( 1, \; e^{-(E_{new} - E_{old}) / (\beta T)} \Big)

where β is the Boltzmann constant and the temperature T can be chosen such that in the initial rounds it is high and it is slowly annealed to overcome local minima.[13]
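A minimal sketch of this scheme follows, on a hypothetical three-residue instance; for simplicity the factor βT is folded into a single temperature parameter that is annealed each step:

```python
import math
import random

def simulated_annealing(E_single, E_pair, n_steps=20000, seed=0):
    """Monte Carlo search with simulated annealing over rotamer assignments.
    Each step mutates one randomly chosen residue to a random rotamer and
    accepts the move with probability min(1, exp(-(E_new - E_old) / T));
    T starts high to escape local minima and is slowly lowered."""
    rng = random.Random(seed)
    n_pos = len(E_single)

    def energy(c):
        e = sum(E_single[i][r] for i, r in enumerate(c))
        for (i, j), table in E_pair.items():
            e += table[c[i]][c[j]]
        return e

    conf = [rng.randrange(len(rotamers)) for rotamers in E_single]
    e_old = energy(conf)
    best, best_e = list(conf), e_old
    T = 5.0
    for _ in range(n_steps):
        i = rng.randrange(n_pos)
        old_r = conf[i]
        conf[i] = rng.randrange(len(E_single[i]))
        e_new = energy(conf)
        if e_new <= e_old or rng.random() < math.exp(-(e_new - e_old) / T):
            e_old = e_new                 # accept the move
            if e_new < best_e:
                best, best_e = list(conf), e_new
        else:
            conf[i] = old_r               # reject: revert the move
        T = max(0.01, T * 0.999)          # anneal the temperature
    return tuple(best), best_e

# Hypothetical three-residue instance.
E_single = [[1.0, 0.5], [0.2, 0.8, 0.3], [0.6, 0.1]]
E_pair = {
    (0, 1): [[0.0, 0.4, 0.1], [0.3, 0.0, 0.2]],
    (0, 2): [[0.2, 0.0], [0.0, 0.5]],
    (1, 2): [[0.1, 0.0], [0.0, 0.3], [0.2, 0.1]],
}
conf, e = simulated_annealing(E_single, E_pair)
print(conf, round(e, 3))
```

Unlike the exact methods above, nothing guarantees the returned conformation is the GMEC; on this tiny search space the annealer finds it easily, which is exactly the heuristic trade-off the text describes.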

FASTER


The FASTER algorithm uses a combination of deterministic and stochastic criteria to optimize amino acid sequences. FASTER first uses DEE to eliminate rotamers that are not part of the optimal solution. Then, a series of iterative steps optimize the rotamer assignment.[35][36]

Belief propagation


In belief propagation for protein design, the algorithm exchanges messages that describe the belief that each residue has about the probability of each rotamer in neighboring residues. The algorithm updates messages on every iteration and iterates until convergence or until a fixed number of iterations. Convergence is not guaranteed in protein design. The message m_{i→j}(r_j) that residue i sends to each rotamer r_j at neighboring residue j is defined as:

  m_{i \to j}(r_j) = \min_{r_i} \Big( E(r_i) + E(r_i, r_j) + \sum_{k \in N(i) \setminus \{j\}} m_{k \to i}(r_i) \Big)

Both max-product and sum-product belief propagation have been used to optimize protein design.
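The message update can be sketched as min-sum message passing (the energy-space counterpart of max-product). The instance below is hypothetical and chain-shaped, so the interaction graph is a tree and min-sum is exact; on loopy graphs the text's caveat about convergence applies:

```python
def min_sum_bp(E_single, E_pair, edges, n_iters=10):
    """Min-sum belief propagation over rotamer assignments. Exact on trees;
    convergence is not guaranteed on loopy interaction graphs."""
    nbrs = {i: set() for i in range(len(E_single))}
    for i, j in edges:
        nbrs[i].add(j)
        nbrs[j].add(i)

    def pair(i, j, ri, rj):
        if (i, j) in E_pair:
            return E_pair[(i, j)][ri][rj]
        return E_pair[(j, i)][rj][ri]

    # m[(i, j)][rj]: message residue i sends about rotamer rj at residue j.
    m = {(i, j): [0.0] * len(E_single[j]) for i in nbrs for j in nbrs[i]}
    for _ in range(n_iters):
        m = {(i, j): [min(E_single[i][ri] + pair(i, j, ri, rj)
                          + sum(m[(k, i)][ri] for k in nbrs[i] - {j})
                          for ri in range(len(E_single[i])))
                      for rj in range(len(E_single[j]))]
             for (i, j) in m}

    # Decode: each residue picks the rotamer that minimizes its belief,
    # i.e. its own energy plus all incoming messages.
    conf = []
    for j in range(len(E_single)):
        beliefs = [E_single[j][rj] + sum(m[(k, j)][rj] for k in nbrs[j])
                   for rj in range(len(E_single[j]))]
        conf.append(beliefs.index(min(beliefs)))
    return tuple(conf)

# Hypothetical chain instance: only residues 0-1 and 1-2 interact.
E_single = [[1.0, 0.5], [0.2, 0.8, 0.3], [0.6, 0.1]]
E_pair = {
    (0, 1): [[0.0, 0.4, 0.1], [0.3, 0.0, 0.2]],
    (1, 2): [[0.1, 0.0], [0.0, 0.3], [0.2, 0.1]],
}
conf = min_sum_bp(E_single, E_pair, edges=[(0, 1), (1, 2)])
print(conf)
```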

Applications and examples of designed proteins


Enzyme design


The design of new enzymes is a use of protein design with huge bioengineering and biomedical applications. In general, designing a protein structure differs from designing an enzyme, because the design of enzymes must consider the many states involved in the catalytic mechanism. However, protein design is a prerequisite of de novo enzyme design because, at the very least, the design of catalysts requires a scaffold into which the catalytic mechanism can be inserted.[37]

Great progress in de novo enzyme design, and redesign, was made in the first decade of the 21st century. In three major studies, David Baker and coworkers de novo designed enzymes for the retro-aldol reaction,[38] a Kemp-elimination reaction,[39] and for the Diels-Alder reaction.[40] Furthermore, Stephen Mayo and coworkers developed an iterative method to design the most efficient known enzyme for the Kemp-elimination reaction.[41] Also, in the laboratory of Bruce Donald, computational protein design was used to switch the specificity of one of the protein domains of the nonribosomal peptide synthetase that produces Gramicidin S, from its natural substrate phenylalanine to other noncognate substrates including charged amino acids; the redesigned enzymes had activities close to those of the wild-type.[42]

Semi-rational design


Semi-rational design is a purposeful modification method based on a certain understanding of the sequence, structure, and catalytic mechanism of enzymes. The method sits between irrational design and rational design. It uses known information and techniques to perform evolutionary modification of the specific functions of the target enzyme. The characteristic of semi-rational design is that it does not rely solely on random mutation and screening, but combines these with the concept of directed evolution. It creates a library of random mutants with diverse sequences through mutagenesis, error-prone PCR, DNA recombination, and site-saturation mutagenesis. At the same time, it uses an understanding of enzymes and design principles to purposefully screen out mutants with the desired characteristics.

The methodology of semi-rational design emphasizes the in-depth understanding of enzymes and the control of the evolutionary process. It allows researchers to use known information to guide the evolutionary process, thereby improving efficiency and success rate. This method plays an important role in protein function modification because it can combine the advantages of irrational design and rational design, and can explore unknown space and use known knowledge for targeted modification.

Semi-rational design has a wide range of applications, including but not limited to enzyme optimization, modification of drug targets, evolution of biocatalysts, etc. Through this method, researchers can more effectively improve the functional properties of proteins to meet specific biotechnology or medical needs. Although this method has high requirements for information and technology and is relatively difficult to implement, with the development of computing technology and bioinformatics, the application prospects of semi-rational design in protein engineering are becoming more and more broad.[43]

Design for affinity


Protein–protein interactions are involved in most biotic processes. Many of the hardest-to-treat diseases, such as Alzheimer's, many forms of cancer (e.g., TP53), and human immunodeficiency virus (HIV) infection involve protein–protein interactions. Thus, to treat such diseases, it is desirable to design protein or protein-like therapeutics that bind one of the partners of the interaction and, thus, disrupt the disease-causing interaction. This requires designing protein-therapeutics for affinity toward its partner.

Protein–protein interactions can be designed using protein design algorithms because the principles that rule protein stability also rule protein–protein binding. Protein–protein interaction design, however, presents challenges not commonly present in protein design. One of the most important challenges is that, in general, the interfaces between proteins are more polar than protein cores, and binding involves a tradeoff between desolvation and hydrogen bond formation.[44] To overcome this challenge, Bruce Tidor and coworkers developed a method to improve the affinity of antibodies by focusing on electrostatic contributions. They found that, for the antibodies designed in the study, reducing the desolvation costs of the residues in the interface increased the affinity of the binding pair.[44][45][46]

Scoring binding predictions


Protein design energy functions must be adapted to score binding predictions because binding involves a trade-off between the lowest-energy conformations of the free proteins (EP and EL) and the lowest-energy conformation of the bound complex (EPL):

  \Delta G_{bind} = E_{PL} - E_P - E_L .

The K* algorithm approximates the binding constant by including conformational entropy in the free energy calculation. The K* algorithm considers only the lowest-energy conformations of the free and bound complexes (denoted by the sets P, L, and PL) to approximate the partition function of each complex:[15]

  K^* = \frac{\sum_{x \in PL} e^{-E(x)/RT}}{\big( \sum_{x \in P} e^{-E(x)/RT} \big) \big( \sum_{x \in L} e^{-E(x)/RT} \big)}
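This partition-function ratio can be sketched directly. The ensembles and energies below are hypothetical; R is the gas constant in kcal/(mol·K) and T is room temperature:

```python
import math

R, T = 0.0019872, 298.0  # gas constant in kcal/(mol.K), room temperature in K

def partition(energies):
    """Boltzmann-weighted sum over an ensemble of low-energy conformations."""
    return sum(math.exp(-e / (R * T)) for e in energies)

def k_star(E_PL, E_P, E_L):
    """K*-style score: partition function of the bound complex divided by the
    product of the partition functions of the free protein and ligand."""
    return partition(E_PL) / (partition(E_P) * partition(E_L))

# Hypothetical conformational ensembles (energies in kcal/mol).
score = k_star(E_PL=[-12.1, -11.8, -11.5],
               E_P=[-5.2, -5.0],
               E_L=[-4.1, -3.9])
print(score)
```

A sequence whose bound ensemble is stabilized relative to its free partners gets a larger score, so candidate sequences can be ranked by it.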

Design for specificity


The design of protein–protein interactions must be highly specific because proteins can interact with a large number of proteins; successful design requires selective binders. Thus, protein design algorithms must be able to distinguish between on-target (or positive design) and off-target binding (or negative design).[2][44] One of the most prominent examples of design for specificity is the design of specific bZIP-binding peptides by Amy Keating and coworkers for 19 out of the 20 bZIP families; 8 of these peptides were specific for their intended partner over competing peptides.[44][47][48] Further, positive and negative design was also used by Anderson and coworkers to predict mutations in the active site of a drug target that conferred resistance to a new drug; positive design was used to maintain wild-type activity, while negative design was used to disrupt binding of the drug.[49] Recent computational redesign by Costas Maranas and coworkers was also capable of experimentally switching the cofactor specificity of Candida boidinii xylose reductase from NADPH to NADH.[50]

Protein resurfacing


Protein resurfacing consists of designing a protein's surface while preserving its overall fold, core, and boundary regions. Protein resurfacing is especially useful for altering the binding of a protein to other proteins. One of the most important applications of protein resurfacing was the design of the RSC3 probe to select broadly neutralizing HIV antibodies at the NIH Vaccine Research Center. First, residues outside of the binding interface between the gp120 HIV envelope protein and the previously discovered b12 antibody were selected for design. Then, the sequence space was selected based on evolutionary information, solubility, similarity with the wild-type, and other considerations. The RosettaDesign software was then used to find optimal sequences in the selected sequence space. RSC3 was later used to discover the broadly neutralizing antibody VRC01 in the serum of a long-term HIV-infected non-progressor individual.[51]

Design of globular proteins


Globular proteins are proteins that contain a hydrophobic core and a hydrophilic surface. Globular proteins often assume a stable structure, unlike fibrous proteins, which have multiple conformations. The three-dimensional structure of globular proteins is typically easier to determine through X-ray crystallography and nuclear magnetic resonance than that of both fibrous proteins and membrane proteins, which makes globular proteins more attractive for protein design than the other types of proteins. Most successful protein designs have involved globular proteins. RSD-1 and Top7 were both de novo designs of globular proteins. Five more protein structures were designed, synthesized, and verified in 2012 by the Baker group. These new proteins serve no biological function, but the structures are intended to act as building blocks that can be expanded to incorporate functional active sites. The structures were found computationally by using new heuristics based on analyzing the connecting loops between parts of the sequence that specify secondary structures.[52]

Design of membrane proteins


Several transmembrane proteins have been successfully designed,[53] along with many other membrane-associated peptides and proteins.[54] Recently, Costas Maranas and coworkers developed an automated tool[55] to redesign the pore size of Outer Membrane Porin Type-F (OmpF) from E. coli to any desired sub-nm size and assembled the redesigned pores in membranes to perform precise angstrom-scale separations.

Other applications


One of the most desirable uses for protein design is for biosensors, proteins that will sense the presence of specific compounds. Some attempts in the design of biosensors include sensors for unnatural molecules, including TNT.[56] More recently, Kuhlman and coworkers designed a biosensor of the protein PAK1.[57]


from Grokipedia
Protein design is the interdisciplinary field of engineering proteins with novel three-dimensional structures and functions, typically by computationally determining sequences that fold into predefined conformations, often from scratch in a process known as de novo design. This approach inverts the classical folding problem: the goal shifts from predicting a structure from a sequence to inventing sequences for targeted structures, leveraging principles of biophysical stability, energy minimization, and evolutionary insight. Emerging as a cornerstone of synthetic biology, protein design enables the creation of proteins that natural evolution has not produced, with applications in therapeutics, biosensing, and nanomaterials. The field originated in the late 1980s with pioneering efforts to design simple helical bundles, including the first water-soluble, cooperatively folded four-helix bundle protein (α4) in 1987, which demonstrated that proteins could be rationally engineered using physicochemical principles without natural templates. Early advances focused on metalloproteins and basic motifs, such as the 1990 design of a zinc-binding protein, but computational limitations restricted complexity until the development of fragment-based methods in the 2000s. A landmark achievement came in 2003 with Top7, the first fully de novo protein featuring a novel fold, verified by X-ray crystallography at 2.5 Å resolution, marking the transition to designing unprecedented topologies. Key methods in protein design combine computational modeling with experimental validation, including energy-based optimization via software such as Rosetta, which uses rotamer libraries and Monte Carlo sampling to explore sequence-structure space. Recent breakthroughs integrate deep learning, such as the 2023 RFdiffusion model, a diffusion-based generative tool that produces diverse protein backbones and assemblies with up to 50% experimental success rates, enabling symmetric assemblies and functional motif scaffolding.
The 2024 Nobel Prize in Chemistry highlighted these innovations, awarding David Baker for computational protein design—pioneering de novo proteins since Top7—and Demis Hassabis and John Jumper for AlphaFold2, which in 2020 achieved near-atomic accuracy in structure prediction, accelerating design cycles by informing sequence optimization with tools like ProteinMPNN. These AI-driven advances have boosted design fidelity, with success rates exceeding 10–20% for complex binders and >50% for stabilized scaffolds in recent studies. Protein design has transformative applications, including high-affinity binders for therapeutics, such as nanomolar inhibitors of viral proteins and cancer immune checkpoints, and self-assembling nanomaterials for drug delivery or vaccines, as seen in RSV vaccine designs. In environmental biotechnology, it facilitates custom enzymes to degrade plastics or PFAS pollutants. In synthetic biology, it yields programmable switches and sensors for cellular engineering, such as auxin-responsive biosensors. Ongoing challenges include enhancing functional diversity, stability, and scalability, but with AI integration the field promises modular synthetic proteins for precision medicine and sustainable technologies, and advances as of 2025 continue to expand these capabilities.

Introduction

Definition and principles

Protein design is the computational engineering of amino acid sequences to fold into specified three-dimensional structures or perform targeted functions, representing the inverse of natural protein folding, in which sequences determine structures. Unlike forward folding, which predicts structures from given sequences, protein design starts with a desired backbone or functional motif and generates compatible sequences that minimize free energy while achieving stability and specificity. This approach leverages biophysical principles such as hydrophobic packing, hydrogen bonding, and electrostatic interactions to ensure the designed proteins adopt the intended conformation. Key principles distinguish rational design, which modifies existing natural proteins by optimizing sequences around known scaffolds to enhance properties like stability or binding affinity, from de novo design, which creates entirely novel proteins without relying on natural templates. Rational design employs biophysical models to perturb sequences incrementally, often guided by evolutionary data or structural databases, while de novo design enumerates unprecedented folds using geometric constraints and energy minimization to explore beyond natural diversity. The basic workflow involves specifying a target structure, optimizing sequences via scoring functions that evaluate energetic compatibility, and validating designs through simulations or experimental assays. Protein design's importance lies in its ability to produce custom proteins that surpass natural limitations, enabling applications in medicine, such as novel therapeutics and vaccines, and in industry for biocatalysts and biomaterials. By transcending evolutionary constraints, it facilitates the creation of proteins with tailored properties, like high-affinity binders or symmetric assemblies, accelerating innovation in biotechnology.
Up to 2025, the field has shifted from purely physics-based methods to hybrid AI-physics approaches, exemplified by AlphaFold's accurate structure prediction enabling inverse design pipelines and RFdiffusion's generative modeling for de novo backbones.

Historical overview

The foundations of protein design were laid in the mid-20th century, building on insights into protein folding and structure. In 1973, Christian Anfinsen proposed the thermodynamic hypothesis, often referred to as Anfinsen's dogma, stating that the native structure of a protein is determined by its sequence under physiological conditions, as the sequence encodes the information needed to minimize free energy and achieve the lowest-energy conformation. This principle, derived from experiments on ribonuclease A refolding, provided the theoretical basis for designing sequences that could fold into predetermined structures. Early efforts in the 1970s and 1980s focused on manual, rational design of simple motifs, such as alpha-helical bundles, to test these ideas. A landmark example was William DeGrado's 1988 design of a four-helix bundle protein, synthesized from peptides that self-assembled into a stable, helical structure matching the intended model, demonstrating that de novo sequences could mimic natural folds. The 1990s marked the transition to computational methods, enabling systematic exploration of sequence space. David Baker's lab developed the Rosetta software suite starting in the mid-1990s, initially for ab initio structure prediction by assembling fragments from known protein structures using Monte Carlo sampling and energy minimization. A key algorithmic advance was the dead-end elimination (DEE) theorem introduced in 1992, which efficiently prunes suboptimal side-chain rotamers during optimization, drastically reducing the combinatorial search space for protein design. Building on this, John Desjarlais and Tracy Handel applied DEE in 1995 to redesign hydrophobic cores of proteins like thioredoxin, generating sequences that maintained stability and structure comparable to wild-type, validating computational core repacking as a viable design strategy. In the 2000s, computational design achieved novel folds and functions, shifting from motif mimicry to de novo creation.
Brian Kuhlman and colleagues in Baker's lab reported in 2003 the design of Top7, the first protein with a novel fold not observed in nature, in which a 93-residue polypeptide folded into a mixed alpha-beta structure with atomic accuracy (RMSD 1.6 Å to the model), confirmed by X-ray crystallography. Progress accelerated with functional designs; in 2008, the same group engineered de novo enzymes catalyzing the Kemp elimination, achieving rate accelerations up to 10^6-fold through active-site optimization in computationally generated scaffolds. These successes highlighted the potential for designing proteins with tailored catalytic properties. The 2010s saw expansions to complex architectures, particularly symmetric assemblies, while exposing challenges in certain classes such as membrane proteins. Baker's lab designed self-assembling protein cages, such as a 120-subunit icosahedral nanocage with high thermal stability (melting temperature >100 °C), enabling applications in drug delivery and vaccines. Efforts to design membrane proteins lagged due to difficulties in modeling membrane environments and conformational dynamics, with early successes limited to small helical bundles rather than full transporters. The 2020s ushered in an AI-driven revolution, leveraging deep learning for unprecedented generative capabilities. DeepMind's AlphaFold2, released in 2020, achieved near-experimental accuracy in structure prediction (median GDT-TS 92.4 on CASP14 targets), inverting the design process by allowing back-prediction of sequences from structures. The Baker lab's RoseTTAFold in 2021 extended this with a three-track neural network for joint sequence-structure co-design, enabling rapid generation of binder proteins. Generative models proliferated, including RFdiffusion (2023), a diffusion-based method that hallucinates novel backbones conditioned on motifs, yielding designs with 40% experimental success rates for diverse folds.
Concurrently, the hallucination paradigm, refined in 2023, optimized random sequences against structure-prediction losses, producing luciferases and repeat proteins with novel topologies validated by cryo-EM. By 2025, these AI tools continued to advance scalable protein design methods, such as relaxed sequence optimization, enabling the creation of larger proteins and high-affinity interactions with structural validation, and further expanding the applications of de novo design.

Fundamentals of Protein Structure

Hierarchical structure levels

Proteins exhibit a hierarchy of structure that serves as the foundational framework for computational and rational design efforts, allowing engineers to specify target architectures at multiple scales without preconceived sequence biases. This hierarchy comprises four levels—primary, secondary, tertiary, and quaternary—each building upon the previous to dictate stability, function, and interactions. Understanding these levels is essential for protein design, as it enables the independent manipulation of backbone geometries and subunit arrangements to achieve desired properties, such as enhanced enzymatic activity or novel binding affinities. The primary structure refers to the linear sequence of amino acids linked by peptide bonds, which constitutes the fundamental blueprint for all higher-order folding and serves as the primary input variable in de novo protein design. This sequence, determined experimentally through protein sequencing methods, dictates the chemical properties and potential interactions that drive subsequent structural assembly, as exemplified by Frederick Sanger's sequencing of insulin, which revealed the precise order of its 51 amino acids across two chains connected by disulfide bonds. In design contexts, specifying or optimizing the primary structure allows for targeted modifications, such as introducing cysteines for disulfide bridging or polar residues for solubility, while ensuring compatibility with intended folds. Secondary structure encompasses local, repeating patterns stabilized primarily by hydrogen bonds between backbone atoms, including alpha-helices, beta-sheets, and connecting loops or turns that contribute to overall rigidity and functional motifs. Alpha-helices feature a right-handed coil with 3.6 residues per turn, while beta-sheets form pleated arrangements of hydrogen-bonded strands, either parallel or antiparallel, as first proposed by Linus Pauling and Robert Corey based on stereochemical constraints.
These elements are critical for design because they provide modular scaffolds for stability; for instance, packing helices into bundles or sheets into barrels enhances thermal resilience, informing the selection of backbones that support catalytic sites or ligand-binding pockets without sequence-dependent biases. Tertiary structure describes the global three-dimensional folding of a single polypeptide chain, achieved through long-range interactions such as hydrophobic collapse into a core, hydrogen bonds, electrostatic forces, and disulfide bridges that minimize free energy and yield a compact, functional conformation. Christian Anfinsen's experiments on ribonuclease A demonstrated that the native tertiary fold is thermodynamically determined by the primary sequence under physiological conditions, underscoring the principle that design targets must prioritize energetically favorable arrangements, like burying nonpolar residues to form stable cores. In de novo design, tertiary specification involves defining domain architectures—such as all-alpha or mixed alpha-beta motifs—to encode specific functions, enabling the creation of proteins with novel topologies for therapeutic applications. Quaternary structure arises when multiple polypeptide chains (subunits) assemble into a multi-subunit complex, stabilized by non-covalent interactions and sometimes covalent links, resulting in symmetric or asymmetric oligomers that amplify function, such as cooperative binding. Max Perutz's crystallographic analysis of hemoglobin revealed its tetrameric arrangement of two alpha and two beta chains, with interfaces enabling cooperative oxygen binding, highlighting how quaternary design can introduce regulatory mechanisms or increased stability. For protein engineers, targeting quaternary levels allows the construction of oligomeric assemblies, like symmetric cages or signaling complexes, by specifying subunit interfaces that promote self-assembly and enhance stability or specificity.
Visualization of these hierarchical levels is facilitated by resources like the Protein Data Bank (PDB), which archives experimentally determined structures, and software such as PyMOL, which renders atomic models to inspect folds, interfaces, and dynamics at resolutions down to angstroms. This capability is a prerequisite for design workflows, as it permits the abstraction of backbones from natural templates or ideal geometries, decoupling structure specification from evolutionary sequence constraints to innovate novel proteins.
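The secondary-structure geometry above lends itself to a quick calculation. The 3.6 residues per turn figure comes from the text; the 5.4 Å helical pitch (1.5 Å rise per residue) is a standard textbook value assumed here for illustration.

```python
def helix_geometry(n_residues, residues_per_turn=3.6, pitch_angstrom=5.4):
    """Approximate dimensions of an ideal alpha-helix.
    residues_per_turn=3.6 is from the text; pitch_angstrom=5.4 is an
    assumed textbook value (1.5 Angstrom rise per residue)."""
    turns = n_residues / residues_per_turn
    length = turns * pitch_angstrom
    return turns, length

# An 18-residue helical segment, as might appear in a designed bundle.
turns, length = helix_geometry(18)
print(round(turns, 1), round(length, 1))  # 5.0 turns, 27.0 Angstroms
```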

Sequence-to-structure mapping

The sequence-to-structure mapping refers to the biophysical process by which an amino acid sequence determines the three-dimensional structure of a protein through folding. This mapping is central to protein design, as designing novel proteins requires predicting how a proposed sequence will fold into a desired structure. Levinthal's paradox highlights the computational intractability of this process: for a 100-residue protein assuming approximately three possible conformations per residue, the total number of possible conformations is on the order of 3^100 ≈ 5 × 10^47, far more than could be sampled within the age of the universe even at picosecond rates. This paradox is resolved by the folding funnel concept, where the energy landscape guides the protein toward the native state via a biased, downhill pathway rather than random sampling, minimizing frustration and enabling folding on biologically relevant timescales. Folding mechanisms underpin this mapping, as articulated by Anfinsen's thermodynamic hypothesis, which posits that the native structure is the global free energy minimum determined solely by the sequence under physiological conditions. In vivo, molecular chaperones assist this process by preventing aggregation and facilitating proper folding pathways, particularly for larger proteins. The vastness of sequence space further complicates the mapping: for a 100-residue protein, there are 20^100 ≈ 10^130 possible sequences, yet natural proteins represent only a minuscule fraction of the total space. This sparsity underscores the evolutionary selection for sequences that reliably map to functional structures. The entropy of sequence diversity can be quantified using Shannon entropy, S = −Σ_i p_i log p_i, where p_i is the probability of the i-th amino acid at a position, highlighting the information content required for specific folding. Advances in structure prediction have revolutionized understanding of sequence-to-structure mapping.
Prior to 2020, methods relied heavily on homology modeling, which aligned query sequences to known structures using templates like those in the Protein Data Bank, achieving moderate accuracy for homologous proteins but struggling with novel folds. Post-2020, deep learning approaches dramatically improved predictions; AlphaFold2 achieved near-atomic accuracy across diverse structures, while AlphaFold3 extended this to multimers, ligands, and modifications with median backbone RMSDs below 1 Å for many complexes. In protein design, the inverse folding problem—finding sequences that fold to a target structure—has seen success rates evolve from below 10% in the 2000s, limited by simplistic energy models and computational power, to 10–50% or higher in the 2020s using integrated physics- and machine learning-based methods. These improvements enable the generation of stable, functional proteins, bridging the gap between sequence prediction and de novo design.
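The counting arguments above are easy to reproduce exactly with integer arithmetic, along with the Shannon entropy formula for per-position sequence diversity. This is a direct sketch of the quantities in the text, not part of any design pipeline.

```python
import math

def conformational_count(n_residues, states_per_residue=3):
    """Levinthal-style estimate: states_per_residue^n conformations."""
    return states_per_residue ** n_residues

def sequence_count(n_residues, alphabet_size=20):
    """Size of sequence space: 20^N for an N-residue protein."""
    return alphabet_size ** n_residues

def shannon_entropy(probs):
    """S = -sum_i p_i log p_i over amino-acid frequencies at one position."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# 100-residue chain: ~3^100 (about 5e47) conformations, 20^100 (about 1e130) sequences.
print(f"{conformational_count(100):.2e}")
print(f"{sequence_count(100):.2e}")

# A fully conserved position carries zero entropy; a uniform one carries log(20).
print(shannon_entropy([1.0]), round(shannon_entropy([0.05] * 20), 3))
```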

Conformational flexibility and dynamics

Proteins are not static structures but exhibit conformational flexibility, which is essential for biological functions such as enzymatic catalysis, ligand binding, and allosteric signaling. In protein design, accounting for this flexibility is crucial to ensure stability, prevent misfolding, and enable functional dynamics, as rigid designs may fail to mimic native behaviors. Conformational flexibility manifests in several forms, including side-chain rotamers that allow discrete torsional adjustments for optimizing interactions and adapting to environments; backbone fluctuations that permit local hinge-like movements and loop adjustments; and allostery, where perturbations at one site propagate structural changes to distant regions, modulating activity. These dynamics arise from thermal motions and are influenced by amino acid composition, with designs needing to balance rigidity for folding and flexibility for function. Normal mode analysis provides a computational framework to model protein dynamics by identifying low-frequency vibrational modes that capture large-scale, collective motions such as domain shifts or helix rotations, which are relevant for predicting functional transitions in designed proteins. This approach, often using elastic network models, efficiently approximates essential dynamics without exhaustive simulations, aiding designers in incorporating anticipated flexibility into target structures. Ensemble views of proteins emphasize that conformations follow a Boltzmann distribution, where states are populated according to their relative energies, necessitating designs that stabilize desired ensembles rather than single structures to achieve robust function. Machine learning methods trained on structural data can generate such ensembles rapidly, ensuring compatibility across multiple states and avoiding entrapment in suboptimal conformations.
Challenges in incorporating conformational flexibility include the risk of over-stabilization, which can induce rigidity and impair adaptive functions, and underestimation of dynamics, leading to sequences prone to misfolding or aggregation due to unexplored alternative states. These issues highlight the need for multi-state optimization to smooth energy landscapes and promote funnel-like folding pathways. Experimental validation of designed protein flexibility relies on techniques like nuclear magnetic resonance (NMR) spectroscopy, which resolves multistate structures, and molecular dynamics (MD) simulations, which quantify motional amplitudes; for example, deep learning-designed dynamic proteins have shown conformational equilibria and interaction networks matching predictions, with NMR confirming atomic-level precision in flexible states comparable to those in native proteins. Recent advances in 2025 integrate AI with molecular simulation for flexible designs, such as AlphaFold-Metainference, which leverages AlphaFold-predicted distances as restraints in replica-exchange simulations to generate Boltzmann-consistent ensembles of disordered and partially structured proteins, improving agreement with experimental data. This approach enables efficient exploration of dynamic landscapes, facilitating the creation of proteins with tailored flexibility for applications in sensing and regulation.
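The Boltzmann-weighted ensemble view described above can be sketched numerically: given state energies, relative populations follow from their Boltzmann factors. The two-state energies below are hypothetical values for illustration.

```python
import math

def boltzmann_populations(energies, rt=0.593):
    """Relative populations of conformational states from their energies
    (kcal/mol), following the Boltzmann distribution described in the text.
    rt=0.593 kcal/mol corresponds to roughly 298 K."""
    weights = [math.exp(-e / rt) for e in energies]
    z = sum(weights)  # partition function over the listed states
    return [w / z for w in weights]

# Two-state design target: a ground state and a 1 kcal/mol excited state.
pops = boltzmann_populations([0.0, 1.0])
print([round(p, 3) for p in pops])
```

A multi-state design goal can then be phrased as tuning sequence choices so the populations of the intended states match a target profile rather than collapsing into a single conformation.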

Design Principles and Challenges

Target structure specification

Target structure specification in protein design involves defining the desired three-dimensional backbone or functional motif as a starting point for subsequent sequence optimization, ensuring the geometry supports stability, novelty, and potential function. This step is crucial because the backbone dictates the overall topology, secondary structure elements, and spatial arrangement of residues, which in turn influence foldability and interactions. Designers typically generate or select scaffolds that avoid existing natural folds to enable de novo creation, while incorporating features like binding pockets or active sites for targeted applications. Several methods exist for specifying target structures. Enumerative approaches systematically assemble idealized secondary structure elements, such as alpha-helices and beta-sheets, from a predefined library of building blocks to enumerate possible topologies exhaustively, as demonstrated in algorithms that generate diverse pocket geometries in NTF2-fold scaffolds. Fragment assembly, pioneered in the Rosetta software suite, involves stitching together short segments (typically 3–9 residues) derived from known protein structures in the Protein Data Bank (PDB) to build novel backbones, reducing the search space while maintaining physical realism; this method was key to early de novo designs through iterative conformational sampling via Monte Carlo optimization. More recently, generative models based on diffusion processes have emerged, particularly post-2020, where noise is added to and then denoised from protein coordinates to produce diverse scaffolds conditioned on constraints like symmetry or motifs, enabling rapid generation of unprecedented folds. Key criteria guide the selection of target structures.
For stability, backbones are evaluated using metrics like the Template Modeling (TM) score, where values above 0.5 indicate a high likelihood of adopting the intended fold upon sequence realization, as this threshold correlates with topological similarity to native proteins. Novelty is assessed by ensuring no close homologs exist in the PDB, often via structural alignment tools like DALI or TM-align, to confirm the design explores untapped sequence-structure space. Functionality requires precise geometry for features such as active sites, where distances and angles must align with catalytic or binding requirements, often verified through docking simulations. Prominent tools facilitate backbone generation and functionalization. RFdiffusion, a fine-tuned RoseTTAFold-based diffusion model released in 2023, generates high-quality monomer and multimer backbones de novo or conditioned on partial motifs, achieving experimental success rates over 20% for fold validation in blind tests. Motif grafting integrates functional elements, such as active sites or epitopes, into these scaffolds using protocols that optimize loop connections and interface packing to preserve the motif geometry without steric disruption. Challenges in this specification phase include ensuring the backbone is foldable with natural amino acids, as many generated structures may lack compatible sequences due to strained geometries or unfavorable energetics. Avoiding steric clashes between non-local residues is another hurdle, requiring iterative refinement to eliminate overlaps that could destabilize the fold during realization. Seminal examples illustrate these principles. The Top7 protein, designed in 2003, used fragment assembly in Rosetta to specify a novel α/β fold with no natural homologs (TM-score <0.3 to closest PDB entries), resulting in an experimentally validated structure with 1.2 Å RMSD to the computational model.
More recently, RFdiffusion-enabled hallucination of binders in 2023 produced de novo scaffolds that bound diverse targets like IL-7 and PD-1 with nanomolar affinities, incorporating specified geometric constraints for interfaces while confirming novelty through PDB searches.
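The stability and novelty criteria above amount to a simple two-sided filter on TM-scores. The function and cutoff names below are illustrative assumptions; only the 0.5 fold threshold comes from the text, and real pipelines combine many more metrics.

```python
def passes_design_filters(tm_to_target, max_tm_to_pdb,
                          fold_cutoff=0.5, novelty_cutoff=0.5):
    """Filter sketch based on the criteria in the text: TM-score > 0.5 to
    the intended fold suggests the design adopts it, while a low best
    TM-score against the PDB suggests structural novelty. Names and the
    novelty cutoff value are illustrative assumptions."""
    folds_correctly = tm_to_target > fold_cutoff
    is_novel = max_tm_to_pdb < novelty_cutoff
    return folds_correctly and is_novel

# Hypothetical candidates: (TM-score to design model, best TM-score vs the PDB).
candidates = [(0.92, 0.31), (0.48, 0.22), (0.88, 0.74)]
kept = [c for c in candidates if passes_design_filters(*c)]
print(kept)  # only the first candidate both folds correctly and is novel
```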

Energy functions and scoring

Energy functions in protein design serve as mathematical models to assess the compatibility of an amino acid sequence with a target structure by estimating the free energy of the system. These functions typically approximate the Gibbs free energy ΔG, guiding the selection of sequences that minimize energetic frustration and stabilize the desired fold. Energy functions are broadly classified into physics-based and knowledge-based categories. Physics-based functions derive terms from fundamental physical principles, such as atomic interactions, while knowledge-based functions rely on statistical potentials extracted from structural databases like the Protein Data Bank. The Rosetta energy function exemplifies a hybrid approach, combining physics-based terms for short-range interactions with knowledge-based statistical potentials for conformational preferences. Key components of such energy functions include van der Waals interactions, modeled via Lennard-Jones potentials to capture steric repulsion and attraction; electrostatics, computed using Coulomb's law with a distance-dependent dielectric; solvation effects, often via generalized Born/surface area (GB/SA) models to account for polar and nonpolar desolvation; hydrogen bonding, with orientation-dependent terms for donor-acceptor geometry; and torsion potentials, enforcing backbone Ramachandran and side-chain rotamer preferences. The total energy is expressed as a weighted sum:

ΔE_total = Σ_i w_i E_i(θ, aa)

where w_i are empirical weights, E_i are individual terms, θ denotes conformational variables like dihedral angles, and aa represents amino acid identities. Statistical potentials in knowledge-based components use reference states derived from structural statistics, such as Boltzmann-distributed frequencies of residue pairs or backbone angles relative to an unfolded ensemble, to define favorable interactions.
These reference states enable the calculation of effective energies that correlate with observed native structures. Despite their utility, energy functions face challenges, including inaccuracies in non-native contexts, where they may overestimate the stability of hydrophobic burial or underpenalize the desolvation of polar groups, leading to suboptimal sequence rankings. Additionally, most functions omit explicit conformational entropy terms to maintain computational tractability, hindering accurate modeling of backbone and side-chain flexibility. In optimization, partial derivatives such as ∂E/∂θ for rotamer angles are computed to minimize the energy landscape efficiently. Validation of energy functions often involves correlating predicted energy changes with experimental ΔΔG values from mutagenesis studies; for instance, the Rosetta function achieves a Pearson correlation coefficient R = 0.994 for ΔΔG upon mutation on its optimization dataset, while performance on independent blind tests is typically lower (Pearson r ≈ 0.3–0.8, depending on the protocol and dataset). Recent machine learning advancements, such as those in trRosetta, have improved potentials by incorporating deep learning predictions of inter-residue orientations, enhancing accuracy in structure prediction and design tasks during the 2020s. Further developments include machine learning-based energy functions, such as deep learning-derived coarse-grained force fields that predict protein structures and dynamics with high accuracy.
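The weighted-sum form ΔE_total = Σ_i w_i E_i can be sketched directly. The term names and all numeric values below are illustrative assumptions; a production function such as Rosetta's uses many more terms with empirically fitted weights.

```python
def total_energy(terms, weights):
    """Weighted linear combination of energy terms, mirroring the
    ΔE_total = Σ_i w_i E_i form in the text. Keys and values here are
    illustrative, not Rosetta's actual terms or weights."""
    return sum(weights[name] * value for name, value in terms.items())

# Hypothetical per-term scores for one candidate sequence/conformation.
terms = {"vdw": -4.2, "elec": -1.1, "solv": 2.3, "hbond": -2.0, "torsion": 0.7}
weights = {"vdw": 1.0, "elec": 0.9, "solv": 0.8, "hbond": 1.1, "torsion": 0.6}

e = total_energy(terms, weights)
print(round(e, 2))  # -5.13: lower (more negative) totals rank as more stable
```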

Sequence space exploration

Protein sequence space exploration in design involves navigating the vast combinatorial landscape of possible amino acid sequences—estimated at 20^N for an N-residue protein—to identify those that stably adopt a target structure, without relying on exhaustive brute-force search, which is computationally infeasible. Traditional approaches discretize this space using rotamer libraries, which represent side-chain conformations observed in protein structures, such as the backbone-dependent Dunbrack library containing approximately 10 to 100 rotamers per amino acid type, derived from clustering empirical data from the Protein Data Bank. This discretization reduces the per-residue search space from continuous dihedral angles to a manageable discrete set, enabling optimization techniques like dead-end elimination to prune incompatible combinations early. Clustering further refines these libraries by grouping similar rotamers, minimizing redundancy while preserving the conformational diversity essential for realistic packing. Continuous aspects of the search, particularly backbone sampling and side-chain packing, introduce additional complexity beyond discrete rotamers. Backbone sampling generates low-energy conformational ensembles using methods like fragment assembly, allowing flexibility in phi/psi dihedrals to explore viable folds, while side-chain packing optimizes rotamer assignments conditioned on the backbone to minimize steric clashes and maximize favorable interactions. For small proteins (e.g., <50 residues), exhaustive enumeration of sequence-rotamer combinations is feasible, yielding global minima, but for larger systems approximations such as Monte Carlo sampling are employed to stochastically traverse the space, iteratively perturbing sequences and conformations to escape local minima.
Success in exploration is gauged by metrics like low-energy sequences, typically those scoring below -2 Rosetta Energy Units (REU) per residue using the Rosetta all-atom energy function, indicating thermodynamic stability comparable to natural proteins. Diversity is enhanced through Monte Carlo methods that incorporate temperature parameters to sample a broader range of viable sequences, preventing convergence to homogeneous solutions and promoting robustness. Recent advances leverage machine learning, particularly protein language models like ESM-2, which use transformer architectures trained on evolutionary sequences to generate embeddings that guide sequence sampling in underrepresented regions of the space. Post-2022 neural network approaches, including generative models, enable direct exploration of novel sequence variants by inverting structure-to-sequence mappings or conditioning on structural motifs, as demonstrated in global generative frameworks that sample across the entire protein universe. By 2025, extensions like retrieval-augmented ESM variants incorporate homologous sequences to refine predictions, accelerating discovery of diverse, functional designs.

Biosecurity risks

AI-driven de novo protein design introduces significant biosecurity risks, representing a double-edged sword by enabling the creation of arbitrary biological structures on a computer, which could be misused to engineer harmful proteins or pathogens. Advances in tools like AlphaFold and diffusion models have democratized the ability to design novel proteins with unprecedented speed and accuracy, potentially allowing non-state actors to develop biothreats without traditional laboratory infrastructure. For instance, computational design could facilitate the optimization of toxins or virulence factors beyond natural evolutionary limits, raising concerns about dual-use research. Biosecurity experts emphasize the need for enhanced screening protocols, international governance frameworks, and AI safeguards to mitigate these risks while preserving beneficial applications in health and sustainability. A 2025 report highlights how AI-enabled synthetic biology could uniquely amplify biosecurity threats through rapid iteration and accessibility.

Computational methods

Optimization formulations

Protein design is formalized as a mathematical optimization problem that seeks amino acid sequences or structural configurations compatible with a desired three-dimensional fold, typically by minimizing an energy function derived from biophysical models. The core challenge lies in navigating the enormous sequence space—approximately 20 possibilities per residue—while ensuring the designed protein adopts the target conformation with high stability and, if applicable, specific functional properties. This setup contrasts with protein structure prediction, which infers structure from sequence, by inverting the process to engineer sequences for predefined structures. A primary problem type is sequence design given a fixed target structure, formulated as minimizing the conditional energy E(\text{sequence} \mid \text{structure}), where the energy function decomposes into terms for intra-residue interactions, pairwise residue contacts, and solvation effects. For instance, the total energy is often expressed as E = E_0 + \sum_i E_i(r_i) + \sum_{i<j} E_{ij}(r_i, r_j), with r_i denoting the rotamer (discrete side-chain conformation) at residue i, E_i the unary term, and E_{ij} the pairwise term. In structure design, joint optimization extends this to simultaneously optimize sequence and backbone coordinates, coupling sequence compatibility with conformational sampling. The objective generally minimizes energy subject to foldability constraints, such as ensuring the target conformation has lower energy than decoy structures; multi-objective variants trade off stability (e.g., via folding free energy) against function (e.g., binding specificity), often yielding Pareto-optimal sets of sequences. Combinatorial and continuous formulations address the discrete or flexible nature of protein degrees of freedom.
In the combinatorial approach, side chains are discretized into rotamer libraries, leading to integer programming models: binary variables x_{i,k} = 1 if rotamer k is selected for residue i, with constraints like \sum_k x_{i,k} = 1 (one rotamer per residue) and linear inequalities preventing steric clashes (e.g., pairwise exclusion). This yields a 0/1 integer linear or quadratic program. Continuous formulations, by contrast, optimize torsion angles \phi, \psi for the backbone and \chi angles for side chains directly, relaxing the discrete search to a differentiable landscape suitable for gradient-based methods, though requiring approximations for non-convexity. The general design equation is \min_x E(x) \quad \text{s.t.} \quad g(x) \leq 0, \; h(x) = 0, where x is the sequence vector (or extended to include angles in joint cases), E(x) the energy, and the constraints g, h enforce steric feasibility and fold specificity. The discrete protein design problem is NP-hard, with computational complexity scaling exponentially in the number of residues due to the combinatorial explosion of possible assignments, necessitating approximations or heuristics for practical scales beyond small peptides. Stochastic formulations incorporate uncertainty from conformational dynamics or noisy energy estimates by optimizing expected values, such as \min_x \mathbb{E}[E(x)] over an ensemble of structures, using probabilistic sampling to model flexibility and robustness. These handle ensemble-averaged properties, like partial unfolding risks, but introduce variability in solutions compared to deterministic setups.
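The combinatorial formulation above can be made concrete on a toy instance: unary and pairwise energy tables over a few rotamers per position, with the global minimum found by exhaustive enumeration. All rotamer names and energies here are invented for illustration.

```python
from itertools import product

# Toy instance of the combinatorial design problem: three positions,
# each with a small set of candidate rotamers, and made-up unary (E_i)
# and pairwise (E_ij) energies in arbitrary units.
rotamers = {0: ["A1", "A2"], 1: ["B1", "B2", "B3"], 2: ["C1", "C2"]}
E_unary = {"A1": 1.0, "A2": 0.5, "B1": 0.2, "B2": 0.9, "B3": 0.4,
           "C1": 0.3, "C2": 0.8}
E_pair = {("A1", "B1"): -0.5, ("A2", "B3"): -1.0, ("B3", "C1"): -0.7}

def total_energy(assignment):
    """E = sum_i E_i(r_i) + sum_{i<j} E_ij(r_i, r_j); absent pairs count 0."""
    e = sum(E_unary[r] for r in assignment)
    for i in range(len(assignment)):
        for j in range(i + 1, len(assignment)):
            e += E_pair.get((assignment[i], assignment[j]), 0.0)
    return e

# Exhaustive enumeration: feasible only for tiny systems, since the
# number of assignments grows exponentially with chain length.
gmec = min(product(*rotamers.values()), key=total_energy)
print(gmec, round(total_energy(gmec), 2))
```

The one-rotamer-per-position constraint is enforced implicitly by iterating over assignments; an ILP encoding would express the same feasible set with binary indicator variables.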

Algorithms with mathematical guarantees

Algorithms with mathematical guarantees in protein design focus on exact optimization techniques that provably identify the global minimum energy conformation (GMEC) or provide tight bounds on the optimal solution, typically formulated as finding the lowest-energy sequence and rotamer assignment for a given backbone structure. These methods address the combinatorial explosion of the sequence-to-structure mapping by leveraging pruning, bounding, or integer programming to ensure optimality without exhaustive enumeration, though they are computationally intensive for large proteins. They contrast with heuristic approaches by offering formal proofs of correctness, often building on energy functions that decompose into pairwise interactions between residues. Dead-end elimination (DEE) is a cornerstone algorithm that iteratively prunes suboptimal rotamers from consideration, guaranteeing the identification of the GMEC when no further eliminations are possible. The core criterion eliminates a rotamer r_i at residue position k if its minimum possible energy over all conformations exceeds the maximum possible energy of some alternative rotamer r_j at the same position: \min_{\text{conf} \ni r_i} E(\text{conf}) > \max_{\text{conf} \ni r_j} E(\text{conf}). This is approximated using bounds on pairwise interactions, such as E(k_{r_i}) + \sum_{l \neq k} \min_{r_l} E(k_{r_i}, l_{r_l}) > E(k_{r_j}) + \sum_{l \neq k} \max_{r_l} E(k_{r_j}, l_{r_l}), enabling efficient reduction of the search space from millions to thousands of rotamer combinations per site. Introduced in its generalized form for protein design, DEE has been extended with perturbations (DEEPer) to handle continuous side-chain flexibility by sampling perturbations around discrete rotamers and tightening bounds iteratively. Multistate variants, like type-dependent DEE, further prune by considering multiple target conformations simultaneously.
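A minimal sketch of the DEE pruning criterion, applying the bounded form above to a hypothetical two-position instance; all energies are invented.

```python
# Dead-end elimination on a toy instance. pair(k, r, l, s) returns the
# pairwise energy between rotamer r at position k and rotamer s at l.
positions = {0: ["r1", "r2"], 1: ["s1", "s2"]}
E_self = {(0, "r1"): 5.0, (0, "r2"): 0.0, (1, "s1"): 0.0, (1, "s2"): 0.1}
E_pair = {(0, "r1", 1, "s1"): 2.0, (0, "r1", 1, "s2"): 3.0,
          (0, "r2", 1, "s1"): 0.0, (0, "r2", 1, "s2"): 1.0}

def pair(k, r, l, s):
    return E_pair.get((k, r, l, s), E_pair.get((l, s, k, r), 0.0))

def dee_prune(k, r):
    """Prune rotamer r at position k if some alternative t always beats it:
    E(k_r) + sum_l min_s E(k_r, l_s) > E(k_t) + sum_l max_s E(k_t, l_s)."""
    others = [l for l in positions if l != k]
    best_case_r = E_self[(k, r)] + sum(
        min(pair(k, r, l, s) for s in positions[l]) for l in others)
    for t in positions[k]:
        if t == r:
            continue
        worst_case_t = E_self[(k, t)] + sum(
            max(pair(k, t, l, s) for s in positions[l]) for l in others)
        if best_case_r > worst_case_t:
            return True  # r can never be part of the GMEC
    return False

print([(k, r) for k in positions for r in positions[k] if dee_prune(k, r)])
```

Here rotamer r1 at position 0 is eliminated: even its best case (7.0) is worse than r2's worst case (1.0), so no full conformation containing r1 can be the GMEC.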
Branch-and-bound (BnB) algorithms perform an exact tree search over the rotamer space, using upper and lower energy bounds to prune branches that cannot contain the GMEC, thus guaranteeing optimality while avoiding full enumeration. The search proceeds depth-first or best-first, evaluating partial assignments and discarding subtrees where the lower bound exceeds the current best upper bound on the global energy. A* variants enhance the search by incorporating admissible heuristics, such as relaxations of the energy function, to guide the expansion toward low-energy regions; for instance, BroMAP combines BnB with mean-field approximations for tighter bounds in multistate designs. BnB formulations exploit the graphical structure of protein interaction graphs to decompose the problem, reducing complexity for symmetric or modular proteins. These methods have successfully designed sequences for target folds by exhaustively exploring constrained spaces. Integer programming (often formulated as a mixed-integer quadratic program (MIQP), which can be linearized to an integer linear program (ILP)) reformulates protein design as an optimization over binary variables indicating rotamer selections, with linear constraints ensuring exactly one rotamer per site and compatibility between interacting residues. The objective minimizes the total energy, expressed as \min \sum_{k} \sum_{r_k} c_{k,r_k} x_{k,r_k} + \sum_{k<l} \sum_{r_k,r_l} e_{k,l,r_k,r_l} x_{k,r_k} x_{l,r_l}, where x_{k,r_k} are binary indicators and c, e are self and pairwise energies; LP relaxations provide bounds, and branch-and-cut solvers like Gurobi yield exact solutions. This approach handles continuous dihedral angles via mixed-integer extensions and has been applied to side-chain packing and sequence optimization, with cluster expansions accelerating large instances by approximating higher-order terms. ILP guarantees the GMEC for discrete models and scales via commercial solvers.
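The branch-and-bound scheme can be sketched on a small invented instance. The lower bound here sums the energy of the fixed partial assignment, the best-case self and assigned-unassigned pair energies for each open position, and the best-case pair energies among open positions, so it never overestimates and pruning is safe.

```python
# Branch-and-bound over rotamer assignments on a toy instance with
# invented energies; positions are assigned in list order.
rotamers = [["a1", "a2"], ["b1", "b2"], ["c1", "c2"]]
E_self = {"a1": 0.0, "a2": 1.0, "b1": 0.5, "b2": 0.0, "c1": 0.2, "c2": 0.0}
E_pair = {("a1", "b1"): -2.0, ("a2", "b2"): -0.5, ("b1", "c2"): -1.0}

def pair(r, s):
    return E_pair.get((r, s), E_pair.get((s, r), 0.0))

def assigned_energy(partial):
    e = sum(E_self[r] for r in partial)
    e += sum(pair(partial[i], partial[j])
             for i in range(len(partial)) for j in range(i + 1, len(partial)))
    return e

def lower_bound(partial):
    """Admissible bound: true subtree minimum is never below this value."""
    k = len(partial)
    lb = assigned_energy(partial)
    for pos in range(k, len(rotamers)):  # best case for each open position
        lb += min(E_self[s] + sum(pair(r, s) for r in partial)
                  for s in rotamers[pos])
    for i in range(k, len(rotamers)):    # best case among open positions
        for j in range(i + 1, len(rotamers)):
            lb += min(pair(s, t) for s in rotamers[i] for t in rotamers[j])
    return lb

best = [float("inf"), None]  # incumbent energy and assignment

def search(partial):
    if len(partial) == len(rotamers):
        e = assigned_energy(partial)
        if e < best[0]:
            best[0], best[1] = e, tuple(partial)
        return
    if lower_bound(partial) >= best[0]:
        return  # prune: this subtree cannot beat the incumbent
    for r in rotamers[len(partial)]:
        search(partial + [r])

search([])
print(best)  # global minimum assignment and its energy
```

Because the bound is admissible, the pruned search provably returns the same GMEC as exhaustive enumeration, typically after visiting far fewer nodes.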
Message-passing approximations, such as loopy belief propagation and max-product message passing, provide dual bounds to the LP relaxation of the protein design graphical model, enabling provable optimality gaps for the GMEC. These algorithms iteratively propagate marginal beliefs over rotamer variables along the interaction graph, converging to a stationary point that lower-bounds the minimum energy; the dual formulation ensures the bound is tight for tree-structured graphs and approximate otherwise. Tree-reweighted variants further tighten relaxations by reweighting messages to encourage consistency, while max-sum belief propagation solves the dual efficiently for partial assignments. In protein design, they integrate with BnB to guide pruning, offering guarantees on suboptimality when combined with exact solvers. These exact methods perform well for proteins under 100 residues, often solving instances with 10-20 mutable sites in seconds to minutes on modern hardware, and excel in symmetric or low-flexibility designs where the search space is tractable. For larger systems, exhaustive optimality remains challenging due to NP-hardness, but successes include designing symmetric oligomers and enzyme active sites with verified low-energy sequences.
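On a tree-structured model the propagated bound is tight, as noted above; a chain is the simplest tree, where a min-sum backward pass recovers the exact minimum energy. States and energies below are invented for illustration.

```python
from itertools import product

# Min-sum message passing on a chain-structured toy model: three
# positions, pairwise interactions only between neighbours.
states = [["A", "B"], ["A", "B"], ["A", "B"]]
unary = [{"A": 0.0, "B": 1.0}, {"A": 0.5, "B": 0.0}, {"A": 0.3, "B": 0.6}]

def pairwise(i, s, t):
    # Hypothetical neighbour energies: matching states are favourable.
    return -1.0 if s == t else 0.5

# Backward pass: msg[i][s] = min cost of positions i..end given state s at i.
msg = [dict() for _ in states]
for s in states[-1]:
    msg[-1][s] = unary[-1][s]
for i in range(len(states) - 2, -1, -1):
    for s in states[i]:
        msg[i][s] = unary[i][s] + min(
            pairwise(i, s, t) + msg[i + 1][t] for t in states[i + 1])

min_energy = min(msg[0].values())

# Brute-force check over all assignments confirms exactness on a chain.
def total(a):
    return (sum(unary[i][a[i]] for i in range(3))
            + sum(pairwise(i, a[i], a[i + 1]) for i in range(2)))

brute = min(total(a) for a in product(*states))
print(min_energy, brute)
```

On loopy interaction graphs the same message updates only yield a bound, which is why they are paired with branch-and-bound or exact solvers in design pipelines.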

Heuristic and AI-driven approaches

Heuristic approaches in protein design prioritize computational efficiency over exact optimality, employing stochastic or approximate inference techniques to navigate the vast sequence and conformation spaces. Monte Carlo methods, integrated into the Rosetta software suite, sample protein conformations and sequences by proposing random perturbations and accepting or rejecting them based on energy changes. Simulated annealing enhances this by incorporating a temperature parameter that decreases over iterations via predefined cooling schedules, allowing temporary acceptance of higher-energy states to escape local minima. The acceptance probability follows the Metropolis criterion, where a move with energy increase ΔE is accepted with probability exp(-ΔE / kT), with k as the Boltzmann constant and T as the current temperature. This approach has been foundational in Rosetta for both structure prediction and design tasks since the late 1990s. The FASTER algorithm represents an advanced heuristic for side-chain placement and sequence optimization in protein design, achieving rapid enumeration by iteratively pruning rotamer libraries to smaller, promising subsets while maintaining near-optimal energy scores. By relaxing only select positions during perturbations and using initial configurations that bias toward low-energy states, FASTER delivers up to two orders of magnitude speedup over traditional dead-end elimination or Monte Carlo methods, reducing computation from days to hours for complex designs. This enables practical application to multistate design problems, where sequences must satisfy multiple conformational states. Belief propagation offers another approximate inference strategy, modeling protein design as a probabilistic graphical model where variables represent amino acid choices and factors encode interaction energies. 
The algorithm performs iterative message passing between nodes to marginalize probabilities, converging to approximate optima for low-energy sequences without exhaustive enumeration. This method excels in capturing pairwise and higher-order dependencies, providing marginal amino acid probabilities that guide sequence selection in large systems. Modern AI-driven methods leverage deep learning for scalable protein design, particularly generative models like variational autoencoders (VAEs) and diffusion models that learn latent representations of protein structures and sequences from large datasets. ProteinMPNN, a message-passing neural network introduced in 2022, generates sequences conditioned on fixed backbones by autoregressively predicting residues from N- to C-terminus, incorporating structural features such as inter-residue distances and dihedral angles. Trained on over 19,000 Protein Data Bank structures and fine-tuned with structural noise for robustness, it achieves 52.4% native sequence recovery—superior to Rosetta's 32.9%—and designs functional proteins for monomers, oligomers, and interfaces, validated experimentally via crystallography and cryo-EM. Hallucination protocols extend these AI techniques to de novo backbone generation, using denoising diffusion models to sample novel folds from noise. RFdiffusion, built on RoseTTAFold as the denoising backbone, iteratively refines random residue frames over up to 200 steps, enabling topology-constrained design of unprecedented structures like TIM barrels and symmetric assemblies. It generates 100-residue proteins in seconds on consumer GPUs, outperforming prior hallucination methods in diversity and accuracy, with experimental validation of large oligomers up to 1,050 residues via negative-stain electron microscopy. 
Complementing this, the 2023 Chroma model integrates diffusion with graph neural networks for conditional generation, allowing user-specified constraints such as symmetry, shape, or natural-language prompts to produce novel protein complexes exceeding 3,000 residues in minutes on standard hardware. Recent advancements in AI, such as AlphaFold 3 introduced in 2024, have further enhanced de novo protein design by improving the accuracy of structure prediction for complexes, enabling the generation of novel structures and sequences that form custom protein-based molecular machines. AlphaFold 3's diffusion-based architecture supports joint prediction of biomolecular interactions, facilitating inverse design approaches where sequences are engineered for predefined folds with unprecedented precision. Frameworks like AlphaDesign, developed in 2025, integrate AlphaFold for hallucination-based de novo design, allowing the creation of diverse protein classes including monomers, oligomers, and site-specific binders with high generality and usability. These methods, validated through experimental structures, enable the design of proteins that surpass natural evolutionary constraints, supporting applications in custom enzymes and therapeutic agents. These heuristic and AI approaches yield speedups of over 1,000-fold relative to exact optimization methods like branch-and-bound, facilitating designs intractable for exhaustive search while recovering near-native sequences and folds. By 2025, they have enabled successes in engineering large protein assemblies, including modular self-assembling nanomaterials and symmetric nanoparticles validated by high-resolution structural biology, accelerating applications in therapeutics and materials. In 2024, advancements like AI frameworks incorporating experimental feedback have further improved design efficiency for applications in medicine and catalysis.
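The Metropolis-criterion simulated annealing described earlier for Rosetta-style sampling can be sketched as follows. The mismatch-counting energy function and target string are stand-ins invented for the example; real protocols score sequences with physics-based or learned potentials.

```python
import math
import random

# Metropolis-style simulated annealing over sequences on a toy energy
# function: energy = number of mismatches to a fixed target sequence.
random.seed(0)
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TARGET = "ACDEFGHIKL"  # hypothetical 10-residue target

def energy(seq):
    return sum(a != b for a, b in zip(seq, TARGET))

seq = [random.choice(AMINO_ACIDS) for _ in TARGET]
e = energy(seq)
temperature = 2.0
for step in range(5000):
    pos = random.randrange(len(seq))
    old = seq[pos]
    seq[pos] = random.choice(AMINO_ACIDS)  # propose a point mutation
    de = energy(seq) - e
    # Metropolis criterion: always accept downhill moves; accept uphill
    # moves with probability exp(-dE / T), letting the search escape
    # local minima while the temperature is high.
    if de <= 0 or random.random() < math.exp(-de / temperature):
        e += de
    else:
        seq[pos] = old  # reject the move
    temperature *= 0.999  # geometric cooling schedule

print("".join(seq), e)
```

As the temperature decays the sampler becomes effectively greedy, so late iterations refine rather than explore, mirroring the predefined cooling schedules used in Rosetta design runs.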

Applications

De novo and novel fold design

De novo protein design involves the computational creation of proteins with entirely novel structures that do not exist in nature, relying on principles of physics and biology to specify backbones and sequences from scratch. This approach contrasts with template-based methods by generating unprecedented folds, enabling the exploration of new topological space. Key strategies include scaffold design, where idealized structural motifs like beta-barrels are assembled into stable cores, and fold hallucinations, which use deep neural networks to generate diverse backbone conformations without relying on existing templates. A landmark example is Top7, a 93-residue α/β protein designed in 2003 with a novel fold unrelated to any natural protein, folding into its intended structure. Scaffold-based designs have produced functional beta-barrels, such as eight-stranded transmembrane variants that insert into lipid membranes and exhibit high thermal stability exceeding 50°C, confirmed by circular dichroism spectroscopy. More recent advances include de novo metalloproteins, like an expandable platform incorporating redox-active heme groups into novel folds for electron transfer applications. Validation of these designs typically involves biophysical characterization, with X-ray crystallography providing atomic-level confirmation; for instance, the Top7 structure matched its computational model with a root-mean-square deviation of 1.2 Å, and designed beta-barrels have shown near-perfect agreement to predicted backbones. Thermal denaturation experiments often reveal melting temperatures above 50°C, indicating robust folding in aqueous environments. These metrics underscore the fidelity of modern design tools in producing stable, novel architectures. Despite these successes, challenges persist, including variable experimental success rates for folding into intended structures due to inaccuracies in energy functions and sampling limitations. 
Integrating function into novel folds remains difficult, often requiring iterative refinement. By 2025, AI-driven methods have advanced applications, such as de novo mini-proteins designed as potent inhibitors of the MERS-CoV spike protein, achieving nanomolar binding affinities and protection in cell models. Additionally, recent de novo enzymes, like porphyrin-containing catalysts with stereoselective activity for carbon-carbon bond formation, highlight progress in functional novelty. These designs break evolution's limits by creating enzymes for industrial tasks that nature never had a reason to evolve, such as highly efficient carbon capture using de novo carbonic anhydrase enzymes.

Enzyme and catalyst engineering

Enzyme and catalyst engineering involves the computational and experimental creation of proteins that accelerate chemical reactions, often by precisely positioning catalytic residues to stabilize transition states. A key approach is theozyme placement, where an ideal catalytic motif—termed a theozyme—modeling the transition state geometry is docked into protein scaffolds to identify suitable backbones that can support the required interactions. This is followed by scaffold matching, an automated process that scans protein structures for backbone fragments compatible with the theozyme, ensuring geometric and energetic feasibility for catalysis. These methods enable de novo design of active sites in existing or novel folds, prioritizing electrostatic and hydrogen-bonding networks to lower activation barriers. Early successes demonstrated the viability of this paradigm with the design of Kemp eliminases in 2008, where theozyme-based placement into diverse scaffolds yielded enzymes catalyzing the Kemp elimination reaction—a proton abstraction and bond-breaking process—with k_cat values up to 700 min⁻¹ for the KE70 variant, marking a milestone in non-natural catalysis. Similarly, retro-aldolases designed that year used four distinct theozymes to break carbon-carbon bonds in a non-natural substrate, achieving detectable activity across 32 of 72 tested designs spanning multiple folds, with k_cat/K_M efficiencies reaching 10² M⁻¹ s⁻¹. These examples highlighted how scaffold matching can repurpose protein architectures for xenobiotic reactions, though initial efficiencies were modest compared to natural enzymes. To enhance performance, semi-rational strategies combine computational design with directed evolution, iteratively refining active sites through mutagenesis and selection. 
For instance, cytochrome P450 variants like CYP102A1 (P450BM3) have been engineered for selective oxidations of pharmaceuticals, where initial Rosetta-based designs predict substrate binding, followed by evolution yielding variants with >100-fold improved activity and k_cat/K_M >10³ M⁻¹ s⁻¹ for specific substrates like testosterone. This hybrid approach addresses design inaccuracies by leveraging evolutionary optimization for specificity and stability, as seen in variants achieving >90% enantioselectivity in sulfoxidation. Recent advances incorporate quantum mechanics/molecular mechanics (QM/MM) hybrids to refine energy functions, modeling bond rearrangement more accurately than classical methods alone; for example, QM/MM simulations have improved predictions of electrostatic contributions to barrier heights in Kemp eliminase active sites by 20-30%. In 2025, de novo luciferases designed via AI-guided theozyme placement and scaffold generation enabled multiplexed imaging, with neoLux variants exhibiting >10-fold brighter emission than prior designs and orthogonal substrate specificity. AI models further aid reaction prediction, forecasting catalytic motifs and efficiencies, as in generative frameworks that hallucinate sequences for uncharted reactions with >80% validation success in wet-lab tests. Recent applications include AI-designed enzymes for degrading plastics, enabling sustainable waste management by breaking down persistent pollutants like PET. Designer proteins also replace toxic chemical catalysts in manufacturing, promoting sustainable chemistry. These developments underscore ongoing progress toward enzymes rivaling natural catalysts in rate and selectivity.

Therapeutic and binding proteins

Protein design for therapeutic and binding applications focuses on proteins that recognize and interact with specific molecular targets, such as pathogens, cancer cells, or disease-related proteins, to enable detection, neutralization, or immune modulation. These designs prioritize high-affinity binding while minimizing off-target effects, often leveraging computational methods to optimize protein-protein interfaces. Key targets include viral receptors, tumor antigens, and signaling molecules, where binders serve as inhibitors, diagnostic tools, or components in immunotherapies. Interface design in therapeutic proteins emphasizes hotspot residues—specific residues that contribute disproportionately to binding energy in protein-protein interactions—to create stable, high-affinity complexes. By computationally identifying and optimizing these hotspots, designers can sculpt interfaces that mimic natural interactions but with enhanced stability or novel scaffolds. For antibody engineering, CDR (complementarity-determining region) grafting transfers the antigen-binding loops from a non-human antibody onto a human framework to reduce immunogenicity while preserving specificity; this method has been refined computationally to select optimal framework matches that maintain CDR conformation. Binding affinity is evaluated using protocols like RosettaΔΔG, which estimates the change in binding free energy (ΔΔG) upon mutation by sampling conformational ensembles and scoring interactions; designs achieving ΔΔG < -2 kcal/mol indicate significant affinity improvements suitable for therapeutic use. A simplified approximation for binding free energy in these models is \Delta G_{\text{bind}} \approx \Delta E_{\text{vdw}} + \Delta E_{\text{ele}}, where van der Waals (\Delta E_{\text{vdw}}) and electrostatic (\Delta E_{\text{ele}}) terms dominate interface energetics, though full protocols incorporate solvation and entropy.
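A minimal sketch of the ΔG_bind ≈ ΔE_vdw + ΔE_ele approximation, using a 12-6 Lennard-Jones term for van der Waals contacts and a distance-scaled Coulomb term for electrostatics. The atom contacts, charges, and parameters below are invented for illustration and do not correspond to any real force field.

```python
import math

def lj(r, epsilon=0.2, sigma=3.5):
    """12-6 Lennard-Jones term (kcal/mol); distance r in angstroms."""
    x = (sigma / r) ** 6
    return 4 * epsilon * (x * x - x)

def coulomb(r, q1, q2, dielectric=4.0):
    """Screened Coulomb term; 332 converts e^2/angstrom to kcal/mol."""
    return 332.0 * q1 * q2 / (dielectric * r)

# Hypothetical interface contacts: (distance in A, charge1, charge2).
contacts = [(3.8, 0.0, 0.0), (4.2, 0.0, 0.0),
            (3.6, 0.4, -0.4), (5.1, -0.3, 0.25)]

dE_vdw = sum(lj(r) for r, _, _ in contacts)
dE_ele = sum(coulomb(r, q1, q2) for r, q1, q2 in contacts)
dG_bind = dE_vdw + dE_ele
print(f"vdW {dE_vdw:.2f}  ele {dE_ele:.2f}  total {dG_bind:.2f} kcal/mol")
```

Both terms come out favorable (negative) for this hypothetical interface; a full protocol would add solvation and entropy corrections before ranking candidate binders.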
To ensure specificity and avoid off-target binding, negative design incorporates constraints that penalize interactions with non-target proteins, such as by disfavoring homodimerization or cross-reactivity in computational scoring. Exemplary applications include de novo miniprotein binders to the SARS-CoV-2 spike protein receptor-binding domain, designed in 2020 with picomolar affinities (e.g., <1 nM dissociation constants) that block viral entry by competing with the ACE2 receptor. Computationally designed inhibitors, such as those targeting amyloid aggregation in Alzheimer's disease, demonstrate how interface optimization yields stable complexes that halt pathogenic protein misfolding. Recent advances incorporate AI-driven methods, like RFdiffusion, to design affimer-like non-antibody scaffolds with tailored specificity for therapeutic targets, expanding beyond traditional antibodies. Recent AI-driven designs include ultra-targeted cancer therapeutics, such as de novo proteins that enhance T-cell targeting of tumor cells with high specificity using multi-target approaches. Clinically, bispecific antibodies have seen FDA approvals in 2025, including linvoseltamab (Lynozyfic) for relapsed multiple myeloma, enhancing T-cell redirection with engineered affinities. In CAR-T therapies, designed protein binders boost antitumor activity by improving antigen recognition and reducing exhaustion, as shown in constructs targeting glioblastoma antigens like EGFR and CD276, where computational optimization yields >100-fold specificity gains. These developments underscore protein design's role in advancing precision medicine, with ongoing refinements addressing stability and manufacturability.

Materials and non-biomedical uses

Protein design has enabled the creation of self-assembling nanostructures for applications in nanotechnology and biomaterials, where precise control over assembly pathways yields materials with tailored geometries and functions. One prominent example involves the computational design of icosahedral protein shells, such as those reported in 2021, which utilize symmetric arrangements of protein subunits to form closed polyhedral cages up to 120 subunits in size, exhibiting high stability and potential for cargo encapsulation. These designs leverage hierarchical assembly to minimize off-pathway aggregates, facilitating scalable production for non-biological uses like nanoscale reactors. Similarly, amyloid-like fibrils have been engineered de novo from combinatorial libraries, forming stable β-sheet structures that mimic natural amyloids but with customizable lengths and mechanical properties for use in composite materials. Such fibrils, derived from food proteins or synthetic peptides, provide exceptional resistance to denaturation, enabling their integration into durable scaffolds for environmental or industrial applications. Beyond structural designs, specific proteins have been tailored for sensing and material fabrication. De novo luciferases, designed using deep learning in 2023, offer compact, stable enzymes that emit bright bioluminescence in response to substrates, serving as components in industrial biosensors for real-time monitoring of chemical processes. These proteins, as small as 117 amino acids, outperform natural counterparts in stability under harsh conditions, making them suitable for non-medical detection systems. In textile applications, silk-inspired proteins engineered via AI-driven methods replicate the hierarchical β-sheet and amorphous domains of natural spider silk, yielding fibers with high tensile strength for sustainable fabrics.
These recombinant silks, produced from microbial hosts, exhibit biocompatibility and biodegradability, addressing demands for eco-friendly alternatives in manufacturing. Designed protein materials often feature tunable mechanical properties, with Young's moduli ranging from 1 to 10 GPa achieved through optimization of secondary structures and interfaces, allowing customization for load-bearing applications like structural composites. For instance, engineered protein fibers can reach moduli of approximately 4.9 GPa while maintaining elasticity, surpassing many synthetic polymers. Additionally, responsiveness to environmental stimuli enhances functionality; pH-sensitive helical bundles undergo reversible assembly-disassembly at physiological ranges, enabling adaptive materials for sensing. Light-responsive protein hydrogels, incorporating photo-switchable domains, transition between liquid and solid states upon irradiation, facilitating on-demand reshaping. In industrial contexts, protein design optimizes enzymes for production, such as cellulases engineered for enhanced hydrolysis of biomass into fermentable sugars, improving efficiency in conversion pathways. These modifications improve substrate binding and boost activity under high-temperature conditions typical of biorefineries. Designed proteins further support purification technologies, with computationally optimized helical bundles forming selective channels in lipid bilayers to facilitate solute separation in membrane systems. Recent advances as of 2025 include scaffolds designed for non-therapeutic applications, such as modular protein assemblies that serve as robust platforms for multivalent display in catalytic or sensing arrays, leveraging symmetry for precise geometry control.
Computational approaches have also enabled responsive hydrogels, where de novo proteins with programmable interactions form networks that swell or stiffen in response to stimuli, filling gaps in dynamic materials for industrial encapsulation or delivery of non-biological agents. In 2025, de novo designed enzymes have been developed for degrading persistent pollutants like PFAS, supporting environmental remediation. These developments underscore the versatility of protein design in creating sustainable, high-performance materials outside biomedical domains.
