Protein primary structure
from Wikipedia
This diagram of protein structure uses PCNA as an example (PDB: 1AXC).

Protein primary structure is the linear sequence of amino acids in a peptide or protein.[1] By convention, the primary structure of a protein is reported starting from the amino-terminal (N) end to the carboxyl-terminal (C) end. Protein biosynthesis is most commonly performed by ribosomes in cells. Peptides can also be synthesized in the laboratory. Protein primary structures can be directly sequenced, or inferred from DNA sequences.

Formation

Biological

Amino acids are polymerised via peptide bonds to form a long backbone, with the different amino acid side chains protruding along it. In biological systems, proteins are produced during translation by a cell's ribosomes. Some organisms can also make short peptides by non-ribosomal peptide synthesis, which often uses amino acids other than the encoded 22, and the resulting peptides may be cyclised, modified and cross-linked.

Chemical

Peptides can be synthesised chemically via a range of laboratory methods. Chemical methods typically synthesise peptides in the opposite order (starting at the C-terminus) to biological protein synthesis (starting at the N-terminus).

Notation

Protein sequence is typically notated as a string of letters, listing the amino acids from the amino-terminal end through to the carboxyl-terminal end. Either a three-letter code or a single-letter code can be used to represent the 22 naturally encoded amino acids, as well as mixtures or ambiguous amino acids (similar to nucleic acid notation).[1][2][3]

Peptides can be directly sequenced, or inferred from DNA sequences. Large sequence databases now exist that collate known protein sequences.

22 natural amino acid notation
Amino Acid 3-Letter[4] 1-Letter[4]
Alanine Ala A
Arginine Arg R
Asparagine Asn N
Aspartic acid Asp D
Cysteine Cys C
Glutamic acid Glu E
Glutamine Gln Q
Glycine Gly G
Histidine His H
Isoleucine Ile I
Leucine Leu L
Lysine Lys K
Methionine Met M
Phenylalanine Phe F
Proline Pro P
Pyrrolysine Pyl O
Selenocysteine Sec U
Serine Ser S
Threonine Thr T
Tryptophan Trp W
Tyrosine Tyr Y
Valine Val V
Ambiguous amino acid notation
Symbol Description Residues represented
X Any amino acid, or unknown All
B Aspartate or Asparagine D, N
Z Glutamate or Glutamine E, Q
J Leucine or Isoleucine I, L
Φ Hydrophobic V, I, L, F, W, M
Ω Aromatic F, W, Y, H
Ψ Aliphatic V, I, L, M
π Small P, G, A, S
ζ Hydrophilic S, T, H, N, Q, E, D, K, R, Y
+ Positively charged K, R, H
- Negatively charged D, E
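
The two tables above map directly onto lookup dictionaries. The following is a minimal Python sketch, not a standard library API: the function names `to_three_letter` and `matches` are hypothetical, and only the ambiguity codes X, B, Z and J are handled (the Greek-letter and charge classes are omitted for brevity).

```python
# One-letter to three-letter codes for the 22 encoded amino acids (from the table above).
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "E": "Glu", "Q": "Gln", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "O": "Pyl", "U": "Sec", "S": "Ser", "T": "Thr", "W": "Trp",
    "Y": "Tyr", "V": "Val",
}

# Subset of the ambiguity codes: symbol -> residues it may stand for.
AMBIGUOUS = {"B": "DN", "Z": "EQ", "J": "IL"}


def to_three_letter(seq, sep="-"):
    """Convert a one-letter sequence (N- to C-terminus) to three-letter notation."""
    return sep.join(THREE_LETTER[aa] for aa in seq)


def matches(pattern, seq):
    """Return True if seq is consistent with a pattern that may contain X, B, Z or J."""
    if len(pattern) != len(seq):
        return False
    for symbol, residue in zip(pattern, seq):
        if symbol == "X":
            allowed = THREE_LETTER.keys()      # any amino acid
        else:
            allowed = AMBIGUOUS.get(symbol, symbol)
        if residue not in allowed:
            return False
    return True


print(to_three_letter("MKU"))   # Met-Lys-Sec
print(matches("BXJ", "NWL"))    # True
```

Keeping the ambiguity classes as plain strings of allowed residues makes it easy to extend the dictionary with the remaining symbols if they are needed.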

Modification

In general, polypeptides are unbranched polymers, so their primary structure can often be specified by the sequence of amino acids along their backbone. However, proteins can become cross-linked, most commonly by disulfide bonds, and the primary structure also requires specifying the cross-linking atoms, e.g., specifying the cysteines involved in the protein's disulfide bonds. Other crosslinks include desmosine.

Isomerisation

The chiral centers of a polypeptide chain can undergo racemization. Although this does not change the sequence, it does affect the chemical properties of the chain. In particular, the L-amino acids normally found in proteins can spontaneously isomerize at the Cα atom to form D-amino acids, which cannot be cleaved by most proteases. Additionally, the peptide bond preceding proline can adopt a stable cis-isomer, whereas most other peptide bonds are almost exclusively trans.

Post-translational modification

Additionally, the protein can undergo a variety of post-translational modifications, which are briefly summarized here.

The N-terminal amino group of a polypeptide can be modified covalently, e.g.,

Fig. 1 N-terminal acetylation
  • acetylation
The positive charge on the N-terminal amino group may be eliminated by changing it to an acetyl group (N-terminal blocking).
  • formylation
The N-terminal methionine usually found after translation has an N-terminus blocked with a formyl group. This formyl group (and sometimes the methionine residue itself, if followed by Gly or Ser) is removed by the enzyme deformylase.
  • pyroglutamate
Fig. 2 Formation of pyroglutamate from an N-terminal glutamine
An N-terminal glutamine can attack itself, forming a cyclic pyroglutamate group.
  • myristoylation
Similar to acetylation. Instead of a simple methyl group, the myristoyl group has a tail of 14 hydrophobic carbons, which make it ideal for anchoring proteins to cellular membranes.

The C-terminal carboxylate group of a polypeptide can also be modified, e.g.,

Fig. 3 C-terminal amidation
  • amidation (see Fig. 3)
The C-terminus can also be blocked (thus neutralizing its negative charge) by amidation.
  • glycosyl phosphatidylinositol (GPI) attachment
Glycosyl phosphatidylinositol (GPI) is a large, hydrophobic phospholipid prosthetic group that anchors proteins to cellular membranes. It is attached to the polypeptide C-terminus through an amide linkage that then connects to ethanolamine, thence to sundry sugars and finally to the phosphatidylinositol lipid moiety.

Finally, the peptide side chains can also be modified covalently, e.g.,

  • phosphorylation
Aside from cleavage, phosphorylation is perhaps the most important chemical modification of proteins. A phosphate group can be attached to the sidechain hydroxyl group of serine, threonine and tyrosine residues, adding a negative charge at that site and producing an unnatural amino acid. Such reactions are catalyzed by kinases and the reverse reaction is catalyzed by phosphatases. The phosphorylated tyrosines are often used as "handles" by which proteins can bind to one another, whereas phosphorylation of Ser/Thr often induces conformational changes, presumably because of the introduced negative charge. The effects of phosphorylating Ser/Thr can sometimes be simulated by mutating the Ser/Thr residue to glutamate.
  • glycosylation
A catch-all name for a set of very common and very heterogeneous chemical modifications. Sugar moieties can be attached to the sidechain hydroxyl groups of Ser/Thr or to the sidechain amide groups of Asn. Such attachments can serve many functions, ranging from increasing solubility to complex recognition. N-linked glycosylation can be blocked with certain inhibitors, such as tunicamycin.
  • deamidation (succinimide formation)
In this modification, an asparagine or aspartate side chain attacks the following peptide bond, forming a symmetrical succinimide intermediate. Hydrolysis of the intermediate produces either aspartate or the β-amino acid, iso(Asp). For asparagine, either product results in the loss of the amide group, hence "deamidation".
  • hydroxylation
Proline residues may be hydroxylated at either of two atoms, as can lysine (at one atom). Hydroxyproline is a critical component of collagen, which becomes unstable upon its loss. The hydroxylation reaction is catalyzed by an enzyme that requires ascorbic acid (vitamin C), deficiencies in which lead to many connective-tissue diseases such as scurvy.
  • methylation
Several protein residues can be methylated, most notably the positive groups of lysine and arginine. Arginine residues interact with the nucleic acid phosphate backbone and commonly form hydrogen bonds with the base residues, particularly guanine, in protein–DNA complexes. Lysine residues can be singly, doubly and even triply methylated. Methylation does not alter the positive charge on the side chain, however.
  • acetylation
Acetylation of the lysine amino groups is chemically analogous to the acetylation of the N-terminus. Functionally, however, the acetylation of lysine residues is used to regulate the binding of proteins to nucleic acids. The cancellation of the positive charge on the lysine weakens the electrostatic attraction for the (negatively charged) nucleic acids.
  • sulfation
Tyrosines may become sulfated on their Oη atom (the phenolic oxygen). Somewhat unusually, this modification occurs in the Golgi apparatus, not in the endoplasmic reticulum. Similar to phosphorylated tyrosines, sulfated tyrosines are used for specific recognition, e.g., in chemokine receptors on the cell surface. As with phosphorylation, sulfation adds a negative charge to a previously neutral site.
  • prenylation and palmitoylation
The hydrophobic isoprene (e.g., farnesyl, geranyl, and geranylgeranyl groups) and palmitoyl groups may be added to the Sγ atom of cysteine residues to anchor proteins to cellular membranes. Unlike the GPI and myristoyl anchors, these groups are not necessarily added at the termini.
  • carboxylation
A relatively rare modification that adds an extra carboxylate group (and, hence, a double negative charge) to a glutamate side chain, producing a Gla residue. This is used to strengthen the binding to "hard" metal ions such as calcium.
  • ADP-ribosylation
The large ADP-ribosyl group can be transferred to several types of side chains within proteins, with heterogeneous effects. This modification is a target for the powerful toxins of disparate bacteria, e.g., Vibrio cholerae, Corynebacterium diphtheriae and Bordetella pertussis.
  • ubiquitination and SUMOylation
Various full-length, folded proteins can be attached at their C-termini to the sidechain ammonium groups of lysines of other proteins. Ubiquitin is the most common of these, and usually signals that the ubiquitin-tagged protein should be degraded.

Most of the polypeptide modifications listed above occur post-translationally, i.e., after the protein has been synthesized on the ribosome, typically occurring in the endoplasmic reticulum, a subcellular organelle of the eukaryotic cell.

Many other chemical reactions (e.g., cyanylation) have been applied to proteins by chemists, although they are not found in biological systems.

Cleavage and ligation

In addition to those listed above, the most important modification of primary structure is peptide cleavage (by chemical hydrolysis or by proteases). Proteins are often synthesized in an inactive precursor form; typically, an N-terminal or C-terminal segment blocks the active site of the protein, inhibiting its function. The protein is activated by cleaving off the inhibitory peptide.

Some proteins can even cleave themselves. Typically, the hydroxyl group of a serine (rarely, threonine) or the thiol group of a cysteine residue attacks the carbonyl carbon of the preceding peptide bond, forming a tetrahedrally bonded intermediate [classified as a hydroxyoxazolidine (Ser/Thr) or hydroxythiazolidine (Cys) intermediate]. This intermediate tends to revert to the amide form, expelling the attacking group, since the amide form is usually favored by free energy (presumably due to the strong resonance stabilization of the peptide group). However, additional molecular interactions may render the amide form less stable; the amino group is then expelled instead, resulting in an ester (Ser/Thr) or thioester (Cys) bond in place of the peptide bond. This chemical reaction is called an N-O acyl shift.

The ester/thioester bond can be resolved in several ways:

  • Simple hydrolysis will split the polypeptide chain, where the displaced amino group becomes the new N-terminus. This is seen in the maturation of glycosylasparaginase.
  • A β-elimination reaction also splits the chain, but results in a pyruvoyl group at the new N-terminus. This pyruvoyl group may be used as a covalently attached catalytic cofactor in some enzymes, especially decarboxylases such as S-adenosylmethionine decarboxylase (SAMDC) that exploit the electron-withdrawing power of the pyruvoyl group.
  • Intramolecular transesterification, resulting in a branched polypeptide. In inteins, the new ester bond is broken by an intramolecular attack by the soon-to-be C-terminal asparagine.
  • Intermolecular transesterification can transfer a whole segment from one polypeptide to another, as is seen in the Hedgehog protein autoprocessing.

History

The proposal that proteins were linear chains of α-amino acids was made nearly simultaneously by two scientists at the same conference in 1902, the 74th meeting of the Society of German Scientists and Physicians, held in Karlsbad. Franz Hofmeister made the proposal in the morning, based on his observations of the biuret reaction in proteins. Hofmeister was followed a few hours later by Emil Fischer, who had amassed a wealth of chemical details supporting the peptide-bond model. For completeness, the proposal that proteins contained amide linkages was made as early as 1882 by the French chemist E. Grimaux.[5]

Despite these data and later evidence that proteolytically digested proteins yielded only oligopeptides, the idea that proteins were linear, unbranched polymers of amino acids was not accepted immediately. Some scientists such as William Astbury doubted that covalent bonds were strong enough to hold such long molecules together; they feared that thermal agitations would shake such long molecules asunder. Hermann Staudinger faced similar prejudices in the 1920s when he argued that rubber was composed of macromolecules.[5]

Thus, several alternative hypotheses arose. The colloidal protein hypothesis stated that proteins were colloidal assemblies of smaller molecules. This hypothesis was disproved in the 1920s by ultracentrifugation measurements by Theodor Svedberg that showed that proteins had a well-defined, reproducible molecular weight and by electrophoretic measurements by Arne Tiselius that indicated that proteins were single molecules. A second hypothesis, the cyclol hypothesis advanced by Dorothy Wrinch, proposed that the linear polypeptide underwent a chemical cyclol rearrangement, C=O + HN → C(OH)-N, that crosslinked its backbone amide groups, forming a two-dimensional fabric. Other primary structures of proteins were proposed by various researchers, such as the diketopiperazine model of Emil Abderhalden and the pyrrol/piperidine model of Troensegaard in 1942. Although never given much credence, these alternative models were finally disproved when Frederick Sanger successfully sequenced insulin and by the crystallographic determination of myoglobin and hemoglobin by Max Perutz and John Kendrew.

Relation to secondary and tertiary structure

The primary structure of a biological polymer to a large extent determines the three-dimensional shape (tertiary structure). Protein sequence can be used to predict local features, such as segments of secondary structure, or trans-membrane regions. However, the complexity of protein folding currently prohibits predicting the tertiary structure of a protein from its sequence alone. Knowing the structure of a similar homologous sequence (for example a member of the same protein family) allows highly accurate prediction of the tertiary structure by homology modeling. If the full-length protein sequence is available, it is possible to estimate its general biophysical properties, such as its isoelectric point.
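
As an illustration of estimating a biophysical property from sequence alone, the sketch below computes an approximate isoelectric point by finding the pH at which the Henderson-Hasselbalch net charge is zero. The pKa values are rough textbook-style numbers chosen here as assumptions (published scales differ), and the function names are hypothetical.

```python
# Approximate side-chain and terminal pKa values (assumed; published scales vary).
POSITIVE_PKA = {"K": 10.5, "R": 12.5, "H": 6.0}
NEGATIVE_PKA = {"D": 3.9, "E": 4.3, "C": 8.3, "Y": 10.1}
N_TERM_PKA = 9.0
C_TERM_PKA = 2.0


def net_charge(seq, ph):
    """Net charge of a peptide at a given pH, treating each group independently."""
    charge = 1.0 / (1.0 + 10 ** (ph - N_TERM_PKA))    # protonated N-terminus
    charge -= 1.0 / (1.0 + 10 ** (C_TERM_PKA - ph))   # deprotonated C-terminus
    for aa in seq:
        if aa in POSITIVE_PKA:
            charge += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[aa]))
        elif aa in NEGATIVE_PKA:
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[aa] - ph))
    return charge


def isoelectric_point(seq, low=0.0, high=14.0, tol=1e-4):
    """Bisection on pH: net charge falls monotonically as pH rises."""
    while high - low > tol:
        mid = (low + high) / 2.0
        if net_charge(seq, mid) > 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


for peptide in ("KKKK", "DDDD", "ACDEFGHIK"):
    print(peptide, round(isoelectric_point(peptide), 2))
```

Bisection is used because the net charge is a monotonically decreasing function of pH, so a single zero crossing exists between pH 0 and 14. The model ignores neighbouring-group effects and folding, so the result is only a first estimate.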

from Grokipedia
The primary structure of a protein is the linear sequence of amino acids linked by peptide bonds to form a polypeptide chain. The sequence is written from the amino-terminal (N-terminus) end to the carboxyl-terminal (C-terminus) end and is built from the 20 standard amino acids, each contributing a distinct side chain that influences the protein's properties. The primary structure is the foundational level of protein organization, encoding the information the protein needs to fold into its functional three-dimensional form.

The primary structure is critical because it determines the higher levels of protein organization, including secondary, tertiary, and quaternary structures, through interactions among amino acid side chains such as hydrogen bonds, ionic bonds, hydrophobic effects, and disulfide bridges. These interactions dictate the protein's overall conformation, stability, and biological function, enabling roles in processes like enzymatic catalysis, molecular recognition, and cellular signaling. Even a single amino acid substitution in the primary sequence can disrupt folding and lead to loss of function or pathological conditions, as seen in genetic mutations.

In living organisms, the primary structure is established during protein synthesis through transcription of DNA into messenger RNA (mRNA) and subsequent translation at ribosomes, where transfer RNA (tRNA) molecules deliver specific amino acids according to the mRNA codon sequence. Historically, primary structures were determined using methods like Edman degradation for sequential amino acid identification, but modern techniques rely on mass spectrometry and on automated DNA sequencing, which infers protein sequences from the corresponding genes. This level of structure is unique to each protein and is often conserved across species for homologous proteins, underscoring its evolutionary and functional significance.

Definition and Fundamentals

Definition

The primary structure of a protein refers to the linear sequence of amino acids covalently linked by peptide bonds to form a polypeptide chain. This sequence is conventionally described from the amino terminus (N-terminus), where the free amino group is located, to the carboxyl terminus (C-terminus), where the free carboxyl group resides. The key components of this structure are the 20 standard amino acids encoded by the genetic code, which are joined through their alpha-amino and alpha-carboxyl groups via peptide bonds, resulting in an unbranched chain unless post-translational modifications occur. These bonds are amide linkages formed by dehydration synthesis, creating a rigid, planar backbone that defines the one-dimensional nature of the primary structure. Unlike higher levels of protein organization, the primary structure represents the simplest, sequential arrangement, without considering spatial folding, hydrogen bonding patterns, or non-covalent interactions that give rise to secondary, tertiary, or quaternary structures. For instance, in the hormone insulin, the primary structure consists of two distinct polypeptide chains, A (21 residues) and B (30 residues), that emerge as separate sequences after enzymatic cleavage of a precursor protein, proinsulin, and that are connected by disulfide bonds in the mature form.

Biological Importance

The primary structure of a protein, defined by its linear sequence of amino acids, is fundamental to its biological function because it determines the higher-order folding and thus the precise three-dimensional arrangement necessary for activity. Specific sequences enable the formation of active sites in enzymes, where catalytic residues interact with substrates to facilitate reactions, while also influencing binding affinities for ligands, cofactors, or other molecules through complementary physicochemical properties of side chains. For instance, the arrangement of polar, nonpolar, acidic, or basic amino acids in the primary sequence dictates interactions that stabilize secondary structures like alpha helices or beta sheets, ultimately positioning residues for enzymatic catalysis or molecular recognition.

The vast combinatorial diversity arising from the 20 standard amino acids allows for an enormous repertoire of proteins, far exceeding the needs of any organism and underpinning biological diversity. For a typical protein of 100 residues, the theoretical number of possible sequences is 20^100, which exceeds 10^130, enabling the evolution of specialized functions tailored to diverse cellular environments. This diversity provides the raw material for evolution, where mutations (such as single nucleotide polymorphisms or insertions/deletions) alter the primary structure, potentially conferring adaptive advantages like enhanced stability or novel binding properties in response to environmental pressures.

Alterations in primary structure due to mutations can also lead to pathological conditions by disrupting normal protein function. A classic example is sickle cell anemia, caused by a single point mutation in the β-globin gene that substitutes glutamic acid (Glu) with valine (Val) at the sixth position of the hemoglobin β-chain, resulting in abnormal hemoglobin polymerization, red blood cell deformation, and impaired oxygen transport. Such changes highlight how even minor sequence variations can cascade into severe diseases, emphasizing the primary structure's role in maintaining physiological homeostasis.
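
The combinatorial estimate and the sickle cell substitution can both be checked with a few lines of Python. This is an illustrative sketch: the helper names are hypothetical, and the eight-residue string is the N-terminal segment of the mature β-globin chain, used here only to show the Glu-to-Val change at position 6.

```python
def sequence_space(length, alphabet_size=20):
    """Number of distinct sequences of a given length over the amino acid alphabet."""
    return alphabet_size ** length


def substitute(seq, position, new_residue):
    """Return seq with the residue at a 1-based position replaced."""
    index = position - 1
    return seq[:index] + new_residue + seq[index + 1:]


count = sequence_space(100)
print(len(str(count)) - 1)          # order of magnitude of 20**100 -> 130

beta_globin_start = "VHLTPEEK"      # first eight residues of mature beta-globin
sickle_variant = substitute(beta_globin_start, 6, "V")
print(beta_globin_start, "->", sickle_variant)   # VHLTPEEK -> VHLTPVEK
```

Python integers have arbitrary precision, so `20 ** 100` is computed exactly rather than as a floating-point approximation.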

Synthesis

Biological Synthesis

The biological synthesis of a protein's primary structure occurs through the process of translation, in which the nucleotide sequence of messenger RNA (mRNA) is decoded by ribosomes to assemble a linear polypeptide chain from amino acids. Ribosomes, composed of ribosomal RNA (rRNA) and proteins, serve as the molecular machines that carry out this decoding, while transfer RNA (tRNA) molecules act as adaptors, each carrying a specific amino acid and bearing an anticodon that base-pairs with complementary codons on the mRNA. This codon-anticodon recognition ensures that the sequence of three-nucleotide codons in the mRNA directly dictates the order of amino acids in the protein, establishing the primary structure with high precision.

The genetic code underlying this process is a non-overlapping triplet code, in which successive groups of three nucleotides (codons) in the mRNA are read sequentially without overlap, each specifying one of the 20 standard amino acids or serving as a signal for termination. This code exhibits degeneracy, meaning that most amino acids are encoded by multiple synonymous codons (up to six for some, such as leucine), which provides redundancy and robustness against certain mutations. The start codon AUG initiates translation by coding for N-formylmethionine in prokaryotes or methionine in eukaryotes, while the code's triplet nature was experimentally established through frame-shift and decoding studies in the 1960s.

Translation proceeds in three main phases: initiation, elongation, and termination. During initiation, the small ribosomal subunit binds to the mRNA at the 5' cap (in eukaryotes) or Shine-Dalgarno sequence (in prokaryotes), locates the AUG start codon, and assembles with the large subunit and initiator tRNA to form the 70S (prokaryotes) or 80S (eukaryotes) initiation complex, aided by initiation factors such as eIF2 in eukaryotes. Elongation follows, with the ribosome's peptidyl (P) site holding the growing chain and the aminoacyl (A) site accepting the next cognate aminoacyl-tRNA; peptide bond formation occurs via the ribosome's peptidyl transferase activity, transferring the nascent chain to the new aminoacyl-tRNA, after which elongation factor-driven translocation moves the ribosome three nucleotides along the mRNA, ejecting the deacylated tRNA from the exit (E) site. Termination is triggered upon arrival of a stop codon (UAA, UAG, or UGA) in the A site, which is recognized by release factors (e.g., RF1/RF2 in prokaryotes or eRF1 in eukaryotes), leading to hydrolytic release of the completed polypeptide from the tRNA and dissociation of the ribosomal subunits.

To maintain the fidelity of primary structure formation, several mechanisms ensure accurate codon decoding and amino acid incorporation, with overall error rates held to approximately 1 in 10^4 codons. Aminoacyl-tRNA synthetases (aaRSs) play a central role by catalyzing the specific attachment of amino acids to their cognate tRNAs, achieving initial specificity through substrate recognition but relying on proofreading (editing) domains to hydrolyze misactivated amino acids or misacylated tRNAs, reducing error rates from a potential 1 in 200 misactivations to 1 in 10^4 or lower. Additional checks occur at the ribosome, including induced-fit conformational changes that discriminate against near-cognate tRNAs and kinetic proofreading during GTP hydrolysis by elongation factors, collectively minimizing mistranslation that could disrupt protein function.
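
The decoding logic described above (start at AUG, read non-overlapping triplets, stop at UAA, UAG, or UGA) can be sketched in Python. This is a simplified illustration under stated assumptions: it uses the standard genetic code only, takes the first AUG as the start codon, and ignores initiation context, selenocysteine recoding, and any downstream processing. The function name `translate` is hypothetical.

```python
from itertools import product

# Standard genetic code, laid out in the usual U, C, A, G order for each codon position.
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"   # codons starting with U
    "LLLLPPPPHHQQRRRR"   # codons starting with C
    "IIIMTTTTNNKKSSRR"   # codons starting with A
    "VVVVAAAADDEEGGGG"   # codons starting with G
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate from the first AUG to the first in-frame stop codon ('*')."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("GGAUGGCCUUUAAAUGA"))   # MAFK

# Degeneracy: count how many codons encode leucine.
print(sum(1 for aa in CODON_TABLE.values() if aa == "L"))   # 6
```

Building the table from a 64-character string keeps the example short; the string follows the conventional codon-table ordering, so each block of 16 letters corresponds to one first-position base.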

Chemical Synthesis

Chemical synthesis of protein primary structures enables the laboratory assembly of polypeptides with defined sequences, distinct from biological processes. The cornerstone method is solid-phase peptide synthesis (SPPS), introduced by Robert Bruce Merrifield in 1963, which builds peptide chains stepwise while they are anchored to an insoluble resin support. This approach allows for automated synthesis, where amino acids are added sequentially from the C-terminus to the N-terminus, enabling precise control over the primary structure.

In SPPS, protected amino acids, typically with N-terminal Boc or Fmoc groups and side-chain protections, are employed to prevent unwanted reactions. The process involves iterative cycles of activation, coupling, and deprotection. Activation converts the carboxyl group of the incoming amino acid into a reactive species, often using carbodiimides such as dicyclohexylcarbodiimide (DCC) to form an O-acylisourea intermediate, which promotes efficient amide bond formation. Coupling attaches this activated amino acid to the free N-terminal amine of the resin-bound peptide chain, typically achieving per-step yields exceeding 99% under optimized conditions. Deprotection then removes the N-terminal protecting group (e.g., via acid treatment for Boc or base for Fmoc), exposing the amine for the next cycle, while the resin allows byproducts to be separated easily by filtration. Upon completion, the peptide is cleaved from the resin (e.g., using hydrogen fluoride for Boc chemistry) and purified, commonly by reversed-phase high-performance liquid chromatography (HPLC), to isolate the target sequence with high purity.

Despite its efficiency, SPPS has practical limitations. The cumulative effect of incomplete couplings leads to a practical length limit of about 50-100 residues, beyond which overall yields drop significantly due to side reactions and aggregation on the resin. Racemization, the partial conversion of L-amino acids to D-isomers during activation and coupling, poses another risk, particularly with certain residues like cysteine or serine, so reagents and conditions must be chosen carefully to keep the loss of stereochemical integrity below 1%.

SPPS has transformative applications in producing therapeutic peptides with custom primary structures. For instance, oxytocin, a nonapeptide hormone, was among the early biologically active peptides made by SPPS, demonstrating the method's viability for molecules now used in clinical settings for labor induction and postpartum hemorrhage treatment. This capability has expanded to over 100 FDA-approved drugs as of 2024, underscoring SPPS's role in pharmaceutical development.
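
The length limit follows from simple compounding of per-step yields. The sketch below is a deliberately simplified model, assuming a single fixed yield per coupling cycle and that the first residue is already attached to the resin; deprotection losses, aggregation, and purification are ignored, and the function name is hypothetical.

```python
def overall_yield(per_step_yield, n_residues):
    """Fraction of chains with the full sequence after all coupling cycles.

    Assumes the first residue is pre-loaded on the resin, so a peptide of
    n residues needs n - 1 couplings, each with the same yield.
    """
    couplings = n_residues - 1
    return per_step_yield ** couplings


for length in (10, 50, 100):
    for step in (0.99, 0.995):
        print(f"{length:>3} residues, {step:.1%} per step: "
              f"{overall_yield(step, length):.1%} overall")
```

Under these assumptions, a 99% per-step yield gives roughly 61% full-length product for a 50-residue peptide and roughly 37% for 100 residues, which illustrates why small gains in coupling efficiency matter so much for longer chains.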

Determination

Classical Methods

The classical methods for determining protein primary structure relied on chemical labeling, selective hydrolysis, and chromatographic analysis to identify amino acid sequences step by step, and were developed primarily in the mid-20th century. One foundational approach was end-group labeling, pioneered by Frederick Sanger, which targeted the N-terminal amino acid of a polypeptide chain. In this method, the protein is reacted with 2,4-dinitrofluorobenzene (DNFB), also known as Sanger's reagent, to form a stable dinitrophenyl (DNP) derivative at the free amino group of the N-terminal residue. The labeled protein is then subjected to complete acid hydrolysis, which cleaves all peptide bonds and releases the individual amino acids, including the DNP-labeled N-terminal one, which can be identified and quantified by chromatography thanks to its distinctive yellow color and solubility properties. This technique identified the N-terminal residue but was limited to end groups and required additional strategies for internal sequences.

To address the need for sequential analysis beyond end groups, Pehr Edman introduced a degradation method in 1950 that enabled stepwise removal of N-terminal residues from intact peptides. The process involves treating the peptide with phenylisothiocyanate (PITC), which reacts specifically with the N-terminal amino group to form a phenylthiocarbamyl (PTC) derivative. Mild acid treatment then cleaves this derivative as a phenylthiohydantoin (PTH) amino acid, leaving the rest of the peptide chain intact for further cycles of reaction. Each released PTH-amino acid is identified by chromatography, allowing sequences of up to 50-60 residues to be determined with high specificity, though yields decrease in later cycles due to incomplete reactions. Edman degradation complemented end-group labeling by providing a cyclic way to elucidate longer stretches of the primary structure without destroying the remaining peptide.

For larger proteins, where direct sequencing of the full chain was impractical, proteolytic digestion with specific enzymes was employed to fragment the polypeptide into smaller, overlapping peptides whose sequences could be individually determined and then assembled. Enzymes like trypsin, which cleaves peptide bonds after lysine and arginine residues, or chymotrypsin, which targets aromatic amino acids, were used to generate predictable fragments. These were separated by chromatography or electrophoresis, sequenced using end-group or Edman methods, and aligned based on overlaps from multiple digests with different enzymes or partial acid hydrolysis. This overlap strategy was essential for reconstructing the complete sequence, as it resolved ambiguities in fragment order.

These methods culminated in the first complete determination of a protein's primary structure: the sequencing of bovine insulin by Sanger's group in the early 1950s. Insulin, a 51-residue hormone with two disulfide-linked chains, was oxidized to separate the A (21 residues) and B (30 residues) chains, which were then fragmented using proteolytic enzymes and partial acid hydrolysis to yield over 50 peptides. Sequencing these via DNP labeling revealed the exact order of residues, and subsequent work located the two interchain and one intrachain disulfide bonds, confirming that proteins possess a defined linear sequence of amino acids. This landmark achievement, published in 1951 for the B chain and 1953 for the A chain, established that each protein has a specific, reproducible sequence and earned Sanger the 1958 Nobel Prize in Chemistry.
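
The predictable fragmentation produced by trypsin can be simulated in a few lines. The sketch below is an illustration only: the function name is hypothetical, the input sequence is made up, and the rule that trypsin does not cut when the next residue is proline is a commonly used simplification that is not stated in the text above.

```python
def trypsin_digest(seq):
    """Split a one-letter sequence after each K or R, except when followed by P.

    The 'not before proline' exception is an assumed, widely used simplification.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(seq):
        next_residue = seq[i + 1] if i + 1 < len(seq) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])   # C-terminal fragment without a K/R end
    return peptides


print(trypsin_digest("MKRPAGKLLR"))   # ['MK', 'RPAGK', 'LLR']
```

Running a second digest with a different cleavage rule on the same sequence yields fragments that span the first set's cut sites, which is the basis of the overlap strategy for ordering fragments.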

Modern Techniques

Modern techniques for determining protein primary structure have advanced significantly since the late 20th century, enabling high-throughput analysis of complex proteomes and direct sequencing of peptides. These methods leverage , genomic sequencing, and computational tools to achieve greater speed, sensitivity, and scalability compared to earlier approaches, often integrating multiple technologies in pipelines. Mass spectrometry (MS) stands as a of contemporary , particularly through tandem MS (MS/MS), which fragments peptides to generate sequence-specific ions for identification. In MS/MS workflows, proteins are digested into peptides, ionized, and subjected to or other fragmentation techniques to produce daughter ions whose mass-to-charge ratios reveal order via database matching or de novo sequencing algorithms. This approach excels in resolving ambiguous sequences and handling mixtures, with de novo sequencing particularly useful for novel proteins lacking genomic references. (ESI) and (MALDI) serve as key ionization methods; ESI produces multiply charged ions suitable for online coupling with liquid chromatography (LC), while MALDI generates singly charged ions ideal for imaging and high-molecular-weight analysis. ESI's soft ionization preserves labile modifications, enabling detection of post-translational modifications (PTMs) alongside primary sequence. Next-generation sequencing (NGS) provides an indirect yet powerful route to protein primary structure by determining DNA or RNA sequences, which are translated into amino acid sequences using the genetic code. NGS platforms, such as those from Illumina or Ion Torrent, parallelize millions of sequencing reads to assemble genomes or transcriptomes rapidly, allowing inference of coding regions (exons) and their codon-based protein products. 
This method is especially valuable for organisms with sequenced genomes, where proteome-wide sequences can be predicted ab initio, though it requires validation against direct protein data to account for splicing variants or errors. For example, NGS-enabled whole-genome sequencing has facilitated the annotation of proteomes in model organisms like humans, revealing over 20,000 protein-coding genes. Computational prediction tools complement experimental methods by validating the plausibility of primary sequences by predicting their three-dimensional structures and assessing fold stability. , developed by DeepMind, uses on evolutionary multiple sequence alignments to predict three-dimensional protein structures from input sequences, thereby evaluating if the sequence aligns with biophysical constraints. While not a direct sequencing tool, aids validation in cases of sequencing ambiguity, such as distinguishing isoforms, by scoring how well variants fold into stable structures; for instance, it has achieved high accuracy, with median all-atom RMSD of about 1.5 in benchmarks, for the majority of human proteins. Limitations include reliance on known sequences for input and reduced performance for disordered regions or novel folds. Emerging techniques as of 2025 include , which enables direct, single-molecule analysis of polypeptide chains by detecting ionic current changes as pass through a nanopore. Combined with AI for signal interpretation, these methods offer potential for label-free, high-throughput sequencing of native proteins, addressing limitations of digestion-based approaches. workflows integrate these techniques for large-scale primary structure determination, with liquid chromatography-tandem (LC-MS/MS) as the gold standard for bottom-up analysis. In a typical LC-MS/MS pipeline, proteins are extracted, reduced, alkylated, and enzymatically digested (e.g., with ) into , which are separated by reversed-phase LC before ESI-MS/MS ionization and fragmentation. 
Spectral data are searched against protein sequence databases using search tools such as MaxQuant for identification and assembly into protein sequences, achieving proteome coverage of 5,000 to 10,000 proteins per run in complex samples. These workflows also detect PTMs, such as phosphorylation or glycosylation, by identifying mass shifts in fragment ions, with neutral loss scans enhancing site localization. High-resolution instruments such as Orbitrap analyzers provide sub-ppm mass accuracy, enabling confident de novo sequencing even for PTM-bearing peptides.
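The digestion and fragmentation steps of a bottom-up workflow can be illustrated with a small in-silico sketch. It assumes the common simplified trypsin rule (cleave after K or R unless the next residue is P), no missed cleavages, unmodified residues, and singly charged b and y ions; the function names are hypothetical and are not taken from any search engine. The masses are standard monoisotopic residue masses.

```python
import re

# Monoisotopic residue masses in daltons (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def tryptic_digest(protein):
    """Split after K or R unless the next residue is P (needs Python 3.7+)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def fragment_ions(peptide):
    """Return m/z lists for singly charged b ions and y ions."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions = [sum(masses[:i]) + PROTON for i in range(1, len(masses))]
    y_ions = [sum(masses[-i:]) + WATER + PROTON for i in range(1, len(masses))]
    return b_ions, y_ions


print(tryptic_digest("AKRPGKC"))  # prints ['AK', 'RPGK', 'C']
```

Because leucine and isoleucine have identical masses, this kind of calculation cannot distinguish them. A modification appears as a fixed offset on every fragment ion that contains the modified residue, which is how mass shifts localize PTM sites.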

Representation and Notation

Sequence Notation

Protein primary sequences are conventionally written from the N-terminus to the C-terminus, reflecting the direction of polypeptide chain synthesis in biological systems. This left-to-right notation in linear text representations ensures consistency across the literature and databases.

Two systems exist for denoting amino acids in sequences: the three-letter code, which uses abbreviated names such as Ala for alanine, and the one-letter code, which uses single characters such as A for alanine. The one-letter code is preferred for compact representation of long sequences, while the three-letter code is more readable for short segments or when emphasizing specific residues. These abbreviations are standardized by the International Union of Pure and Applied Chemistry (IUPAC) and the International Union of Biochemistry and Molecular Biology (IUBMB). The IUPAC-IUBMB recommendations specify codes for the 20 standard proteinogenic amino acids, as well as non-standard ones incorporated in some proteins, such as selenocysteine (denoted Sec or U) and pyrrolysine (Pyl or O). Below is a table of the standard abbreviations:
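| Amino Acid | Three-Letter Code | One-Letter Code |
|---|---|---|
| Alanine | Ala | A |
| Arginine | Arg | R |
| Asparagine | Asn | N |
| Aspartic acid | Asp | D |
| Cysteine | Cys | C |
| Glutamine | Gln | Q |
| Glutamic acid | Glu | E |
| Glycine | Gly | G |
| Histidine | His | H |
| Isoleucine | Ile | I |
| Leucine | Leu | L |
| Lysine | Lys | K |
| Methionine | Met | M |
| Phenylalanine | Phe | F |
| Proline | Pro | P |
| Serine | Ser | S |
| Threonine | Thr | T |
| Tryptophan | Trp | W |
| Tyrosine | Tyr | Y |
| Valine | Val | V |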
For ambiguous residues, such as those undetermined by sequencing, the one-letter code "X" (or three-letter "Xaa") indicates an unknown amino acid. Other ambiguity codes include "B" for aspartic acid (D) or asparagine (N), "Z" for glutamic acid (E) or glutamine (Q), and "J" for isoleucine (I) or leucine (L). In alignments, gaps are conventionally written as "-", and some formats use "?" for completely unresolved positions.

In databases, protein sequences are commonly stored and exchanged in the FASTA format, which begins with a header line starting with ">" followed by an identifier, and then the sequence in one-letter code, often wrapped at 60 to 80 characters per line for readability. The UniProt database, a comprehensive resource for protein sequences and annotations, displays and archives entries using these IUPAC one-letter codes, with canonical sequences serving as the reference for positional numbering. This standardization facilitates computational analysis, alignment, and sharing across bioinformatics tools.
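The notation conventions above can be made concrete with a short Python sketch that reads FASTA text and converts a three-letter sequence to one-letter code. The function names, the hyphen separator, and the example record are assumptions made for illustration; production code would normally rely on an established library rather than this minimal parser.

```python
# Three-letter to one-letter codes for the 20 standard amino acids,
# plus selenocysteine, pyrrolysine, and the ambiguity codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
    "Sec": "U", "Pyl": "O", "Asx": "B", "Glx": "Z", "Xle": "J", "Xaa": "X",
}


def three_to_one(seq3, sep="-"):
    """Convert e.g. 'Met-Lys-Thr' to 'MKT' (N-terminus first)."""
    return "".join(THREE_TO_ONE[residue] for residue in seq3.split(sep))


def parse_fasta(text):
    """Return {identifier: sequence}; the identifier is the first header token."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            records[current] = []
        elif current is not None:
            records[current].append(line)  # wrapped sequence lines are joined
    return {name: "".join(parts) for name, parts in records.items()}


example = ">demo hypothetical example record\nMKT\nAYI\n"
print(parse_fasta(example))  # prints {'demo': 'MKTAYI'}
```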

Structural Representations

The primary structure of a protein can be represented visually through linear diagrams that depict the polypeptide chain as a sequential series of amino acids connected by peptide bonds, often drawn as a straight chain to emphasize the covalent linkages without implying three-dimensional folding. These diagrams typically use circles or beads for residues and lines for bonds, highlighting the N-terminal to C-terminal directionality and allowing identification of specific sequences or motifs.

Sequence logos provide a graphical method to encode the conservation and variability within aligned protein sequences. Letters are stacked at each position, and the height of each symbol reflects its frequency and the information content of the position, measured in bits. Developed originally for nucleic acids but widely adapted for proteins, this approach visually summarizes motifs or domains in primary structures from multiple homologs, with taller stacks indicating higher conservation and color-coding often distinguishing physicochemical properties. For instance, in protein families, sequence logos reveal conserved residues critical for function, such as catalytic sites in enzymes.

Software tools like PyMOL enable visualization of the primary chain by rendering the polypeptide backbone as a continuous tube or line trace, often with side chains shown, before higher-level representations are applied. PyMOL can load entries from public databases and display their sequences alongside the model, which is useful for annotating specific residues or bonds. Complementing this, Ramachandran plots offer a diagrammatic view of the backbone conformational constraints inherent to the primary structure, plotting allowed phi (φ) and psi (ψ) dihedral angles for each residue type to show the sterically feasible regions that limit possible chain geometries. These plots, derived from energy calculations, underscore how the sequence of amino acids influences local flexibility, with glycine allowing broader ranges because of its small side chain.

A representative example is the primary structure of myoglobin, a 153-residue oxygen-binding protein, often depicted in linear diagrams with helical regions marked as segments A through H (e.g., A: residues 3-18, E: 58-77), illustrating how the sequence predisposes certain stretches to alpha-helical conformations based on amino acid composition. This visualization highlights eight helical segments connected by non-helical loops, aiding interpretation of how primary sequence elements contribute to the protein's overall fold.
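The bit heights used in sequence logos can be computed from an alignment column by column. The sketch below is a simplified version under stated assumptions: it uses a 20-letter alphabet, treats every character as a residue (gaps are not handled), and omits the small-sample correction that logo software usually applies. The function names are hypothetical.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=20):
    """Information content in bits: log2(alphabet size) minus Shannon entropy."""
    total = len(column)
    counts = Counter(column)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy


def logo_heights(alignment):
    """For each column, map residue -> letter height (frequency x information)."""
    heights = []
    for column in zip(*alignment):
        info = column_information(column)
        counts = Counter(column)
        heights.append({aa: info * n / len(column) for aa, n in counts.items()})
    return heights


aligned = ["ACD", "ACE"]  # toy alignment of equal-length sequences
for position, stack in enumerate(logo_heights(aligned), start=1):
    print(position, {aa: round(h, 3) for aa, h in stack.items()})
```

A fully conserved column reaches the maximum of log2(20), about 4.32 bits, while a column split evenly between two residues carries one bit less.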

Modifications

Post-Translational Modifications

Post-translational modifications (PTMs) are covalent alterations to the amino acid residues of a protein that occur after its ribosomal synthesis, expanding the functional diversity of the primary sequence without altering the length of the polypeptide backbone. These modifications add chemical groups, such as phosphate groups or sugars, to specific side chains, influencing protein stability, localization, activity, and interactions. PTMs are primarily enzymatic processes mediated by dedicated enzymes such as kinases, glycosyltransferases, and ubiquitin ligases, enabling dynamic regulation of cellular processes including signaling and metabolism. Over 400 distinct types of PTMs have been identified across proteomes as of 2024, with mass spectrometry-based proteomics serving as the primary method for their detection and mapping because of its high-throughput capability in identifying modification sites and stoichiometries.

Among the most prevalent are phosphorylation, glycosylation, ubiquitination, and acetylation, each targeting specific residues and serving regulatory roles. Phosphorylation is the addition of a phosphate group from ATP to serine (Ser), threonine (Thr), or tyrosine (Tyr) residues, catalyzed by protein kinases; it reversibly activates or inhibits enzymatic activity and drives signal transduction pathways. Glycosylation attaches carbohydrate moieties either N-linked to asparagine (Asn) in the Asn-X-Ser/Thr sequon (where X is any amino acid except proline) via oligosaccharyltransferase in the endoplasmic reticulum, or O-linked to Ser or Thr by Golgi-resident glycosyltransferases, enhancing protein folding, stability, and cell-cell recognition. Ubiquitination conjugates ubiquitin, a 76-amino-acid protein, to lysine (Lys) residues through a cascade of E1-activating, E2-conjugating, and E3-ligase enzymes, often forming polyubiquitin chains that signal proteasomal degradation or alter protein trafficking and interactions. Acetylation transfers an acetyl group from acetyl-CoA to the ε-amino group of Lys or to the N-terminal amine, catalyzed by histone acetyltransferases (HATs) or non-histone acetyltransferases, which neutralizes positive charges to modulate protein-DNA binding and enzymatic function.

These PTMs play critical roles in cellular regulation and signaling. For instance, acetylation of Lys residues on histone tails promotes chromatin relaxation and transcription in epigenetic control, as demonstrated in studies of acetyltransferase activity during development and disease. Similarly, phosphorylation of Thr and Ser residues of cell cycle regulators by cyclin-dependent kinases (CDKs) drives orderly progression through the G1/S and G2/M transitions by sequentially activating downstream targets. Such modifications underscore the adaptability of the primary structure, and their reversibility (via phosphatases, deglycosylases, deubiquitinases, and deacetylases) allows rapid responses to environmental cues.
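Because the N-glycosylation sequon is defined purely by primary sequence, candidate sites can be located with a simple pattern search. The following Python sketch assumes a one-letter sequence and reports 1-based positions of the asparagine; the function name and test sequence are hypothetical. A matching sequon is necessary but not sufficient for glycosylation, so hits are only candidates.

```python
import re

# Asn, then any residue except Pro, then Ser or Thr.
# The lookahead lets overlapping sequons be reported individually.
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_n_glycosylation_sequons(sequence):
    """Return 1-based positions of Asn residues that start an N-X-S/T sequon."""
    return [match.start() + 1 for match in SEQUON.finditer(sequence.upper())]


print(find_n_glycosylation_sequons("MNGSANPTNKT"))  # prints [2, 9]
```

In the example, the asparagine at position 6 is skipped because it is followed by proline.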

Other Modifications

In addition to post-translational modifications that alter side chains, the primary structure of proteins can undergo other alterations that change the linear sequence or connectivity of the polypeptide chain, such as proteolytic cleavage and enzymatic ligation. These processes are essential for protein maturation, activation, and functional diversification during or after synthesis.

Proteolytic cleavage reshapes primary structure by excising segments from the polypeptide chain, often through specific endoproteases. This includes the removal of N-terminal signal peptides during co-translational translocation into the endoplasmic reticulum, where signal peptidases cleave the hydrophobic signal sequence to yield the mature protein, ensuring proper localization. In zymogen activation, inactive precursors are converted to active enzymes by limited proteolysis; for example, trypsinogen is cleaved by enterokinase at a specific Lys-Ile bond, triggering a conformational change that completes the active site and enables catalysis. Similarly, in insulin maturation, proinsulin undergoes sequential cleavages by prohormone convertases (PC1/3 and PC2) and carboxypeptidase E in the Golgi and secretory granules, removing the C-peptide linker to form mature insulin, which consists of A and B chains linked by disulfide bonds. These cleavages not only refine the primary sequence but also prevent premature activity and facilitate packaging.

Enzymatic ligation counters cleavage by joining polypeptide segments, changing chain connectivity without modifying the residues themselves. Intein-mediated protein splicing is a prominent example: inteins, which are self-splicing protein elements, catalyze their own excision from a precursor protein and simultaneously ligate the flanking extein sequences through a series of nucleophilic attacks, forming a native peptide bond. This process occurs in bacteria, archaea, and eukaryotes, and is harnessed in biotechnology for protein engineering. Transpeptidation, another ligation mechanism, involves enzymes such as sortases or asparaginyl endopeptidases that transfer peptide segments, often through a thioester or other acyl intermediate, to form new isopeptide or peptide bonds; for instance, archaeal connectase performs sequence-specific transpeptidation to join protein fragments efficiently. These ligations enable the assembly of multidomain proteins from separate modules, expanding functional diversity.

Relation to Higher Structures

Secondary Structure

The primary structure of a protein, its linear sequence of amino acids, fundamentally determines the local folding patterns that form secondary structures such as α-helices and β-sheets through the inherent propensities of individual residues. Certain amino acids show strong preferences for specific secondary elements because of their side-chain properties and backbone flexibility. For instance, alanine has a high propensity for α-helices because its small methyl side chain minimizes steric hindrance and is compatible with backbone hydrogen bonding, while valine favors β-sheets owing to its branched aliphatic side chain, which promotes hydrophobic packing in extended conformations. In contrast, proline disrupts α-helices and β-sheets but has a strong propensity for β-turns, as its cyclic side chain restricts backbone rotation and introduces a kink that helps reverse the chain direction. These propensities are statistically derived from analyses of known protein structures and reflect the physicochemical contributions of side chains to local stability.

Prediction of secondary structure from primary sequence relies on empirical parameters that quantify these amino acid preferences, with the Chou-Fasman method providing a foundational approach for assigning α-helices, β-sheets, and turns. In this method, each amino acid is assigned propensity values (P_α for helices, P_β for sheets, and P_t for turns) based on its observed frequency in the secondary structures of experimentally determined proteins. Regions where the average P_α exceeds 1.00 over a window of six residues are predicted as helical, similar thresholds apply for β-sheets, and breakers such as proline terminate elements. The parameters, originally tabulated from 29 non-homologous proteins, enable a rule-based assignment that highlights how sequential patterns of high-propensity residues nucleate and extend secondary elements, achieving accuracies of around 50-60% for broad classifications.
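The window rule described above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the full Chou-Fasman procedure: it applies only the six-residue average-P_α criterion, ignores extension rules, β-sheet and turn propensities, and helix breakers. The P_α values are approximate figures as commonly tabulated for the method and should be treated as illustrative; the function name is hypothetical.

```python
# Approximate helix propensities (P_alpha) as commonly tabulated for the
# Chou-Fasman method; treat these values as illustrative assumptions.
P_ALPHA = {
    "E": 1.51, "M": 1.45, "A": 1.42, "L": 1.21, "K": 1.16,
    "F": 1.13, "Q": 1.11, "W": 1.08, "I": 1.08, "V": 1.06,
    "D": 1.01, "H": 1.00, "R": 0.98, "T": 0.83, "S": 0.77,
    "C": 0.70, "Y": 0.69, "N": 0.67, "P": 0.57, "G": 0.57,
}


def predict_helix(sequence, window=6, threshold=1.00):
    """Mark residues covered by any window whose mean P_alpha exceeds threshold."""
    seq = sequence.upper()
    helical = [False] * len(seq)
    for start in range(len(seq) - window + 1):
        segment = seq[start:start + window]
        mean_propensity = sum(P_ALPHA[aa] for aa in segment) / window
        if mean_propensity > threshold:
            for i in range(start, start + window):
                helical[i] = True
    return "".join("H" if flag else "-" for flag in helical)


print(predict_helix("EAAAKAGPGPGP"))  # prints HHHHHHHH----
```

The example shows a known weakness of the bare window rule: the first glycine and proline are still labeled helical because they fall inside windows dominated by strong helix formers, which is why the original method adds breaker and boundary rules.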
The formation of secondary structures involves cooperative transitions in which the primary sequence governs nucleation, the energetically unfavorable initiation of a short structural segment, and propagation, the favorable extension of that segment through sequential hydrogen bond formation. In α-helix formation, nucleation requires overcoming an entropy penalty for aligning the first few residues and is often favored by particular residues at N-terminal positions, while propagation depends on a different set of interactions, such as backbone hydrogen bonding stabilized by residues like glutamate. This process is described by the Zimm-Bragg theory, which treats the helix-coil transition as a one-dimensional Ising-like model with a nucleation parameter σ (typically 10⁻⁴ to 10⁻²) and a propagation parameter s (greater than 1 for helix-stabilizing residues), illustrating how primary sequence motifs control the balance between coil and ordered states. A representative example is the coiled coil, where repeating heptad sequences (e.g., (a-b-c-d-e-f-g)_n with hydrophobic residues at the a and d positions) promote nucleation of individual α-helices and their propagation into a dimer, which is essential for filament assembly in structural proteins.
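The cooperative behavior captured by σ and s can be explored numerically. The sketch below evaluates a Zimm-Bragg-style partition function for a homopolymer by iterating a two-state transfer scheme and then estimates the mean helix fraction as θ = (1/N) ∂ln Z/∂ln s with a finite difference. It assumes the simplest statistical weights (coil = 1, helix following helix = s, helix following coil = σs) and a single s for every residue; the function name and parameter choices are illustrative.

```python
import math


def helix_fraction(s, sigma, n, rel_step=1e-5):
    """Mean fraction of helical residues in a chain of n identical residues."""

    def log_partition(s_value):
        # coil / helix hold the (rescaled) total weight of chains whose
        # last residue is coil or helix, respectively.
        coil, helix = 1.0, sigma * s_value
        log_scale = 0.0
        for _ in range(n - 1):
            coil, helix = coil + helix, coil * sigma * s_value + helix * s_value
            norm = coil + helix          # rescale to avoid overflow for long chains
            coil, helix = coil / norm, helix / norm
            log_scale += math.log(norm)
        return log_scale + math.log(coil + helix)

    # Central finite difference of ln Z with respect to ln s.
    upper = log_partition(s * (1 + rel_step))
    lower = log_partition(s * (1 - rel_step))
    d_ln_s = math.log(1 + rel_step) - math.log(1 - rel_step)
    return (upper - lower) / d_ln_s / n


for s in (0.8, 1.0, 1.2):
    print(s, round(helix_fraction(s, sigma=1e-3, n=200), 3))
```

With σ = 1 the model loses cooperativity and the helix fraction reduces to s/(1 + s); small σ values sharpen the transition around s = 1, which is the qualitative point of the theory.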

Tertiary and Quaternary Structures

The primary structure of a protein encodes the information necessary for its native tertiary conformation, as established by Anfinsen's thermodynamic hypothesis, which posits that the amino acid sequence determines the thermodynamically stable three-dimensional structure under physiological conditions. This hypothesis, derived from experiments on ribonuclease A refolding, implies that the native state represents the global free energy minimum. Folding can then be pictured as descent through a funnel-shaped energy landscape in which unfolded ensembles progressively lose conformational entropy while lowering their energy. In this funnel model, the primary sequence biases the energy landscape toward productive folding pathways, avoiding kinetic traps that could lead to misfolding.

Tertiary structure arises from long-range interactions between residues that are distant in the primary sequence, driven primarily by the hydrophobic effect, in which non-polar side chains cluster to form a stabilizing core shielded from the aqueous solvent. This burial of hydrophobic side chains contributes the dominant energetic force for folding, with the core providing mechanical rigidity. Covalent disulfide bonds between cysteine residues further stabilize the tertiary fold by linking spatially separated segments, as observed in protein refolding studies, where these bonds lock the structure against unfolding. Hydrogen bonds, ionic interactions, and van der Waals contacts between polar and charged residues complement these forces, fine-tuning the overall architecture dictated by the sequence.

In quaternary structures, the primary sequences of individual subunits encode specific motifs that mediate inter-subunit interfaces, enabling assembly into functional complexes. For instance, in hemoglobin, the α and β subunit sequences feature complementary hydrophobic and electrostatic patches at their interfaces, such as the α1β1 contact involving residues like Phe42 (α) and Asp99 (β), which facilitate tetramer formation and cooperativity. These sequence-determined interfaces bury significant surface area upon assembly, enhancing stability and enabling cooperative functions such as oxygen binding at heme sites coordinated by invariant histidines in the sequences.

Advances in computational prediction have used primary sequence to model tertiary and quaternary structures with unprecedented accuracy, exemplified by AlphaFold, which applies deep learning to sequence alignments to infer 3D coordinates from evolutionary patterns. This approach, which achieves near-atomic accuracy for many globular proteins, underscores how sequence encodes folding by capturing residue-residue distance distributions. However, predictions falter for intrinsically disordered proteins, whose sequences lack strong evolutionary constraints and do not form stable cores, resulting in low-confidence models that reflect ensemble dynamics rather than unique folds.

Historical Development

Early Discoveries

The concept of proteins as complex organic substances began to take shape in the 19th century through chemical analyses that revealed their composition. In 1820, French chemist Henri Braconnot conducted early hydrolysis experiments on gelatin using sulfuric acid and heat, breaking it down into a sweet-tasting substance he named "glycine," one of the first identified amino acids, suggesting proteins could be decomposed into simpler components. Building on this, Dutch chemist Gerardus Johannes Mulder analyzed various animal and plant materials in the 1830s, finding they shared a consistent elemental composition rich in nitrogen and carbon. At Jöns Jacob Berzelius's suggestion, Mulder used the term "protein" (from the Greek "proteios," meaning primary) in his 1838 publication to describe this ubiquitous substance, proposing it as a fundamental building block of life. Mulder further demonstrated through alkaline hydrolysis with sodium hydroxide that proteins from sources such as egg white, blood, and gluten yielded similar nitrogenous products, hinting at a polymeric structure composed of amino acid-like units.

In the early 20th century, German chemist Emil Fischer provided critical evidence for the linkage between amino acids in proteins. In 1901, Fischer and Ernest Fourneau synthesized the first dipeptide, glycyl-glycine, by condensing two glycine molecules, demonstrating that amino acids could be joined via an amide bond. In 1907, Fischer extended this work by synthesizing longer polypeptides of up to 18 residues, supporting his proposal that the "peptide bond," a specific carboxyl-amino linkage, is the repeating unit connecting amino acids in proteins. The proposal also rested on hydrolysis studies showing that natural proteins release free amino acids matching those in his synthetic peptides. These experiments established proteins as linear chains of amino acids, though the exact sequence remained unknown.

Fischer's collaborator, Emil Abderhalden, contributed to early efforts at characterizing sequences in the 1900s through detailed hydrolysis and enzymatic digestion studies. Joining Fischer's lab in 1902, Abderhalden analyzed protein hydrolysates using proteolytic enzymes to isolate and identify dipeptides and tripeptides, such as alanylglycine, confirming their presence in natural proteins and supporting the idea of specific amino acid arrangements rather than random aggregates. His work involved partial sequencing by stepwise degradation, revealing recurring motifs in silk fibroin and other proteins, though it was limited by the technology of the era.

By the 1930s, physical methods began to suggest ordered internal structures within proteins, complementing chemical insights. Austrian-born biophysicist Max Perutz, working at Cambridge's Cavendish Laboratory, obtained X-ray diffraction patterns of hemoglobin crystals in 1937, after crystallizing the protein from horse blood. These patterns indicated a highly ordered molecular arrangement, implying that the amino acid chain adopted a specific three-dimensional configuration essential for function. Perutz's initial photographs provided early evidence that proteins like hemoglobin possess regular, non-random structures underlying their crystalline order.

Key Milestones

In 1951, Frederick Sanger and his collaborator Hans Tuppy published the complete amino acid sequence of the phenylalanyl chain of bovine insulin, the first such determination for a protein chain. This breakthrough culminated in the full sequencing of insulin's two chains by 1955, demonstrating that proteins possess defined linear sequences of amino acids essential to their function. Sanger's meticulous use of partial hydrolysis, chromatography, and end-group analysis established the foundational principles of protein sequencing, earning him the Nobel Prize in Chemistry in 1958.

The development of Edman degradation by Swedish biochemist Pehr Edman revolutionized protein sequencing by enabling the sequential removal and identification of N-terminal residues without disrupting the rest of the polypeptide chain. First described in 1949 and refined through the 1950s, this method, based on phenylisothiocyanate, allowed the analysis of up to 50-60 residues in peptides, far surpassing prior techniques limited to short fragments. Its automation in the 1960s and 1970s, particularly through instruments such as the spinning-cup sequenator, facilitated the sequencing of longer proteins and accelerated protein research.

The advent of DNA sequencing in the 1970s connected gene analysis with protein biochemistry, allowing primary structures to be inferred from the corresponding gene sequences via the genetic code. In 1977, Walter Gilbert and Allan Maxam introduced a chemical cleavage method that sequenced DNA directly by generating base-specific fragments, while Frederick Sanger independently developed chain-termination sequencing using dideoxynucleotides, which became the dominant technique because of its efficiency and accuracy. These methods enabled the rapid determination of nucleotide order in genes, thereby predicting the amino acid sequences of encoded proteins and shifting the study of primary structure from labor-intensive direct protein analysis to genome-driven inference.

The completion of the Human Genome Project in 2003 provided a reference sequence for the human genome, encompassing approximately 20,000 protein-coding genes and enabling large-scale proteome inference. This milestone allowed scientists to deduce the primary structures of virtually all human proteins from genomic data, bypassing the limitations of traditional sequencing and fostering comparative genomics to identify sequence variation across species. Integrated with post-genomic tools, it laid the groundwork for systematic annotation of protein sequences and their functional implications.

Advances in computational prediction culminated in DeepMind's AlphaFold, which from 2018 onward dramatically enhanced the utility of primary sequences by accurately modeling three-dimensional structures. AlphaFold's debut at the CASP13 competition in 2018 showcased improved sequence-based folding predictions, and its 2020-2021 iteration (AlphaFold 2) achieved near-experimental accuracy for diverse proteins, as validated in CASP14. By 2022, predicted structures were available for nearly all catalogued protein sequences in public databases, linking primary structure directly to higher-order folds and accelerating drug discovery and evolutionary studies.
