Protein primary structure

Protein primary structure is the linear sequence of amino acids in a peptide or protein.[1] By convention, the primary structure of a protein is reported starting from the amino-terminal (N) end to the carboxyl-terminal (C) end. Protein biosynthesis is most commonly performed by ribosomes in cells. Peptides can also be synthesized in the laboratory. Protein primary structures can be directly sequenced, or inferred from DNA sequences.
Formation
Biological
Amino acids are polymerised via peptide bonds to form a long backbone, with the different amino acid side chains protruding along it. In biological systems, proteins are produced during translation by a cell's ribosomes. Some organisms can also make short peptides by non-ribosomal peptide synthesis, which often uses amino acids other than the encoded 22; the resulting peptides may be cyclised, modified and cross-linked.
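As a rough illustration of the decoding step that ribosomes perform during translation, the sketch below maps an mRNA string to a one-letter protein sequence using the standard genetic code. The function name `translate`, the compact table construction and the example mRNA are illustrative assumptions, not taken from the cited sources. The table covers only the 20 standard amino acids plus stop codons, so recoding to selenocysteine or pyrrolysine is not modelled, and start-site selection is reduced to "first AUG found".

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code with codons ordered UUU, UUC, UUA, UUG, UCU, ... GGG.
# '*' marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate from the first AUG to the first in-frame stop codon."""
    mrna = mrna.upper().replace("T", "U")  # accept DNA-style input as well
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, nothing to translate
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[i:i + 3]]
        if residue == "*":
            break  # stop codon: release the chain
        protein.append(residue)
    return "".join(protein)  # written N-terminus to C-terminus


print(translate("AUGGUGCACCUGACUCCUGAGGAGAAGUAA"))  # prints MVHLTPEEK
```

The output is written from the N-terminus to the C-terminus, matching the order in which the ribosome adds residues.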
Chemical
Peptides can be synthesised chemically via a range of laboratory methods. Chemical methods typically synthesise peptides in the opposite order (starting at the C-terminus) to biological protein synthesis (starting at the N-terminus).
Notation
Protein sequence is typically notated as a string of letters, listing the amino acids starting at the amino-terminal end through to the carboxyl-terminal end. Either a three-letter code or a single-letter code can be used to represent the 22 naturally encoded amino acids, as well as mixtures or ambiguous amino acids (similar to nucleic acid notation).[1][2][3]
Peptides can be directly sequenced, or inferred from DNA sequences. Large sequence databases now exist that collate known protein sequences.
| Amino Acid | 3-Letter[4] | 1-Letter[4] |
|---|---|---|
| Alanine | Ala | A |
| Arginine | Arg | R |
| Asparagine | Asn | N |
| Aspartic acid | Asp | D |
| Cysteine | Cys | C |
| Glutamic acid | Glu | E |
| Glutamine | Gln | Q |
| Glycine | Gly | G |
| Histidine | His | H |
| Isoleucine | Ile | I |
| Leucine | Leu | L |
| Lysine | Lys | K |
| Methionine | Met | M |
| Phenylalanine | Phe | F |
| Proline | Pro | P |
| Pyrrolysine | Pyl | O |
| Selenocysteine | Sec | U |
| Serine | Ser | S |
| Threonine | Thr | T |
| Tryptophan | Trp | W |
| Tyrosine | Tyr | Y |
| Valine | Val | V |

| Symbol | Description | Residues represented |
|---|---|---|
| X | Any amino acid, or unknown | All |
| B | Aspartate or Asparagine | D, N |
| Z | Glutamate or Glutamine | E, Q |
| J | Leucine or Isoleucine | I, L |
| Φ | Hydrophobic | V, I, L, F, W, M |
| Ω | Aromatic | F, W, Y, H |
| Ψ | Aliphatic | V, I, L, M |
| π | Small | P, G, A, S |
| ζ | Hydrophilic | S, T, H, N, Q, E, D, K, R, Y |
| + | Positively charged | K, R, H |
| - | Negatively charged | D, E |
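The two tables above can be put to work in a short sketch. It converts a three-letter sequence to one-letter notation and checks whether a concrete sequence is compatible with a pattern containing the ambiguity symbols X, B, Z and J. The names `to_one_letter` and `is_compatible`, and the hyphen-separated input format, are assumptions made for illustration only. The residue-class symbols (Φ, Ω and so on) are left out to keep the example short.

```python
# Three-letter to one-letter codes, copied from the table above.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C", "Glu": "E",
    "Gln": "Q", "Gly": "G", "His": "H", "Ile": "I", "Leu": "L", "Lys": "K",
    "Met": "M", "Phe": "F", "Pro": "P", "Pyl": "O", "Sec": "U", "Ser": "S",
    "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

# Ambiguity symbols and the residues each one may stand for.
AMBIGUOUS = {
    "X": set(THREE_TO_ONE.values()),
    "B": {"D", "N"},
    "Z": {"E", "Q"},
    "J": {"I", "L"},
}


def to_one_letter(sequence: str, sep: str = "-") -> str:
    """Convert e.g. 'Met-Val-His' (N- to C-terminus) to 'MVH'."""
    return "".join(
        THREE_TO_ONE[name.strip().capitalize()] for name in sequence.split(sep)
    )


def is_compatible(pattern: str, sequence: str) -> bool:
    """True if every residue is allowed by the symbol at the same position."""
    if len(pattern) != len(sequence):
        return False
    return all(
        residue in AMBIGUOUS.get(symbol, {symbol})
        for symbol, residue in zip(pattern, sequence)
    )


print(to_one_letter("Met-Val-His-Leu"))  # prints MVHL
print(is_compatible("MKB", "MKD"))       # prints True
```

A symbol that is not in `AMBIGUOUS` is treated as a literal residue, so an ordinary one-letter sequence is compatible only with itself.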
Modification
In general, polypeptides are unbranched polymers, so their primary structure can often be specified by the sequence of amino acids along their backbone. However, proteins can become cross-linked, most commonly by disulfide bonds, and the primary structure then also requires specifying the cross-linking atoms, e.g., the cysteines involved in the protein's disulfide bonds. Other crosslinks include desmosine.
Isomerisation
The chiral centers of a polypeptide chain can undergo racemization. Although this does not change the sequence, it does affect the chemical properties of the chain. In particular, the L-amino acids normally found in proteins can spontaneously isomerize at the Cα atom to form D-amino acids, which cannot be cleaved by most proteases. Additionally, proline can form stable cis-isomers at the peptide bond.
Post-translational modification
Additionally, the protein can undergo a variety of post-translational modifications, which are briefly summarized here.
The N-terminal amino group of a polypeptide can be modified covalently, e.g.,

- acetylation
- The positive charge on the N-terminal amino group may be eliminated by changing it to an acetyl group (N-terminal blocking).
- formylation
- The N-terminal methionine usually found after translation has an N-terminus blocked with a formyl group. This formyl group (and sometimes the methionine residue itself, if followed by Gly or Ser) is removed by the enzyme deformylase.
- pyroglutamate
- An N-terminal glutamine can attack itself, forming a cyclic pyroglutamate group.
- myristoylation
- Similar to acetylation. Instead of a simple methyl group, the myristoyl group has a tail of 14 hydrophobic carbons, which make it ideal for anchoring proteins to cellular membranes.
The C-terminal carboxylate group of a polypeptide can also be modified, e.g.,

- amination (see Figure)
- The C-terminus can also be blocked (thus, neutralizing its negative charge) by amination.
- glycosyl phosphatidylinositol (GPI) attachment
- Glycosyl phosphatidylinositol (GPI) is a large, hydrophobic phospholipid prosthetic group that anchors proteins to cellular membranes. It is attached to the polypeptide C-terminus through an amide linkage that then connects to ethanolamine, thence to sundry sugars and finally to the phosphatidylinositol lipid moiety.
Finally, the peptide side chains can also be modified covalently, e.g.,
- phosphorylation
- Aside from cleavage, phosphorylation is perhaps the most important chemical modification of proteins. A phosphate group can be attached to the sidechain hydroxyl group of serine, threonine and tyrosine residues, adding a negative charge at that site and producing an unnatural amino acid. Such reactions are catalyzed by kinases and the reverse reaction is catalyzed by phosphatases. The phosphorylated tyrosines are often used as "handles" by which proteins can bind to one another, whereas phosphorylation of Ser/Thr often induces conformational changes, presumably because of the introduced negative charge. The effects of phosphorylating Ser/Thr can sometimes be simulated by mutating the Ser/Thr residue to glutamate.
- glycosylation
- A catch-all name for a set of very common and very heterogeneous chemical modifications. Sugar moieties can be attached to the sidechain hydroxyl groups of Ser/Thr or to the sidechain amide groups of Asn. Such attachments can serve many functions, ranging from increasing solubility to complex recognition. N-linked glycosylation can be blocked with certain inhibitors, such as tunicamycin.
- deamidation (succinimide formation)
- In this modification, an asparagine or aspartate side chain attacks the following peptide bond, forming a symmetrical succinimide intermediate. Hydrolysis of the intermediate produces either aspartate or the β-amino acid, iso(Asp). For asparagine, either product results in the loss of the amide group, hence "deamidation".
- hydroxylation
- Proline residues may be hydroxylated at either of two atoms, as can lysine (at one atom). Hydroxyproline is a critical component of collagen, which becomes unstable upon its loss. The hydroxylation reaction is catalyzed by an enzyme that requires ascorbic acid (vitamin C), deficiencies in which lead to many connective-tissue diseases such as scurvy.
- methylation
- Several protein residues can be methylated, most notably the positive groups of lysine and arginine. Arginine residues interact with the nucleic acid phosphate backbone and commonly form hydrogen bonds with the base residues, particularly guanine, in protein–DNA complexes. Lysine residues can be singly, doubly and even triply methylated. Methylation does not alter the positive charge on the side chain, however.
- acetylation
- Acetylation of the lysine amino groups is chemically analogous to the acetylation of the N-terminus. Functionally, however, the acetylation of lysine residues is used to regulate the binding of proteins to nucleic acids. The cancellation of the positive charge on the lysine weakens the electrostatic attraction for the (negatively charged) nucleic acids.
- sulfation
- Tyrosines may become sulfated on their side-chain hydroxyl oxygen atom. Somewhat unusually, this modification occurs in the Golgi apparatus, not in the endoplasmic reticulum. Similar to phosphorylated tyrosines, sulfated tyrosines are used for specific recognition, e.g., in chemokine receptors on the cell surface. As with phosphorylation, sulfation adds a negative charge to a previously neutral site.
- prenylation and palmitoylation
- The hydrophobic isoprene (e.g., farnesyl, geranyl, and geranylgeranyl groups) and palmitoyl groups may be added to the sulfur atom of cysteine residues to anchor proteins to cellular membranes. Unlike the GPI and myristoyl anchors, these groups are not necessarily added at the termini.
- carboxylation
- A relatively rare modification that adds an extra carboxylate group (and, hence, a double negative charge) to a glutamate side chain, producing a Gla residue. This is used to strengthen the binding to "hard" metal ions such as calcium.
- ADP-ribosylation
- The large ADP-ribosyl group can be transferred to several types of side chains within proteins, with heterogeneous effects. This modification is a target for the powerful toxins of disparate bacteria, e.g., Vibrio cholerae, Corynebacterium diphtheriae and Bordetella pertussis.
- ubiquitination and related protein attachment
- Various full-length, folded proteins can be attached at their C-termini to the sidechain ammonium groups of lysines of other proteins. Ubiquitin is the most common of these, and usually signals that the ubiquitin-tagged protein should be degraded.
Most of the polypeptide modifications listed above occur post-translationally, i.e., after the protein has been synthesized on the ribosome, typically occurring in the endoplasmic reticulum, a subcellular organelle of the eukaryotic cell.
Many other chemical reactions (e.g., cyanylation) have been applied to proteins by chemists, although they are not found in biological systems.
Cleavage and ligation
In addition to those listed above, the most important modification of primary structure is peptide cleavage (by chemical hydrolysis or by proteases). Proteins are often synthesized in an inactive precursor form; typically, an N-terminal or C-terminal segment blocks the active site of the protein, inhibiting its function. The protein is activated by cleaving off the inhibitory peptide.
Some proteins even have the power to cleave themselves. Typically, the hydroxyl group of a serine (rarely, threonine) or the thiol group of a cysteine residue will attack the carbonyl carbon of the preceding peptide bond, forming a tetrahedrally bonded intermediate [classified as a hydroxyoxazolidine (Ser/Thr) or hydroxythiazolidine (Cys) intermediate]. This intermediate tends to revert to the amide form, expelling the attacking group, since the amide form is usually favored by free energy, (presumably due to the strong resonance stabilization of the peptide group). However, additional molecular interactions may render the amide form less stable; the amino group is expelled instead, resulting in an ester (Ser/Thr) or thioester (Cys) bond in place of the peptide bond. This chemical reaction is called an N-O acyl shift.
The ester/thioester bond can be resolved in several ways:
- Simple hydrolysis will split the polypeptide chain, where the displaced amino group becomes the new N-terminus. This is seen in the maturation of glycosylasparaginase.
- A β-elimination reaction also splits the chain, but results in a pyruvoyl group at the new N-terminus. This pyruvoyl group may be used as a covalently attached catalytic cofactor in some enzymes, especially decarboxylases such as S-adenosylmethionine decarboxylase (SAMDC) that exploit the electron-withdrawing power of the pyruvoyl group.
- Intramolecular transesterification, resulting in a branched polypeptide. In inteins, the new ester bond is broken by an intramolecular attack by the soon-to-be C-terminal asparagine.
- Intermolecular transesterification can transfer a whole segment from one polypeptide to another, as is seen in the Hedgehog protein autoprocessing.
History
The proposal that proteins were linear chains of α-amino acids was made nearly simultaneously by two scientists at the same conference in 1902, the 74th meeting of the Society of German Scientists and Physicians, held in Karlsbad. Franz Hofmeister made the proposal in the morning, based on his observations of the biuret reaction in proteins. Hofmeister was followed a few hours later by Emil Fischer, who had amassed a wealth of chemical details supporting the peptide-bond model. For completeness, the proposal that proteins contained amide linkages was made as early as 1882 by the French chemist E. Grimaux.[5]
Despite these data and later evidence that proteolytically digested proteins yielded only oligopeptides, the idea that proteins were linear, unbranched polymers of amino acids was not accepted immediately. Some scientists such as William Astbury doubted that covalent bonds were strong enough to hold such long molecules together; they feared that thermal agitations would shake such long molecules asunder. Hermann Staudinger faced similar prejudices in the 1920s when he argued that rubber was composed of macromolecules.[5]
Thus, several alternative hypotheses arose. The colloidal protein hypothesis stated that proteins were colloidal assemblies of smaller molecules. This hypothesis was disproved in the 1920s by ultracentrifugation measurements by Theodor Svedberg that showed that proteins had a well-defined, reproducible molecular weight and by electrophoretic measurements by Arne Tiselius that indicated that proteins were single molecules. A second hypothesis, the cyclol hypothesis advanced by Dorothy Wrinch, proposed that the linear polypeptide underwent a chemical cyclol rearrangement, C=O + HN → C(OH)-N, that crosslinked its backbone amide groups, forming a two-dimensional fabric. Other primary structures of proteins were proposed by various researchers, such as the diketopiperazine model of Emil Abderhalden and the pyrrol/piperidine model of Troensegaard in 1942. Although never given much credence, these alternative models were finally disproved when Frederick Sanger successfully sequenced insulin and by the crystallographic determination of myoglobin and hemoglobin by Max Perutz and John Kendrew.
Relation to secondary and tertiary structure
The primary structure of a biological polymer to a large extent determines the three-dimensional shape (tertiary structure). Protein sequence can be used to predict local features, such as segments of secondary structure, or trans-membrane regions. However, the complexity of protein folding currently prohibits predicting the tertiary structure of a protein from its sequence alone. Knowing the structure of a similar homologous sequence (for example a member of the same protein family) allows highly accurate prediction of the tertiary structure by homology modeling. If the full-length protein sequence is available, it is possible to estimate its general biophysical properties, such as its isoelectric point.
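To make the last point concrete, the following is a minimal sketch of how an isoelectric point might be estimated from sequence alone. It sums Henderson–Hasselbalch charges over the ionizable groups and bisects for the pH at which the net charge is zero. The pKa values are illustrative assumptions (published scales differ by several tenths of a unit), the function names are hypothetical, and the model ignores post-translational modifications and any influence of the folded environment on individual pKa values.

```python
# Illustrative pKa values (assumed for this sketch; published scales differ).
POSITIVE_PKA = {"K": 10.5, "R": 12.5, "H": 6.0}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1}
N_TERM_PKA = 9.0
C_TERM_PKA = 3.1


def net_charge(sequence: str, ph: float) -> float:
    """Net charge of an unmodified linear chain at the given pH."""
    # Free N-terminal amino group (positive when protonated).
    charge = 1.0 / (1.0 + 10 ** (ph - N_TERM_PKA))
    # Free C-terminal carboxyl group (negative when deprotonated).
    charge -= 1.0 / (1.0 + 10 ** (C_TERM_PKA - ph))
    for residue in sequence:
        if residue in POSITIVE_PKA:
            charge += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[residue]))
        elif residue in NEGATIVE_PKA:
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[residue] - ph))
    return charge


def isoelectric_point(sequence: str, tol: float = 1e-4) -> float:
    """Bisect on pH; net charge decreases monotonically as pH rises."""
    low, high = 0.0, 14.0
    while high - low > tol:
        mid = (low + high) / 2
        if net_charge(sequence, mid) > 0:
            low = mid   # still positively charged: pI lies at higher pH
        else:
            high = mid  # negatively charged: pI lies at lower pH
    return (low + high) / 2


print(round(isoelectric_point("ACDEFGHIK"), 2))
```

Bisection works here because each term in the sum is a decreasing function of pH, so the net charge crosses zero exactly once between pH 0 and pH 14.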
Notes and references
- [1] Sanger, F (1952). "The arrangement of amino acids in proteins". In Anson, M.L.; Bailey, Kenneth; Edsall, John T. (eds.). Advances in Protein Chemistry. Vol. 7. pp. 1–67. doi:10.1016/S0065-3233(08)60017-0. PMID 14933251.
- [2] Aasland, Rein; Abrams, Charles; Ampe, Christophe; Ball, Linda J.; Bedford, Mark T.; Cesareni, Gianni; Gimona, Mario; Hurley, James H.; Jarchau, Thomas (2002-02-20). "Normalization of nomenclature for peptide motifs as ligands of modular protein domains". FEBS Letters. 513 (1): 141–144. Bibcode:2002FEBSL.513..141A. doi:10.1016/S0014-5793(01)03295-1. ISSN 1873-3468. PMID 11911894.
- [3] IUPAC-IUB Commission on Biochemical Nomenclature (July 1968). "A One‐Letter Notation for Amino Acid Sequences: Tentative Rules". European Journal of Biochemistry. 5 (2): 151–153. doi:10.1111/j.1432-1033.1968.tb00350.x.
- [4] Hausman, Robert E.; Cooper, Geoffrey M. (2004). The cell: a molecular approach. Washington, D.C.: ASM Press. p. 51. ISBN 978-0-87893-214-6.
- [5] Fruton, Joseph S. (May 1979). "Early theories of protein structure". Annals of the New York Academy of Sciences. 325 (1): xiv, 1–18. Bibcode:1979NYASA.325....1F. doi:10.1111/j.1749-6632.1979.tb14125.x. PMID 378063. S2CID 39125170.
Protein primary structure
Definition and Fundamentals
Definition
The primary structure of a protein refers to the linear sequence of amino acids covalently linked by peptide bonds to form a polypeptide chain.[1] This sequence is conventionally described from the amino (N)-terminus, where the free amino group is located, to the carboxyl (C)-terminus, where the free carboxyl group resides.[5] The key components of this structure include the 20 standard amino acids encoded by the genetic code, which are joined through their alpha-amino and alpha-carboxyl groups via peptide bonds, resulting in an unbranched chain unless post-translational modifications occur.[3] These peptide bonds are amide linkages formed by dehydration synthesis, creating a rigid, planar backbone that defines the one-dimensional nature of the primary structure.[1] Unlike higher levels of protein organization, the primary structure represents the simplest, sequential arrangement without considering spatial folding, hydrogen bonding patterns, or non-covalent interactions that give rise to secondary, tertiary, or quaternary structures.[6] For instance, in the hormone insulin, the primary structure consists of two distinct polypeptide chains, A (21 amino acids) and B (30 amino acids), that emerge as separate sequences after enzymatic cleavage of a precursor protein, proinsulin, though they are connected by disulfide bonds in the mature form.[7]
Biological Importance
The primary structure of a protein, defined by its linear sequence of amino acids, is fundamental to its biological function as it determines the higher-order folding and thus the precise three-dimensional arrangement necessary for activity. Specific sequences enable the formation of active sites in enzymes, where catalytic residues interact with substrates to facilitate reactions, while also influencing binding affinities for ligands, cofactors, or other molecules through complementary physicochemical properties of side chains. For instance, the arrangement of polar, nonpolar, acidic, or basic amino acids in the primary sequence dictates interactions that stabilize secondary structures like alpha helices or beta sheets, ultimately positioning residues for enzymatic catalysis or molecular recognition.[1][8] The vast combinatorial diversity arising from the 20 standard amino acids allows for an enormous repertoire of proteins, far exceeding the needs of any organism and underpinning biological complexity. For a typical protein of 100 residues, the theoretical number of possible sequences exceeds 10^130, enabling the evolution of specialized functions tailored to diverse cellular environments. This sequence space provides the raw material for natural selection, where mutations—such as single nucleotide polymorphisms or insertions/deletions—alter the primary structure, potentially conferring adaptive advantages like enhanced stability or novel binding properties in response to environmental pressures.[9][10] Alterations in primary structure due to mutations can also lead to pathological conditions by disrupting normal protein function. A classic example is sickle cell anemia, caused by a single point mutation in the β-globin gene that substitutes glutamic acid (Glu) with valine (Val) at the sixth position of the hemoglobin β-chain, resulting in abnormal hemoglobin polymerization, red blood cell deformation, and impaired oxygen transport. 
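Both the sequence-space estimate and the sickle cell substitution described above can be reproduced in a few lines. This is a sketch only: the helper `substitute` is a hypothetical name, and the eight-residue string used for the start of the mature β-globin chain is written from memory as an assumption and should be checked against a database record before being relied on.

```python
import math

# Number of possible sequences for a 100-residue chain built from 20 residue types.
N_RESIDUE_TYPES = 20
LENGTH = 100
n_sequences = N_RESIDUE_TYPES ** LENGTH                 # exact integer
log10_sequences = LENGTH * math.log10(N_RESIDUE_TYPES)  # order of magnitude

print(f"about 10^{log10_sequences:.1f} sequences")      # about 10^130.1 sequences


def substitute(sequence: str, position: int, new_residue: str) -> str:
    """Replace the residue at a 1-based position, counted from the N-terminus."""
    index = position - 1
    return sequence[:index] + new_residue + sequence[index + 1:]


# Assumed first eight residues of the mature beta-globin chain (illustrative).
beta_globin_start = "VHLTPEEK"
sickle_start = substitute(beta_globin_start, 6, "V")    # Glu -> Val at position 6

print(beta_globin_start, "->", sickle_start)            # VHLTPEEK -> VHLTPVEK
```

The two strings differ at a single position, which is the entire change to the primary structure.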
Such changes highlight how even minor sequence variations can cascade into severe diseases, emphasizing the primary structure's role in maintaining physiological homeostasis.[11][1]
Synthesis
Biological Synthesis
The biological synthesis of a protein's primary structure occurs through the process of translation, in which the nucleotide sequence of messenger RNA (mRNA) is decoded by ribosomes to assemble a linear polypeptide chain from amino acids. Ribosomes, composed of ribosomal RNA (rRNA) and proteins, serve as the molecular machines that facilitate this decoding, while transfer RNA (tRNA) molecules act as adaptors, each carrying a specific amino acid and bearing an anticodon that base-pairs with complementary codons on the mRNA. This codon-anticodon recognition ensures that the sequence of three-nucleotide codons in the mRNA directly dictates the order of amino acids in the protein, establishing the primary structure with high precision.[12] The genetic code underlying this process is a non-overlapping triplet code, where successive groups of three nucleotides (codons) in the mRNA are read sequentially without overlap, each specifying one of the 20 standard amino acids or serving as a signal for translation termination. This code exhibits degeneracy, meaning that most amino acids are encoded by multiple synonymous codons (up to six for some, like leucine), which provides redundancy and robustness against certain mutations. The start codon AUG universally initiates translation by coding for N-formylmethionine in prokaryotes or methionine in eukaryotes, while the code's triplet nature was experimentally established through frame-shift mutagenesis and in vitro decoding studies in the 1960s.[13] Translation proceeds in three main phases: initiation, elongation, and termination. During initiation, the small ribosomal subunit binds to the mRNA at the 5' cap (in eukaryotes) or Shine-Dalgarno sequence (in prokaryotes), scans to the AUG start codon, and assembles with the large subunit and initiator tRNA to form the 70S (prokaryotes) or 80S (eukaryotes) initiation complex, aided by initiation factors like eIF2 in eukaryotes. 
Elongation follows, with the ribosome's peptidyl (P) site holding the growing chain and the aminoacyl (A) site accepting the next cognate aminoacyl-tRNA; peptide bond formation occurs via the ribosome's peptidyl transferase activity, transferring the nascent chain to the new amino acid, after which elongation factor-driven translocation moves the ribosome three nucleotides along the mRNA, ejecting the deacylated tRNA from the exit (E) site. Termination is triggered upon arrival of a stop codon (UAA, UAG, or UGA) in the A site, which is recognized by release factors (e.g., RF1/RF2 in prokaryotes or eRF1 in eukaryotes), leading to hydrolytic release of the completed polypeptide from the tRNA and dissociation of the ribosomal subunits.[12]
To maintain the fidelity of primary structure formation, several mechanisms ensure accurate codon decoding and amino acid incorporation, with overall translation error rates held to approximately 1 in 10^4 amino acids. Aminoacyl-tRNA synthetases (aaRSs) play a central role by catalyzing the specific attachment of amino acids to their cognate tRNAs, achieving initial specificity through active site recognition but relying on proofreading (editing) domains to hydrolyze misactivated aminoacyl-adenylates or misacylated tRNAs, reducing error rates from a potential 1 in 200 misactivations to 1 in 10^4 or lower. Additional fidelity checks occur at the ribosome, including induced fit conformational changes that discriminate against near-cognate tRNAs and kinetic proofreading during GTP hydrolysis by elongation factors, collectively minimizing mistranslation that could disrupt protein function.[14]
Chemical Synthesis
Chemical synthesis of protein primary structures enables the laboratory assembly of polypeptides with defined sequences, distinct from biological processes. The cornerstone method is solid-phase peptide synthesis (SPPS), introduced by Robert Bruce Merrifield in 1963, which facilitates the stepwise construction of peptide chains anchored to an insoluble resin support.[15] This approach allows for automated synthesis, where amino acids are added sequentially from the C-terminus to the N-terminus, enabling precise control over the primary structure.[16] In SPPS, protected amino acids—typically with N-terminal Boc or Fmoc groups and side-chain protections—are employed to prevent unwanted reactions. The process involves iterative cycles of activation, coupling, and deprotection. Activation converts the carboxyl group of the incoming amino acid into a reactive species, often using carbodiimides such as dicyclohexylcarbodiimide (DCC) to form an O-acylisourea intermediate, which promotes efficient amide bond formation.[17] Coupling attaches this activated amino acid to the free N-terminal amine of the resin-bound peptide chain, typically achieving per-step yields exceeding 99% under optimized conditions. Deprotection then removes the N-terminal protecting group—e.g., via acid treatment for Boc or base for Fmoc—exposing the amine for the next cycle, while the resin facilitates easy separation of byproducts through filtration.[18] Upon completion, the peptide is cleaved from the resin (e.g., using hydrogen fluoride for Boc chemistry) and purified, commonly by reversed-phase high-performance liquid chromatography (HPLC) to isolate the target sequence with high purity.[16] Despite its efficiency, SPPS has practical limitations. 
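As a rough illustration of how the per-step coupling yield quoted above compounds over a whole synthesis, the sketch below multiplies a fixed yield across every coupling. The function name `overall_yield` is hypothetical. The model also assumes that the first residue is already loaded on the resin (so an n-residue peptide needs n − 1 couplings) and that deprotection and cleavage are lossless, which real syntheses are not.

```python
def overall_yield(per_step_yield: float, n_residues: int) -> float:
    """Fraction of chains with the full, correct sequence after all couplings."""
    n_couplings = n_residues - 1  # first residue assumed pre-loaded on the resin
    return per_step_yield ** n_couplings


for n in (10, 50, 100):
    print(n, round(overall_yield(0.99, n), 3))
```

With a 99% per-step yield this simple model gives roughly 61% full-length product at 50 residues and roughly 37% at 100 residues, before any other losses are counted.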
The cumulative effect of incomplete couplings leads to a practical length limit of roughly 50-100 residues, beyond which overall yields drop significantly due to side reactions and aggregation on the resin.[19] Racemization, the partial conversion of L-amino acids to D-isomers during activation and coupling, poses another risk, particularly with certain residues like cysteine or serine, necessitating careful selection of reagents and conditions to keep racemization below 1%.[20]
SPPS has transformative applications in producing therapeutic peptides with custom primary structures. For instance, oxytocin, a nonapeptide hormone, was among the early successes synthesized via SPPS in the late 1960s, demonstrating the method's viability for biologically active molecules now used in clinical settings for labor induction and postpartum hemorrhage treatment.[21] This capability has expanded to over 100 FDA-approved peptide drugs as of 2024, underscoring SPPS's role in pharmaceutical development.[22]
Determination
Classical Methods
The classical methods for determining protein primary structure relied on chemical labeling, selective hydrolysis, and chromatographic analysis to identify amino acid sequences step by step, primarily developed in the mid-20th century. One foundational approach was end-group labeling, pioneered by Frederick Sanger, which targeted the N-terminal amino acid of a polypeptide chain. In this method, the protein is reacted with 2,4-dinitrofluorobenzene (DNFB), also known as Sanger's reagent, to form a stable dinitrophenyl (DNP) derivative at the free amino group of the N-terminal residue. The labeled protein is then subjected to complete acid hydrolysis, which cleaves all peptide bonds, releasing individual amino acids, including the DNP-labeled N-terminal one, which can be identified and quantified through chromatography due to its distinctive yellow color and solubility properties. This technique allowed determination of the N-terminal residue but was limited to end groups and required additional strategies for internal sequences. To address the need for sequential analysis beyond just end groups, Pehr Edman introduced a degradation method in 1950 that enabled stepwise removal of N-terminal residues from intact peptides. The process involves treating the peptide with phenylisothiocyanate (PITC), which reacts specifically with the N-terminal amino group to form a phenylthiocarbamyl (PTC) derivative. Mild acid treatment then cleaves this derivative as a phenylthiohydantoin (PTH) amino acid, leaving the rest of the peptide chain intact for further cycles of reaction. Each released PTH-amino acid is identified by chromatography, typically paper or thin-layer, allowing sequences of up to 50-60 residues to be determined manually with high specificity, though yields decreased in later cycles due to incomplete reactions. Edman degradation complemented end-group labeling by providing a cyclic, non-destructive way to elucidate longer stretches of the primary structure. 
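The cyclic nature of Edman degradation can be sketched as a simple loop. Each cycle reports the current N-terminal residue and passes the shortened peptide to the next cycle. The function name `edman_degradation` is hypothetical, the example peptide is arbitrary, and the 95% repetitive yield is an assumed figure chosen only to show why the signal fades in later cycles; it is not a value from the source.

```python
def edman_degradation(peptide: str, cycles: int, repetitive_yield: float = 0.95):
    """Simulate stepwise N-terminal removal.

    Returns a list of (cycle, residue, relative_signal) and the remaining peptide.
    repetitive_yield is an assumed per-cycle efficiency, not a measured value.
    """
    calls = []
    remaining = peptide
    signal = 1.0
    for cycle in range(1, cycles + 1):
        if not remaining:
            break                       # nothing left to degrade
        signal *= repetitive_yield      # incomplete reactions reduce the signal
        calls.append((cycle, remaining[0], signal))
        remaining = remaining[1:]       # shortened peptide enters the next cycle
    return calls, remaining


calls, rest = edman_degradation("GIVEQ", cycles=3)
for cycle, residue, signal in calls:
    print(cycle, residue, round(signal, 3))
print("remaining:", rest)
```

Because the relative signal is multiplied by the same factor every cycle, it decays geometrically, which is one way to picture the practical ceiling on read length mentioned above.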
For larger proteins, where direct sequencing of the full chain was impractical, proteolytic digestion with specific enzymes was employed to fragment the polypeptide into smaller, overlapping peptides whose sequences could be individually determined and then assembled. Enzymes like trypsin, which cleaves peptide bonds after lysine and arginine residues, or chymotrypsin, which targets aromatic amino acids, were used to generate predictable fragments. These peptides were separated by chromatography or electrophoresis, sequenced using end-group or Edman methods, and aligned based on overlaps from multiple digests with different enzymes or partial acid hydrolysis. This overlap strategy was essential for reconstructing the complete sequence, as it resolved ambiguities in fragment order.
These methods culminated in the first complete determination of a protein's primary structure: the sequencing of bovine insulin by Sanger's group in the early 1950s. Insulin, a 51-residue hormone with two disulfide-linked chains, was oxidized to separate the A (21 residues) and B (30 residues) chains, then fragmented using trypsin, chymotrypsin, and partial acid hydrolysis to yield over 50 peptides. Sequencing these via DNP labeling and chromatography revealed the exact order, including the positions of the two interchain and one intrachain disulfide bonds, confirming that proteins possess a defined linear sequence of amino acids. This landmark achievement, published in 1951 for the B chain and 1953 for the A chain, established the genetic specificity of protein structure and earned Sanger the 1958 Nobel Prize in Chemistry.
Modern Techniques
Modern techniques for determining protein primary structure have advanced significantly since the late 20th century, enabling high-throughput analysis of complex proteomes and direct sequencing of peptides. These methods leverage mass spectrometry, genomic sequencing, and computational tools to achieve greater speed, sensitivity, and scalability compared to earlier approaches, often integrating multiple technologies in proteomics pipelines.[23] Mass spectrometry (MS) stands as a cornerstone of contemporary protein sequencing, particularly through tandem MS (MS/MS), which fragments peptides to generate sequence-specific ions for identification. In MS/MS workflows, proteins are digested into peptides, ionized, and subjected to collision-induced dissociation or other fragmentation techniques to produce daughter ions whose mass-to-charge ratios reveal amino acid order via database matching or de novo sequencing algorithms. This approach excels in resolving ambiguous sequences and handling mixtures, with de novo sequencing particularly useful for novel proteins lacking genomic references. Electrospray ionization (ESI) and matrix-assisted laser desorption/ionization (MALDI) serve as key ionization methods; ESI produces multiply charged ions suitable for online coupling with liquid chromatography (LC), while MALDI generates singly charged ions ideal for imaging and high-molecular-weight analysis. ESI's soft ionization preserves labile modifications, enabling detection of post-translational modifications (PTMs) alongside primary sequence.[24][25][26] Next-generation sequencing (NGS) provides an indirect yet powerful route to protein primary structure by determining DNA or RNA sequences, which are translated into amino acid sequences using the genetic code. 
NGS platforms, such as those from Illumina or Ion Torrent, parallelize millions of sequencing reads to assemble genomes or transcriptomes rapidly, allowing inference of coding regions (exons) and their codon-based protein products. This method is especially valuable for organisms with sequenced genomes, where proteome-wide sequences can be predicted ab initio, though it requires validation against direct protein data to account for splicing variants or errors. For example, NGS-enabled whole-genome sequencing has facilitated the annotation of proteomes in model organisms like humans, revealing over 20,000 protein-coding genes.[27][28]
Computational prediction tools complement experimental methods: by predicting three-dimensional structures and assessing fold stability, they help validate the plausibility of primary sequences. AlphaFold, developed by DeepMind, uses deep learning on evolutionary multiple sequence alignments to predict three-dimensional protein structures from input amino acid sequences, thereby evaluating whether the sequence is consistent with biophysical constraints. While not a direct sequencing tool, AlphaFold aids validation in cases of sequencing ambiguity, such as distinguishing isoforms, by scoring how well variants fold into stable structures; for instance, it has achieved high accuracy, with median all-atom RMSD of about 1.5 Å in benchmarks, for the majority of human proteins. Limitations include reliance on known sequences for input and reduced performance for disordered regions or novel folds.[29][30]
Emerging techniques as of 2025 include nanopore-based protein sequencing, which enables direct, single-molecule analysis of polypeptide chains by detecting ionic current changes as amino acids pass through a nanopore. 
Combined with AI for signal interpretation, these methods offer potential for label-free, high-throughput sequencing of native proteins, addressing limitations of digestion-based approaches.[31]

Proteomics workflows integrate these techniques for large-scale primary structure determination, with liquid chromatography-tandem mass spectrometry (LC-MS/MS) as the gold standard for bottom-up analysis. In a typical LC-MS/MS pipeline, proteins are extracted, reduced, alkylated, and enzymatically digested (e.g., with trypsin) into peptides, which are separated by reversed-phase LC before ESI ionization and MS/MS fragmentation. Spectral data are searched against databases such as UniProt using tools such as Mascot or MaxQuant to identify peptides and assemble them into protein sequences, achieving coverage of 5,000–10,000 proteins per run in complex samples. These workflows also detect PTMs, such as phosphorylation or glycosylation, by identifying mass shifts in fragment ions, with neutral loss scans aiding site localization. High-resolution instruments such as Orbitrap analyzers provide sub-ppm mass accuracy, enabling confident de novo sequencing even for PTM-bearing peptides.[23][32]

Representation and Notation
Sequence Notation
Protein primary sequences are conventionally written from the N-terminus to the C-terminus, reflecting the direction of polypeptide chain synthesis in biological systems.[33] This left-to-right notation in linear text representations ensures consistency across scientific literature and databases.[34]

Two primary systems exist for denoting amino acids in sequences: the three-letter code, which uses abbreviated names like Ala for alanine, and the one-letter code, which employs single characters such as A for alanine.[35] The one-letter code is preferred for compact representation of long sequences, while the three-letter code offers greater readability for shorter segments or when emphasizing specific residues.[35] These abbreviations are standardized by the International Union of Pure and Applied Chemistry (IUPAC) and the International Union of Biochemistry and Molecular Biology (IUBMB).[35] The IUPAC-IUBMB recommendations specify codes for the 20 standard proteinogenic amino acids, as well as non-standard ones incorporated in some proteins, such as selenocysteine (denoted Sec or U) and pyrrolysine (Pyl or O).[34] Below is a table of the standard abbreviations:

| Amino Acid | Three-Letter Code | One-Letter Code |
|---|---|---|
| Alanine | Ala | A |
| Arginine | Arg | R |
| Asparagine | Asn | N |
| Aspartic acid | Asp | D |
| Cysteine | Cys | C |
| Glutamine | Gln | Q |
| Glutamic acid | Glu | E |
| Glycine | Gly | G |
| Histidine | His | H |
| Isoleucine | Ile | I |
| Leucine | Leu | L |
| Lysine | Lys | K |
| Methionine | Met | M |
| Phenylalanine | Phe | F |
| Proline | Pro | P |
| Serine | Ser | S |
| Threonine | Thr | T |
| Tryptophan | Trp | W |
| Tyrosine | Tyr | Y |
| Valine | Val | V |
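
The two notations in the table are interchangeable, and converting between them is a simple lookup. The following Python sketch illustrates this under a few assumptions: the dictionary covers the 20 standard residues above plus selenocysteine (Sec/U) and pyrrolysine (Pyl/O), three-letter sequences are written with a hyphen separator, and the names `THREE_TO_ONE`, `to_three_letter`, and `to_one_letter` are hypothetical helpers introduced for this example.

```python
# Three-letter to one-letter codes: the 20 standard amino acids from the
# table, plus selenocysteine (Sec/U) and pyrrolysine (Pyl/O).
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
    "Sec": "U", "Pyl": "O",
}
# Reverse mapping, built from the table above so the two stay in sync.
ONE_TO_THREE = {one: three for three, one in THREE_TO_ONE.items()}


def to_three_letter(sequence: str, sep: str = "-") -> str:
    """Convert a one-letter sequence (N- to C-terminus) to three-letter codes.

    Raises KeyError for any character that is not a known residue code.
    """
    return sep.join(ONE_TO_THREE[residue] for residue in sequence.upper())


def to_one_letter(sequence: str, sep: str = "-") -> str:
    """Convert a separator-delimited three-letter sequence to one-letter codes.

    Raises KeyError for any code that is not a known residue code.
    """
    return "".join(THREE_TO_ONE[code.capitalize()] for code in sequence.split(sep))


if __name__ == "__main__":
    print(to_three_letter("MAK"))
    print(to_one_letter("Met-Ala-Lys"))
```

Raising an error on unknown symbols is a deliberate choice in this sketch; ambiguity codes (such as those for mixtures of residues) are not handled here and would need to be added explicitly.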