Protein sequencing
from Wikipedia
Using a Beckman-Spinco Protein-Peptide Sequencer, 1970

Protein sequencing is the practical process of determining the amino acid sequence of all or part of a protein or peptide. This may serve to identify the protein or characterize its post-translational modifications. Typically, partial sequencing of a protein provides sufficient information (one or more sequence tags) to identify it with reference to databases of protein sequences derived from the conceptual translation of genes.

The two major direct methods of protein sequencing are mass spectrometry and Edman degradation using a protein sequenator (sequencer). Mass spectrometry methods are now the most widely used for protein sequencing and identification but Edman degradation remains a valuable tool for characterizing a protein's N-terminus.

Determining amino acid composition

Protein sequence interpretation: a scheme for a new protein to be engineered in yeast

It is often desirable to know the unordered amino acid composition of a protein prior to attempting to find the ordered sequence, as this knowledge can be used to facilitate the discovery of errors in the sequencing process or to distinguish between ambiguous results. Knowledge of the frequency of certain amino acids may also be used to choose which protease to use for digestion of the protein. The misincorporation of low levels of non-standard amino acids (e.g. norleucine) into proteins may also be determined.[1] A generalized method often referred to as amino acid analysis[2] for determining amino acid frequency is as follows:

  1. Hydrolyse a known quantity of protein into its constituent amino acids.
  2. Separate and quantify the amino acids in some way.

Hydrolysis

Hydrolysis is done by heating a sample of the protein in 6 M hydrochloric acid to 100–110 °C for 24 hours or longer. Proteins with many bulky hydrophobic groups may require longer heating periods. However, these conditions are so vigorous that some amino acids (serine, threonine, tyrosine, tryptophan, glutamine, and cysteine) are degraded. To circumvent this problem, Biochemistry Online suggests heating separate samples for different times, analysing each resulting solution, and extrapolating back to zero hydrolysis time. Rastall suggests a variety of reagents to prevent or reduce degradation, such as thiol reagents or phenol to protect tryptophan and tyrosine from attack by chlorine, and pre-oxidising cysteine. He also suggests measuring the quantity of ammonia evolved to determine the extent of amide hydrolysis.
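
The extrapolation back to zero hydrolysis time can be sketched as a straight-line fit. This is only an illustration, assuming losses are roughly linear over the sampled times; the function name `extrapolate_to_zero` and the serine recoveries are hypothetical and not taken from the cited sources.

```python
def extrapolate_to_zero(times_h, recoveries):
    """Fit a least-squares line to (time, recovery) and return its value at t = 0."""
    n = len(times_h)
    mean_t = sum(times_h) / n
    mean_r = sum(recoveries) / n
    sxx = sum((t - mean_t) ** 2 for t in times_h)
    sxy = sum((t - mean_t) * (r - mean_r) for t, r in zip(times_h, recoveries))
    slope = sxy / sxx
    return mean_r - slope * mean_t  # intercept = estimated amount before any loss


# Hypothetical serine recoveries (nmol) from separate samples heated 24, 48 and 72 h
times = [24.0, 48.0, 72.0]
serine = [9.0, 8.0, 7.0]
serine_at_zero = extrapolate_to_zero(times, serine)
print(round(serine_at_zero, 3))
```

If degradation is closer to first-order, fitting the logarithm of the recovery against time would be the more appropriate variant.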

Separation and quantitation

The amino acids can be separated by ion-exchange chromatography then derivatized to facilitate their detection. More commonly, the amino acids are derivatized then resolved by reversed phase HPLC.

An example of ion-exchange chromatography is given by the NTRC using sulfonated polystyrene as a matrix, adding the amino acids in acid solution and passing a buffer of steadily increasing pH through the column. Amino acids are eluted when the pH reaches their respective isoelectric points. Once the amino acids have been separated, their respective quantities are determined by adding a reagent that will form a coloured derivative. If the amounts of amino acids are in excess of 10 nmol, ninhydrin can be used for this; it gives a yellow colour when reacted with proline, and a vivid purple with other amino acids. The concentration of amino acid is proportional to the absorbance of the resulting solution. With very small quantities, down to 10 pmol, fluorescent derivatives can be formed using reagents such as ortho-phthalaldehyde (OPA) or fluorescamine.

Pre-column derivatization may use the Edman reagent to produce a derivative that is detected by UV light. Greater sensitivity is achieved using a reagent that generates a fluorescent derivative. The derivatized amino acids are subjected to reversed phase chromatography, typically using a C8 or C18 silica column and an optimised elution gradient. The eluting amino acids are detected using a UV or fluorescence detector and the peak areas compared with those for derivatised standards in order to quantify each amino acid in the sample.
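
The comparison of peak areas with derivatised standards amounts to a simple proportional calculation. The sketch below assumes a linear detector response and a single-point calibration; the function name `quantify` and all peak areas are hypothetical.

```python
def quantify(sample_areas, standard_areas, standard_pmol):
    """Convert sample peak areas to pmol using one standard injection per amino acid."""
    amounts = {}
    for amino_acid, area in sample_areas.items():
        response = standard_areas[amino_acid] / standard_pmol  # peak area per pmol
        amounts[amino_acid] = area / response
    return amounts


# Hypothetical peak areas: 100 pmol of each standard, then the unknown sample
standards = {"Ala": 2000.0, "Gly": 1500.0}
sample = {"Ala": 1000.0, "Gly": 3000.0}
amounts = quantify(sample, standards, standard_pmol=100.0)
print(amounts)
```

In practice a multi-point calibration curve per amino acid is usually preferred over a single standard injection.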

N-terminal amino acid analysis

Sanger's method of peptide end-group analysis: A derivatization of N-terminal end with Sanger's reagent (DNFB), B total acid hydrolysis of the dinitrophenyl peptide

Determining which amino acid forms the N-terminus of a peptide chain is useful for two reasons: to aid the ordering of individual peptide fragments' sequences into a whole chain, and because the first round of Edman degradation is often contaminated by impurities and therefore does not give an accurate determination of the N-terminal amino acid. A generalised method for N-terminal amino acid analysis follows:

  1. React the peptide with a reagent that will selectively label the terminal amino acid.
  2. Hydrolyse the protein.
  3. Determine the amino acid by chromatography and comparison with standards.

There are many different reagents which can be used to label terminal amino acids. They all react with amine groups and will therefore also bind to amine groups in the side chains of amino acids such as lysine; for this reason it is necessary to be careful in interpreting chromatograms to ensure that the right spot is chosen. Two of the more common reagents are Sanger's reagent (1-fluoro-2,4-dinitrobenzene) and dansyl derivatives such as dansyl chloride. Phenylisothiocyanate, the reagent for the Edman degradation, can also be used. The same questions apply here as in the determination of amino acid composition, with the exception that no stain is needed, as the reagents produce coloured derivatives and only qualitative analysis is required. The amino acid therefore does not have to be eluted from the chromatography column, just compared with a standard. Another consideration is that, since any amine groups will have reacted with the labelling reagent, ion-exchange chromatography cannot be used, and thin-layer chromatography or high-pressure liquid chromatography should be used instead.

C-terminal amino acid analysis

Far fewer methods are available for C-terminal amino acid analysis than for N-terminal analysis. The most common method is to add carboxypeptidases to a solution of the protein, take samples at regular intervals, and determine the terminal amino acid by analysing a plot of amino acid concentrations against time. This method is particularly useful for polypeptides and for proteins with blocked N-termini. C-terminal sequencing would greatly help in verifying the primary structures of proteins predicted from DNA sequences and in detecting any post-translational processing of gene products from known codon sequences.

Edman degradation

The Edman degradation is a very important reaction for protein sequencing, because it allows the ordered amino acid composition of a protein to be discovered. Automated Edman sequencers are now in widespread use, and are able to sequence peptides up to approximately 50 amino acids long. A reaction scheme for sequencing a protein by the Edman degradation follows; some of the steps are elaborated on subsequently.

  1. Break any disulfide bridges in the protein with a reducing agent like 2-mercaptoethanol. A protecting group such as iodoacetic acid may be necessary to prevent the bonds from re-forming.
  2. Separate and purify the individual chains of the protein complex, if there is more than one.
  3. Determine the amino acid composition of each chain.
  4. Determine the terminal amino acids of each chain.
  5. Break each chain into fragments under 50 amino acids long.
  6. Separate and purify the fragments.
  7. Determine the sequence of each fragment.
  8. Repeat with a different pattern of cleavage.
  9. Construct the sequence of the overall protein.

Digestion into peptide fragments

Peptides longer than about 50–70 amino acids cannot be sequenced reliably by the Edman degradation. Because of this, long protein chains need to be broken up into small fragments that can then be sequenced individually. Digestion is done either by endopeptidases such as trypsin or pepsin or by chemical reagents such as cyanogen bromide. Different enzymes give different cleavage patterns, and the overlap between fragments can be used to construct an overall sequence.
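
How overlapping fragments fix the order of a chain can be illustrated with a brute-force sketch. It assumes every fragment has been sequenced completely and without error, and it only scales to a handful of fragments; the function name `assemble` and the 11-residue example are hypothetical. The two fragment sets correspond to cleavage after Lys/Arg (trypsin) and after Met (cyanogen bromide).

```python
from itertools import permutations


def assemble(fragments_a, fragments_b):
    """Return every ordering of fragments_a whose concatenation contains
    each fragment of fragments_b as a contiguous substring."""
    candidates = set()
    for order in permutations(fragments_a):
        chain = "".join(order)
        if all(fragment in chain for fragment in fragments_b):
            candidates.add(chain)
    return sorted(candidates)


# Hypothetical unordered fragments from two digests of the same chain
tryptic = ["LVR", "TMEW", "GAMK"]   # cleaved after K or R
cnbr = ["EW", "GAM", "KLVRTM"]      # cleaved after M
chains = assemble(tryptic, cnbr)
print(chains)
```

Here the cyanogen bromide fragment `KLVRTM` spans both tryptic cleavage sites, which is what pins down a single ordering.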

Reaction

The peptide to be sequenced is adsorbed onto a solid surface. One common substrate is glass fibre coated with polybrene, a cationic polymer. The Edman reagent, phenylisothiocyanate (PITC), is added to the adsorbed peptide, together with a mildly basic buffer solution of 12% trimethylamine. This reacts with the amine group of the N-terminal amino acid.

The terminal amino acid can then be selectively detached by the addition of anhydrous acid. The derivative then isomerises to give a substituted phenylthiohydantoin, which can be washed off and identified by chromatography, and the cycle can be repeated. The efficiency of each step is about 98%, which allows about 50 amino acids to be reliably determined.
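
The link between a per-step efficiency of about 98% and a practical limit near 50 residues follows from compounding. The sketch below treats the 98% figure as the fraction of chains that complete each whole cycle, which is a simplifying assumption; the function name `cumulative_yield` is hypothetical.

```python
def cumulative_yield(step_efficiency, cycles):
    """Fraction of the starting peptide still in phase after the given number of cycles."""
    return step_efficiency ** cycles


for cycles in (10, 30, 50, 100):
    # Print the surviving in-phase fraction to three decimals for each read length
    print(cycles, round(cumulative_yield(0.98, cycles), 3))
```

After 50 cycles roughly a third of the chains are still in phase, and after 100 cycles only a little over a tenth, so the signal from the correct residue is progressively swamped by out-of-phase background.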

A Beckman-Coulter Porton LF3000G protein sequencing machine

Protein sequencer

A protein sequenator[3] is a machine that performs Edman degradation in an automated manner. A sample of the protein or peptide is immobilized in the reaction vessel of the protein sequenator and the Edman degradation is performed. Each cycle releases and derivatises one amino acid from the protein or peptide's N-terminus and the released amino-acid derivative is then identified by HPLC. The cycle is repeated either until the entire measurable sequence is established or for a pre-determined number of cycles.

Identification by mass spectrometry

Protein identification is the process of assigning a name to a protein of interest (POI), based on its amino-acid sequence. Typically, only part of the protein’s sequence needs to be determined experimentally in order to identify the protein with reference to databases of protein sequences deduced from the DNA sequences of their genes. Further protein characterization may include confirmation of the actual N- and C-termini of the POI, determination of sequence variants and identification of any post-translational modifications present.

Proteolytic digests

A general scheme for protein identification is described.[4][5]

  1. The POI is isolated, typically by SDS-PAGE or chromatography.
  2. The isolated POI may be chemically modified to stabilise cysteine residues (e.g. S-amidomethylation or S-carboxymethylation).
  3. The POI is digested with a specific protease to generate peptides. Trypsin, which cleaves selectively on the C-terminal side of lysine or arginine residues, is the most commonly used protease. Its advantages include i) the frequency of Lys and Arg residues in proteins, ii) the high specificity of the enzyme, iii) the stability of the enzyme and iv) the suitability of tryptic peptides for mass spectrometry.
  4. The peptides may be desalted to remove ionizable contaminants and subjected to MALDI-TOF mass spectrometry. Direct measurement of the masses of the peptides may provide sufficient information to identify the protein (see Peptide mass fingerprinting) but further fragmentation of the peptides inside the mass spectrometer is often used to gain information about the peptides’ sequences. Alternatively, peptides may be desalted and separated by reversed phase HPLC and introduced into a mass spectrometer via an ESI source. LC-ESI-MS may provide more information than MALDI-MS for protein identification but uses more instrument time.
  5. Depending on the type of mass spectrometer, fragmentation of peptide ions may occur via a variety of mechanisms such as collision-induced dissociation (CID) or post-source decay (PSD). In each case, the pattern of fragment ions of a peptide provides information about its sequence.
  6. Information including the measured mass of the putative peptide ions and those of their fragment ions is then matched against calculated mass values from the conceptual (in-silico) proteolysis and fragmentation of databases of protein sequences. A successful match will be found if its score exceeds a threshold based on the analysis parameters. Even if the actual protein is not represented in the database, error-tolerant matching allows for the putative identification of a protein based on similarity to homologous proteins. A variety of software packages are available to perform this analysis.
  7. Software packages usually generate a report showing the identity (accession code) of each identified protein, its matching score, and a measure of the relative strength of the matching where multiple proteins are identified.
  8. A diagram of the matched peptides on the sequence of the identified protein is often used to show the sequence coverage (% of the protein detected as peptides). Where the POI is thought to be significantly smaller than the matched protein, the diagram may suggest whether the POI is an N- or C-terminal fragment of the identified protein.
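
Steps 3 to 6 can be sketched as an in-silico tryptic digest followed by mass matching. This is a toy illustration only: it cleaves after every Lys or Arg, ignores missed cleavages and modifications, and scores by counting matches rather than by the statistical scoring that real search software uses. The function names, the two-entry database and the observed masses are all hypothetical; the residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses (Da) and the mass of water
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_peptides(sequence):
    """Cleave on the C-terminal side of every K or R (simplified rule)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        if residue in "KR":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass: residues plus one water."""
    return sum(MONO[residue] for residue in peptide) + WATER


def score(observed, sequence, tol=0.01):
    """Count observed masses that match any theoretical tryptic peptide."""
    theoretical = [peptide_mass(p) for p in tryptic_peptides(sequence)]
    return sum(any(abs(m - t) <= tol for t in theoretical) for m in observed)


database = {"protA": "GAMKLVRTMEW", "protB": "AAKGGRLL"}  # hypothetical entries
observed = [405.2046, 386.2641]                            # hypothetical neutral masses
scores = {name: score(observed, seq) for name, seq in database.items()}
best = max(scores, key=scores.get)
print(scores, best)
```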

De novo sequencing

The pattern of fragmentation of a peptide allows for direct determination of its sequence by de novo sequencing. This sequence may be used to match databases of protein sequences or to investigate post-translational or chemical modifications. It may provide additional evidence for protein identifications performed as above.
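
The idea behind reading a sequence from a fragmentation pattern is that consecutive ions of one series differ by the mass of a single residue. The sketch below assumes a complete, clean ladder of singly charged ions from one series, which real spectra rarely provide; the function name `read_ladder` and the ion values are hypothetical. The first ion of the ladder identifies its own residue separately, so only the residues after it are read from the differences.

```python
# Monoisotopic residue masses (Da); leucine and isoleucine are identical
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def read_ladder(ion_mz, tol=0.01):
    """Map each mass difference between consecutive ions to the matching residue(s)."""
    residues = []
    for lighter, heavier in zip(ion_mz, ion_mz[1:]):
        delta = heavier - lighter
        matches = [aa for aa, mass in RESIDUE_MASS.items() if abs(delta - mass) <= tol]
        residues.append("/".join(matches) if matches else "?")
    return residues


# Hypothetical ladder of four consecutive singly charged ions
ladder = [58.02874, 129.06585, 242.14991, 370.24487]
calls = read_ladder(ladder)
print(calls)
```

The middle call comes back as `L/I`, reflecting that isomeric residues cannot be told apart by mass alone.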

N- and C-termini

The peptides matched during protein identification do not necessarily include the N- or C-termini predicted for the matched protein. This may result from the N- or C-terminal peptides being difficult to identify by MS (e.g. being either too short or too long), being post-translationally modified (e.g. N-terminal acetylation) or genuinely differing from the prediction. Post-translational modifications or truncated termini may be identified by closer examination of the data (i.e. de novo sequencing). A repeat digest using a protease of different specificity may also be useful.

Post-translational modifications

Whilst detailed comparison of the MS data with predictions based on the known protein sequence may be used to define post-translational modifications, targeted approaches to data acquisition may also be used. For instance, specific enrichment of phosphopeptides may assist in identifying phosphorylation sites in a protein. Alternative methods of peptide fragmentation in the mass spectrometer, such as ETD or ECD, may give complementary sequence information.

Whole-mass determination

The protein’s whole mass is the sum of the masses of its amino-acid residues plus the mass of a water molecule, adjusted for any post-translational modifications. Although proteins ionize less well than the peptides derived from them, a protein in solution can often be analysed by ESI-MS and its mass measured to an accuracy of 1 part in 20,000 or better. This is often sufficient to confirm the termini (by checking that the protein’s measured mass matches that predicted from its sequence) and to infer the presence or absence of many post-translational modifications.
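
The whole-mass calculation and the 1-in-20,000 comparison can be sketched as follows. Average (not monoisotopic) residue masses are used here on the assumption that an intact protein is measured as an unresolved isotope envelope; the function names, the short test sequence and the measured value are hypothetical, and the acetylation shift of 42.0367 Da is the average mass of the added C2H2O group.

```python
# Average residue masses (Da) and the average mass of water
AVERAGE = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.0153
ACETYL = 42.0367  # average mass shift for one acetyl group


def protein_mass(sequence, modification_shifts=()):
    """Sum of residue masses plus one water, adjusted by any modification shifts."""
    return sum(AVERAGE[residue] for residue in sequence) + WATER + sum(modification_shifts)


def agrees(measured, predicted, relative_tol=1 / 20000):
    """True if the measured mass is within the stated relative accuracy of the prediction."""
    return abs(measured - predicted) <= relative_tol * predicted


sequence = "GAMKLVRTMEW"                         # hypothetical chain
unmodified = protein_mass(sequence)
acetylated = protein_mass(sequence, [ACETYL])
measured = acetylated + 0.003                    # hypothetical measurement
print(agrees(measured, unmodified), agrees(measured, acetylated))
```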

Limitations

Proteolysis does not always yield a set of readily analyzable peptides covering the entire sequence of the POI. The fragmentation of peptides in the mass spectrometer often does not yield ions corresponding to cleavage at each peptide bond. Thus, the deduced sequence for each peptide is not necessarily complete. The standard methods of fragmentation do not distinguish between leucine and isoleucine residues since they are isomeric.

Because the Edman degradation proceeds from the N-terminus of the protein, it will not work if the N-terminus has been chemically modified (e.g. by acetylation or formation of pyroglutamic acid). Edman degradation is generally not useful for determining the positions of disulfide bridges. It also requires peptide amounts of 1 picomole or above for discernible results, making it less sensitive than mass spectrometry.

Predicting from DNA/RNA sequences

In biology, proteins are produced by translation of messenger RNA (mRNA) with the protein sequence deriving from the sequence of codons in the mRNA. The mRNA is itself formed by the transcription of genes and may be further modified. These processes are sufficiently understood to use computer algorithms to automate predictions of protein sequences from DNA sequences, such as from whole-genome DNA-sequencing projects, and have led to the generation of large databases of protein sequences such as UniProt. Predicted protein sequences are an important resource for protein identification by mass spectrometry.
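
The conceptual translation of a coding sequence can be sketched with the standard genetic code. This minimal version assumes the input is already an intron-free coding sequence in the correct reading frame, which gene-prediction software has to establish first; the function name `translate` and the example sequence are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop codon
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}


def translate(coding_dna):
    """Translate codon by codon from position 0, stopping at the first stop codon."""
    protein = []
    dna = coding_dna.upper()
    for pos in range(0, len(dna) - 2, 3):
        residue = CODON_TABLE[dna[pos:pos + 3]]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


print(translate("ATGGGTGCTAAATAA"))
```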

Historically, short protein sequences (10 to 15 residues) determined by Edman degradation were back-translated into DNA sequences that could be used as probes or primers to isolate molecular clones of the corresponding gene or complementary DNA. The sequence of the cloned DNA was then determined and used to deduce the full amino-acid sequence of the protein.

Bioinformatics tools

Bioinformatics tools exist to assist with interpretation of mass spectra (see de novo peptide sequencing), to compare or analyze protein sequences (see sequence analysis), or search databases using peptide or protein sequences (see BLAST).

Applications to cryptography

The difficulty of protein sequencing was recently proposed as a basis for creating k-time programs, programs that run exactly k times before self-destructing. Such a thing is impossible to build purely in software because all software is inherently clonable an unlimited number of times.

from Grokipedia
Protein sequencing is the process of determining the precise order of amino acids in a protein or peptide chain, which is fundamental to elucidating its three-dimensional structure, biological function, interactions with other molecules, and role in cellular processes. This technique underpins the field of proteomics, enabling the identification of proteins in complex biological samples, the study of post-translational modifications, and applications in areas such as diagnostics. Unlike DNA sequencing, which benefits from the genetic code's redundancy, protein sequencing directly reads the primary sequence without inferring it from nucleic acids, making it indispensable for validating predictions and analyzing non-genomic variations.

The history of protein sequencing began in the early 20th century with initial efforts to analyze amino acid composition through hydrolysis, but the first complete sequence of a protein, insulin, was achieved by Frederick Sanger in the early 1950s using a combination of enzymatic and acid hydrolysis followed by chromatographic separation and identification of fragments. This breakthrough, which demonstrated that proteins have defined rather than random structures, earned Sanger the Nobel Prize in Chemistry in 1958. In 1950, Pehr Edman introduced the Edman degradation method, a chemical process that selectively cleaves and identifies the N-terminal amino acid of a peptide using phenylisothiocyanate, allowing up to 50-60 residues to be sequenced iteratively with high accuracy. In the 1980s, tandem mass spectrometry (MS/MS) emerged as a complementary tool, initially coupled with Edman sequencing, but it soon surpassed it due to its ability to handle smaller samples and generate sequence information from fragmentation patterns.

Traditional protein sequencing methods, such as Edman degradation, require purified proteins and are limited to linear N-terminal reading, making them labor-intensive and unsuitable for high-throughput analysis of complex mixtures. Liquid chromatography-tandem mass spectrometry (LC-MS/MS) has dominated since the 1990s, serving as the gold standard for proteomics by ionizing peptides, fragmenting them via collision-induced dissociation, and inferring sequences from mass-to-charge ratios, with sensitivities reaching femtomolar levels. However, LC-MS/MS faces challenges including limited dynamic range (typically 10^4 to 10^5), reliance on database matching for identification, and the need for extensive sample preparation, which can introduce biases.

In recent decades, the field has shifted toward next-generation approaches emphasizing single-molecule resolution to overcome these limitations and enable de novo sequencing without prior genomic knowledge. Emerging technologies include fluorosequencing, which adapts Edman degradation with fluorescent labeling for optical detection of amino acids at the single-molecule level, and nanopore-based methods, where proteins are unfolded and translocated through a nanopore to generate electrical signals distinguishing the 20 amino acids based on current blockades or dwell times. As of 2025, advances in AI-driven analysis and related technologies have further improved de novo sequencing accuracy and throughput. These innovations promise portability, lower costs, and the ability to sequence intact proteins or low-abundance species directly from single cells, though they still grapple with issues like uniform translocation control and discrimination accuracy. Ongoing developments in these areas are poised to transform protein sequencing into a routine tool comparable to DNA sequencing.

History and Fundamentals

Historical Development

The early development of protein sequencing began with foundational work on peptide chemistry in the early 20th century. In 1901, Emil Fischer synthesized the first dipeptide, glycylglycine, through partial hydrolysis of glycine diketopiperazine, and extended these experiments to analyze protein composition by hydrolyzing polypeptides into constituent amino acids, establishing the basis for understanding peptide linkages in proteins.

Advances accelerated in the mid-20th century with the development of end-group analysis techniques in the 1940s and 1950s, which allowed identification of terminal amino acids in polypeptide chains. British biochemist Frederick Sanger pioneered this approach using 2,4-dinitrofluorobenzene (DNFB) to label N-terminal residues, enabling their separation and quantification. Building on this, Sanger combined end-group labeling with partial acid hydrolysis to determine the complete sequence of insulin in the early 1950s, revealing its two-chain structure linked by disulfide bonds, a breakthrough that demonstrated proteins possess defined, genetically encoded sequences. For this work, Sanger received the 1958 Nobel Prize in Chemistry.

A pivotal milestone came in 1950 with the introduction of Edman degradation by Pehr Edman, a cyclic chemical method that sequentially removes and identifies N-terminal amino acids from peptides without disrupting the remaining chain, greatly improving sequencing efficiency over partial hydrolysis. In the 1960s, mass spectrometry emerged as a complementary tool for protein sequencing, with early applications by Klaus Biemann in 1966 enabling the analysis of oligopeptides through fragmentation patterns, marking the shift toward instrumental methods.

The 1980s and 1990s saw the transition to automated and high-throughput protein sequencing, driven by refinements in Edman degradation such as gas-phase sequencers introduced in the early 1980s, which minimized sample loss and enabled routine analysis of longer polypeptides. By the 1990s, integration with mass spectrometry further boosted throughput, supporting large-scale proteomic studies and paving the way for genome-protein correlations.

Basic Principles and Importance

Protein primary structure refers to the linear sequence of amino acids in a polypeptide chain, where individual amino acids are covalently linked by peptide bonds between the carboxyl group of one and the amino group of the next. This sequence determines the protein's unique identity and serves as the foundation for higher levels of structure, including secondary, tertiary, and quaternary folds that enable biological function. The genetic code, which translates nucleotide sequences in mRNA into protein sequences, specifies 20 standard amino acids using 64 possible codons, with most amino acids encoded by multiple codons to provide redundancy. These amino acids vary in their side chains, conferring diverse chemical properties that influence folding, stability, and interactions.

Determining protein sequences is essential for elucidating protein function, as the primary structure dictates enzymatic activity, binding specificity, and cellular roles. In evolutionary biology, sequence comparisons reveal conservation patterns and divergence, illuminating phylogenetic relationships and adaptive changes. For disease research, sequencing identifies mutations that disrupt function; for instance, a single substitution (glutamic acid to valine at position 6) in the beta-globin chain causes sickle cell anemia by altering hemoglobin's solubility and leading to red blood cell deformation. In drug development, precise sequence knowledge enables targeted therapies, such as monoclonal antibodies or small molecules that bind specific epitopes. Protein sequencing also underpins proteomics, the large-scale study of proteomes, facilitating biomarker discovery and systems-level insights into cellular processes.

Despite its value, protein sequencing faces challenges due to proteins' inherent heterogeneity, where isoforms arise from alternative splicing or genetic variants, complicating uniform analysis. Post-translational modifications (PTMs), such as phosphorylation or glycosylation, add chemical diversity that can obscure sequences and affect function without altering the underlying amino acid sequence. Additionally, proteins range from tens to thousands of amino acids in length (human titin, for example, comprises over 34,000 residues), posing technical hurdles for complete coverage in long chains.

Protein sequencing approaches are broadly classified as direct (de novo) methods, which experimentally determine the amino acid order without prior genomic data, or indirect methods, which predict sequences from DNA/RNA templates or computational models. Direct methods provide empirical validation, especially for novel or modified proteins, while indirect approaches leverage genomic data for efficiency in well-annotated systems.

Amino Acid Composition Analysis

Hydrolysis Techniques

Hydrolysis techniques are essential for determining the amino acid composition of proteins, as they cleave peptide bonds to release free amino acids for subsequent analysis. These methods must balance complete hydrolysis with minimal degradation or modification of labile residues, though no single approach achieves perfect recovery for all 20 standard amino acids. Acid hydrolysis remains the most widely used due to its efficiency, while alternatives address specific limitations such as tryptophan destruction.

Acid hydrolysis typically employs 6 M hydrochloric acid (HCl) at 110°C for 24 hours in sealed, evacuated tubes to prevent oxidation. This condition achieves near-complete cleavage of peptide bonds for most residues, with recoveries of 86–103% for standard proteins such as bovine serum albumin (BSA). However, it fully destroys tryptophan and partially degrades several other labile residues, while converting asparagine and glutamine to aspartic and glutamic acids, respectively. To mitigate oxidation of sulfur-containing amino acids, additives such as 0.4% β-mercaptoethanol are included.

Base hydrolysis, using 4–6 M sodium hydroxide (NaOH) or lithium hydroxide (LiOH) at 110–112°C for 16–22 hours, is primarily employed to preserve tryptophan, for which recoveries are typically 80-100% under optimized conditions compared to none in acid conditions. It is performed in inert atmospheres or with antioxidants like partially hydrolyzed starch to minimize losses, and NaOH and LiOH give similar tryptophan values. This method, however, risks racemization and degradation of other residues and is less suitable for comprehensive composition analysis due to incomplete hydrolysis of certain bonds.

Enzymatic hydrolysis offers milder conditions using proteases such as trypsin, which cleaves at lysine and arginine residues, or broader enzymes like pronase for near-total breakdown. Conducted at 37–50°C and neutral pH for 24–72 hours, it preserves labile amino acids like tryptophan and avoids harsh chemical artifacts, but achieves only partial completeness (e.g., underestimating aspartic and glutamic acids) and is more costly for routine total composition work. It is better suited for generating peptides than free amino acids.

Microwave-assisted hydrolysis accelerates traditional acid methods by applying focused microwave energy to 6 M HCl solutions, reducing processing time to 5–30 minutes at 100–150°C while maintaining high coverage of protein sequences. For instance, it generates up to 1,292 peptides from 2 μg of BSA, enabling faster sample preparation for mass spectrometry-based composition analysis without significant loss in yield compared to conventional 24-hour incubations.

Common artifacts in these techniques include deamidation, where asparagine and glutamine convert to aspartic and glutamic acids during acid or base hydrolysis, leading to overestimation of the latter by up to 100% of the former's content. Racemization, producing D-isomers from L-amino acids (e.g., 1–4% D-Asp formation), occurs via cyclic intermediates under alkaline or prolonged acidic conditions. These modifications necessitate corrections or alternative methods for accurate quantification, often followed by chromatographic separation for residue identification.

Separation and Quantification Methods

Following hydrolysis of proteins into constituent amino acids, separation and quantification methods are essential to determine the molar composition, which serves as a foundational step for inferring sequence information. The classical approach employs ion-exchange chromatography, where amino acids are separated based on their differing affinities for a cation-exchange resin, typically using a gradient of buffers with increasing pH and ionic strength. This method, pioneered by Moore, Stein, and Spackman in 1958, utilizes a single-column, automated system that resolves up to 20 standard amino acids in sequence. Detection occurs post-column via reaction with ninhydrin, producing colored derivatives (purple for most amino acids, yellow for proline) that are quantified spectrophotometrically at 570 nm and 440 nm, respectively. This technique remains a gold standard for its reliability in physiological and protein hydrolysate samples.

An alternative, widely adopted method is reverse-phase high-performance liquid chromatography (RP-HPLC), which offers faster separation and higher throughput compared to ion-exchange. Amino acids are derivatized pre-column to enhance detectability: phenylisothiocyanate (PITC) forms stable phenylthiocarbamyl (PTC) derivatives detected at 254 nm, as described by Heinrikson and Meredith in 1984. Alternatively, o-phthalaldehyde (OPA) reacts with primary amines to yield fluorescent isoindoles, enabling sensitive detection via fluorescence at excitation/emission wavelengths of 340/450 nm, per Jones and Gilligan's 1983 protocol. Separation occurs on a C18 reversed-phase column using an acetonitrile-water gradient, resolving the derivatives in under 30 minutes.

Quantification in both methods relies on peak area integration from chromatograms, calibrated against external standards of known concentrations to generate response factors. This approach achieves a precision of 1-5% relative standard deviation for most amino acids, with internal standards like norleucine correcting for losses or variations.

Modern enhancements include ultra-performance liquid chromatography (UPLC), which employs sub-2 μm particles for superior resolution and reduced analysis time to 10-15 minutes. Coupling with tandem mass spectrometry (LC-MS/MS) provides confirmatory identification via mass-to-charge ratios, improving specificity for isobaric amino acids like leucine and isoleucine. Results are typically reported as molar ratios of each amino acid relative to a reference residue set to 1, facilitating comparison across protein samples and aiding in molecular weight estimation.

Terminal Residue Identification

N-Terminal Analysis

N-terminal analysis focuses on identifying the residue at the free α-amino group of a or chain, providing key insights into protein identity, purity, and processing events such as post-translational modifications. This technique is particularly valuable in early stages of protein characterization, as the N-terminus often reflects the protein's maturation, including cleavage of signal peptides or leader sequences. Unlike total amino acid composition analysis, which yields overall residue frequencies, N-terminal methods target the specific endpoint residue, enabling confirmation of sequence starts in heterogeneous samples. The pioneering chemical approach for N-terminal determination was developed by Frederick Sanger in 1945 using 2,4-dinitrofluorobenzene (DNFB), also known as Sanger's reagent. The method involves reacting the intact protein with DNFB under mildly alkaline conditions, where the reagent selectively couples with the unprotonated α-amino group of the N-terminal residue to form a yellow-colored dinitrophenyl (DNP) derivative. Subsequent complete hydrolysis of the labeled protein with acid (e.g., 6 M HCl) breaks all peptide bonds, liberating the DNP-N-terminal amino acid, which remains intact due to its stability under these conditions, while other amino acids are released in free form. The mixture is then separated by two-dimensional paper chromatography, where the DNP-amino acid is identified by its characteristic Rf value and spot color upon comparison with standards. This technique was instrumental in Sanger's elucidation of insulin's structure, identifying phenylalanine as the N-terminal residue of the B-chain and glycine for the A-chain, marking a milestone in proving proteins have defined sequences. Limitations include its destructive nature, as it consumes the entire protein sample, and challenges with lysine residues, which also react to form ε-DNP-lysine, complicating identification. 
Enzymatic methods offer a milder alternative, using exopeptidases such as aminopeptidase M or other aminopeptidases to release N-terminal amino acids sequentially or selectively. These enzymes catalyze hydrolysis of the peptide bond adjacent to the N-terminus, liberating the free amino acid into solution, where it is identified and quantified by techniques such as reversed-phase high-performance liquid chromatography (HPLC) or post-column derivatization followed by detection. For instance, controlled incubation with an aminopeptidase can release one or a few residues, allowing stepwise analysis, though specificity varies and some enzymes prefer hydrophobic residues. The approach suits native proteins because its mild conditions avoid chemical damage during the initial steps, and it is often combined with inhibitors to limit the depth of digestion. However, it requires an accessible, unblocked N-terminus and can be hindered by secondary structure or by modifications that sterically impede enzyme access.

Mass spectrometry-based N-terminal analysis has become a cornerstone of modern proteomics because of its sensitivity and its ability to handle complex samples. Proteins are typically digested with endoproteases such as trypsin to generate peptides, which are then analyzed by tandem mass spectrometry (MS/MS), where collision-induced dissociation produces fragment ions. The N-terminal sequence is inferred from b-ions, which retain the charge on the N-terminal fragment; the mass-to-charge ratios of successive b-ions differ by the residue mass of each added amino acid (e.g., glycine and alanine residues differ by about 14 Da). Techniques such as electron transfer dissociation (ETD) improve coverage by generating c-ions, which also contain the N-terminus and complement the b-ion series. The method can detect femtomole quantities of material and reveals modifications such as acetylation through mass shifts (e.g., +42 Da for an acetyl group). Enrichment strategies, such as negative selection that depletes internal peptides, further isolate N-terminal peptides for targeted analysis.
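
As a rough illustration of the b-ion arithmetic described above, the sketch below computes a singly charged b-ion series from standard monoisotopic residue masses and shows the +42 Da shift caused by N-terminal acetylation. The peptide `GASLK` and the helper name `b_ions` are hypothetical choices for illustration, and real spectra contain noise, missing ions, and multiple charge states that this sketch ignores.

```python
# Monoisotopic residue masses in Da (standard values, rounded to 5 decimals).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728   # mass of the charge-carrying proton
ACETYL = 42.01057  # mass added by N-terminal acetylation


def b_ions(peptide, nterm_mod=0.0):
    """Return singly charged b-ion m/z values b1..b(n-1) for a peptide."""
    mz, running = [], PROTON + nterm_mod
    for residue in peptide[:-1]:
        running += MONO[residue]
        mz.append(running)
    return mz


free = b_ions("GASLK")                          # hypothetical peptide
acetylated = b_ions("GASLK", nterm_mod=ACETYL)  # same peptide, acetylated N-terminus

# Differences between successive b-ions equal the residue masses of A, S, L.
steps = [b - a for a, b in zip(free, free[1:])]
# Every b-ion of the acetylated peptide is shifted by the acetyl mass.
shifts = [a - f for f, a in zip(free, acetylated)]

print(steps)
print(shifts)
```

Because leucine and isoleucine have identical residue masses, a step of 113.084 Da cannot distinguish them by mass alone.
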
A related chemical strategy, introduced by Pehr Edman in 1950, uses phenylisothiocyanate (PITC) to derivatize the N-terminal amino group. The labeled residue is cleaved under mild conditions and identified chromatographically as its phenylthiohydantoin (PTH) derivative, which allows iterative sequencing without destroying the rest of the protein. While N-terminal identification alone confirms endpoints, Edman degradation extends the principle to sequential residue determination.

Applications of N-terminal analysis include the verification of recombinant proteins, where the expressed product must match the predicted sequence after cleavage of affinity tags or signal peptides, ensuring functionality and batch consistency. It is also critical for detecting blocked N-termini, such as N-acetylated residues (found on approximately 80–90% of human proteins and common across eukaryotes) or pyroglutamyl residues, which prevent standard sequencing and can signal regulatory roles in stability or localization; mass spectrometry often resolves these through precise mass mapping. In proteomics workflows, it helps anchor de novo sequencing and detect impurities in therapeutic proteins.

C-Terminal Analysis

C-terminal analysis identifies the residue at the carboxyl terminus, providing information that helps verify the orientation of the polypeptide chain and confirm that the protein is intact. Unlike N-terminal methods, which target the amino group, C-terminal approaches exploit the reactivity of the carboxyl group to release or label terminal residues. Early techniques relied on enzymatic and chemical degradation, while contemporary methods integrate mass spectrometry for greater precision and throughput.

The primary enzymatic approach uses carboxypeptidases, exopeptidases that sequentially hydrolyze peptide bonds from the C-terminus, releasing free amino acids that can be quantified over time to deduce the sequence. Carboxypeptidase A (CPA), derived from bovine pancreas, preferentially releases non-basic residues other than proline, especially aromatic and aliphatic amino acids, making it suitable for initial C-terminal identification in many proteins. Carboxypeptidase B (CPB) complements CPA by specifically targeting basic residues such as lysine and arginine at the C-terminus, so the two can be combined to handle diverse terminal sequences. Carboxypeptidase Y (CPY) from yeast is widely used for its broad substrate specificity, releasing nearly all C-terminal residues including proline, though it is usually restricted to limited sequencing of 5–10 residues to avoid incomplete reactions.

A classical chemical method for C-terminal identification is hydrazinolysis, developed by Shiro Akabori in the early 1950s. The protein is heated with anhydrous hydrazine (around 100°C for several hours), which cleaves the peptide bonds and converts every residue except the C-terminal one into its amino acid hydrazide. The C-terminal residue, whose carboxyl group is not part of a peptide bond, is released as the free amino acid. It is then separated from the hydrazides and identified by chromatography or after derivatization, for example with dinitrophenyl (DNP) reagents, enabling unambiguous assignment.
Hydrazinolysis, first applied to peptides and proteins such as insulin, was a significant advance of the 1950s because it confirmed C-terminal residues without the biases of enzymatic release.

In modern workflows, mass spectrometry enhances C-terminal analysis, particularly through ladder sequencing coupled with carboxypeptidase digestion. Time- or concentration-dependent digestion with CPY generates a series of truncated peptides, which are analyzed by matrix-assisted laser desorption/ionization mass spectrometry (MALDI-MS); the mass differences between adjacent peaks correspond to specific amino acid residues, revealing the C-terminal sequence. In tandem MS (MS/MS), peptide fragmentation produces y-ions, fragments that retain the C-terminus, whose masses allow the C-terminal sequence to be read starting from the low-mass end of the spectrum.

Despite these advances, C-terminal sequencing faces challenges. Carboxypeptidases release certain residues slowly or incompletely, which can stall sequential release and lead to ambiguous results. Hydrazinolysis, while specific, risks partial degradation of sensitive residues such as serine, and it requires anhydrous conditions to minimize side reactions. These limitations often call for orthogonal methods for verification, particularly in complex proteomes.
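
The ladder-sequencing readout described above can be sketched as follows. This is a simplified illustration, not an instrument workflow: the peptide `ACDEFGLK`, the function names, and the 0.01 Da tolerance are assumptions, and the peak list is simulated from standard monoisotopic residue masses instead of being measured.

```python
# Monoisotopic residue masses in Da (standard values, rounded to 5 decimals).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per intact peptide chain


def peptide_mass(sequence):
    """Neutral monoisotopic mass of a linear peptide."""
    return sum(MONO[r] for r in sequence) + WATER


def simulate_ladder(sequence, n_removed):
    """Masses of the intact peptide and its C-terminally truncated forms."""
    return [peptide_mass(sequence[: len(sequence) - i]) for i in range(n_removed + 1)]


def read_c_terminus(peak_masses, tol=0.01):
    """Assign a residue to each gap between adjacent ladder peaks.

    Returns residues in order of release (C-terminal residue first).
    Residues of identical mass are reported together, e.g. "L/I";
    "?" marks a gap that matches no residue within the tolerance.
    """
    ordered = sorted(peak_masses, reverse=True)
    calls = []
    for heavier, lighter in zip(ordered, ordered[1:]):
        gap = heavier - lighter
        hits = [r for r, m in MONO.items() if abs(m - gap) <= tol]
        calls.append("/".join(hits) if hits else "?")
    return calls


peaks = simulate_ladder("ACDEFGLK", n_removed=3)  # hypothetical peptide
c_terminal_calls = read_c_terminus(peaks)
print(c_terminal_calls)
```

The tight tolerance is what separates lysine (128.095 Da) from glutamine (128.059 Da) here; at lower mass accuracy the two would be reported together.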

Edman Degradation

Peptide Fragmentation

Peptide fragmentation is a critical step in protein sequencing, in which intact proteins are cleaved into smaller peptides for subsequent analysis by methods such as Edman degradation or mass spectrometry. The process generates manageable fragments, typically 5–50 residues long, whose partial sequences can be assembled into the full protein sequence. Cleavage is achieved by enzymatic or chemical means, each offering specific advantages in site selectivity and reaction conditions.

Enzymatic digestion employs proteases of defined specificity to hydrolyze peptide bonds under mild aqueous conditions, preserving the integrity of amino acid side chains. Trypsin, a serine protease, cleaves on the C-terminal side of lysine (Lys) and arginine (Arg) residues, except when the next residue is proline, producing peptides with basic C-termini that are amenable to further purification. Chymotrypsin preferentially cleaves after large hydrophobic residues such as phenylalanine (Phe), tyrosine (Tyr), and tryptophan (Trp), and acts on leucine (Leu) and methionine (Met) at lower rates, generating aromatic-containing peptides useful for mapping hydrophobic regions. Endoproteinase Glu-C (also known as V8 protease) cleaves on the C-terminal side of glutamic acid (Glu) residues and also after aspartic acid (Asp) under certain buffer conditions (e.g., phosphate buffer at pH 7.8), producing acidic peptides that give complementary coverage.

Chemical cleavage methods provide alternatives when enzymatic approaches are insufficient, often targeting less frequent residues to give larger fragments. Cyanogen bromide (CNBr) reacts with the sulfur of methionine (Met) residues and cleaves on their C-terminal side, converting Met to homoserine lactone and yielding peptides suitable for N-terminal sequencing; the method is particularly effective for proteins with few Met residues, as demonstrated in early structural studies of cytochromes.
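
The cleavage specificities above can be sketched as an in-silico digestion. The rules below are deliberately simplified (the minor chymotrypsin sites at Leu and Met and the buffer dependence of Glu-C are ignored), and the names `CLEAVAGE_RULES` and `digest` and the test sequences are hypothetical.

```python
import re

# Simplified cleavage rules as zero-width regex patterns marking the cut position.
CLEAVAGE_RULES = {
    "trypsin": r"(?<=[KR])(?!P)",   # after Lys/Arg, but not before Pro
    "chymotrypsin": r"(?<=[FYW])",  # after Phe/Tyr/Trp only
    "glu_c": r"(?<=E)",             # after Glu
    "cnbr": r"(?<=M)",              # after Met
}


def digest(sequence, agent):
    """Split a one-letter protein sequence at every site cut by the agent.

    Assumes complete cleavage with no missed sites. Requires Python 3.7+,
    where re.split accepts patterns that match the empty string.
    """
    pieces = re.split(CLEAVAGE_RULES[agent], sequence)
    return [p for p in pieces if p]  # drop the empty piece left by a terminal cut


protein = "AGFKLLWRDEYKGS"  # hypothetical sequence
for agent in CLEAVAGE_RULES:
    print(agent, digest(protein, agent))
```

Running several agents on the same sequence yields fragment sets whose boundaries fall at different positions, which is what makes later overlap analysis possible.
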
Endoproteinase Asp-N, a metalloprotease, cleaves on the N-terminal side of aspartic acid (Asp) residues and, to a lesser extent, glutamic acid (Glu), producing peptides with Asp at the N-terminus that help resolve regions resistant to other cleavages.

To reconstruct the complete protein sequence from fragmented peptides, an overlap strategy is employed: multiple parallel digests with different enzymes or chemicals generate sets of peptides that share overlapping sequences. These overlaps allow alignment and assembly, as pioneered in the sequencing of insulin, where tryptic and chymotryptic fragments were compared to order the chain.

Following digestion, peptides are often separated by gel-based electrophoresis to isolate individual components prior to sequencing. Sodium dodecyl sulfate-polyacrylamide gel electrophoresis (SDS-PAGE) resolves peptides by molecular weight under denaturing conditions, while two-dimensional (2D) electrophoresis combines isoelectric focusing with SDS-PAGE for enhanced resolution of complex mixtures.

Optimization of fragmentation yield is essential, particularly for proteins with disulfide bonds that can hinder protease access. Reduction of cystine (Cys-Cys) bridges with agents such as dithiothreitol (DTT), followed by alkylation of the free thiols with iodoacetamide (IAA), unfolds the protein and prevents disulfides from re-forming, ensuring complete digestion and higher sequence coverage in downstream Edman or mass spectrometry workflows.
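
The overlap strategy described above can be illustrated with a toy example. The sketch assumes that every fragment has already been fully sequenced and that the peptide sets are small enough to test every ordering by brute force; the fragment lists and the function name `assemble` are hypothetical.

```python
from itertools import permutations


def assemble(primary_fragments, overlap_fragments):
    """Order one set of fragments using a second, overlapping set.

    Tries every ordering of primary_fragments and keeps the concatenations
    that contain every overlap fragment as a contiguous substring.
    """
    candidates = set()
    for order in permutations(primary_fragments):
        candidate = "".join(order)
        if all(fragment in candidate for fragment in overlap_fragments):
            candidates.add(candidate)
    return sorted(candidates)


# Hypothetical fragment sets from two digests of the same protein,
# listed in arbitrary order as they might come off a separation.
tryptic = ["DEYK", "GS", "AGFK", "LLWR"]
chymotryptic = ["KGS", "AGF", "RDEY", "KLLW"]

solutions = assemble(tryptic, chymotryptic)
print(solutions)
```

Here the chymotryptic peptides bridge each tryptic junction, so only one ordering survives. With fewer overlaps, several candidates may remain, which is why additional digests are run in practice.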

Chemical Reaction Mechanism

The Edman degradation proceeds through a cyclic series of chemical reactions that label, cleave, and identify the N-terminal amino acid of a peptide, enabling sequential sequencing without disrupting the remaining chain. In the initial coupling step, phenylisothiocyanate (PITC) reacts with the free α-amino group of the N-terminal residue under mildly basic conditions (pH 8–9), typically in a buffered solution. The nucleophilic nitrogen of the amine attacks the electrophilic central carbon of the isothiocyanate group, forming a stable phenylthiocarbamoyl (PTC) derivative by nucleophilic addition, which yields a thiourea linkage. The reaction requires the unprotonated primary amine, which is why mildly basic conditions are used.

Following coupling, the PTC-peptide is cleaved with trifluoroacetic acid (TFA) at room temperature for approximately 10–30 minutes. Under these acidic conditions, the sulfur atom of the PTC group attacks the carbonyl carbon of the first peptide bond, leading to cyclization and formation of a five-membered thiazolinone ring. The cyclization cleaves the peptide bond adjacent to the N-terminal residue, releasing the thiazolinone derivative while leaving the shortened peptide intact and ready for the next cycle. The reaction is nearly quantitative under anhydrous conditions, which minimize hydrolysis of internal peptide bonds.

The unstable thiazolinone is then converted to the stable phenylthiohydantoin (PTH) derivative through acid-catalyzed rearrangement, often by brief treatment with aqueous TFA or heating in an acidic medium. This involves ring opening and recyclization, giving a thiohydantoin heterocycle that carries the side chain of the original amino acid, is soluble in organic solvents, and is amenable to chromatographic identification.
The PTH-amino acid is extracted into an organic phase (e.g., ethyl acetate) and analyzed, typically by reverse-phase HPLC, by comparison of retention times with PTH standards derived from known amino acids. The overall process per cycle can be represented as:

$$\text{Peptide-NH}_2 + \text{PITC} \rightarrow \text{PTC-Peptide} \xrightarrow{\text{TFA}} \text{Thiazolinone} + \text{Peptide(-1)-NH}_2$$

$$\text{Thiazolinone} \xrightarrow{\text{aq. acid}} \text{PTH-AA}$$
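
The cyclic nature of the procedure can be mimicked with a short simulation. This is only a bookkeeping sketch under idealized assumptions (every cycle goes to completion, and a blocked N-terminus yields nothing); the function name `edman_cycles`, the `blocked` flag, and the example peptide are hypothetical.

```python
def edman_cycles(peptide, n_cycles, blocked=False):
    """Simulate idealized Edman cycles on a one-letter peptide sequence.

    Returns a list of (cycle number, released PTH derivative, remaining peptide).
    A blocked N-terminus (e.g. acetylated) cannot couple with PITC, so no
    residues are released.
    """
    if blocked:
        return []
    results = []
    remaining = peptide
    for cycle in range(1, n_cycles + 1):
        if not remaining:
            break  # the peptide has been fully consumed
        released, remaining = remaining[0], remaining[1:]
        results.append((cycle, f"PTH-{released}", remaining))
    return results


for cycle, derivative, rest in edman_cycles("GIVEQ", n_cycles=3):
    print(cycle, derivative, rest)
```

Each pass removes exactly one residue from the N-terminus and leaves the shortened peptide as the substrate for the next pass, mirroring the coupling, cleavage, and conversion steps above.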