PROSITE
| Content | |
|---|---|
| Description | PROSITE, a protein domain database for functional characterization and annotation. |
| Contact | |
| Research center | Swiss Institute of Bioinformatics |
| Laboratory | Structural Biology and Bioinformatics Department |
| Primary citation | PMID 19858104 |
| Release date | 1988 |
| Access | |
| Website | prosite |
PROSITE is a protein database.[1][2] It consists of entries describing protein families, domains and functional sites, as well as the amino acid patterns and profiles found in them. These are manually curated by a team at the Swiss Institute of Bioinformatics and tightly integrated into Swiss-Prot protein annotation. PROSITE was created in 1988 by Amos Bairoch, who directed the group for more than 20 years. Since July 2018, the director of PROSITE and Swiss-Prot has been Alan Bridge.
PROSITE's uses include identifying possible functions of newly discovered proteins and analysis of known proteins for previously undetermined activity. Properties from well-studied genes can be propagated to biologically related organisms, and for different or poorly known genes biochemical functions can be predicted from similarities. PROSITE offers tools for protein sequence analysis and motif detection (see sequence motif, PROSITE patterns). It is part of the ExPASy proteomics analysis servers.
The database ProRule builds on the domain descriptions of PROSITE.[3] It provides additional information about functionally or structurally critical amino acids. The rules contain information about biologically meaningful residues, like active sites, substrate- or co-factor-binding sites, posttranslational modification sites or disulfide bonds, to help function determination. These can automatically generate annotation based on PROSITE motifs.
Statistics
As of 26 February 2022, release 2022_01 has 1,902 documentation entries, 1,311 patterns, 1,336 profiles, and 1,352 ProRules.
See also
- UniProt: the universal protein database, a central resource on protein information; PROSITE adds data to it.
- InterPro: a centralized database grouping data from databases of protein families, domains and functional sites; part of the data comes from PROSITE.
- Protein subcellular localization prediction: another example of the use of PROSITE.
References
1. De Castro E, Sigrist CJA, Gattiker A, Bulliard V, Langendijk-Genevaux PS, Gasteiger E, Bairoch A, Hulo N (2006). "ScanProsite: detection of PROSITE signature matches and ProRule-associated functional and structural residues in proteins". Nucleic Acids Res. 34 (Web Server issue): W362–365. doi:10.1093/nar/gkl124. PMC 1538847. PMID 16845026.
2. Hulo N, Bairoch A, Bulliard V, Cerutti L, Cuche B, De Castro E, Lachaize C, Langendijk-Genevaux PS, Sigrist CJA (2007). "The 20 years of PROSITE". Nucleic Acids Res. 36 (Database issue): D245–9. doi:10.1093/nar/gkm977. PMC 2238851. PMID 18003654.
3. Sigrist CJ, De Castro E, Langendijk-Genevaux PS, Le Saux V, Bairoch A, Hulo N (2005). "ProRule: a new database containing functional and structural information on PROSITE profiles". Bioinformatics. 21 (21): 4060–4066. doi:10.1093/bioinformatics/bti614. PMID 16091411.
External links
- Official website
- ProRule: database of rules based on PROSITE predictors
PROSITE
Overview
Definition and Purpose
PROSITE is a specialized database and method designed for the detection of biologically meaningful signatures in protein sequences, enabling the inference of protein function, structure, or evolutionary relationships. It compiles documentation on protein domains, families, and functional sites, along with associated signatures in the form of patterns and profiles that can be used to identify these elements in query sequences.[1]
The primary purpose of PROSITE is to facilitate the annotation of uncharacterized proteins derived from genomic or cDNA sequencing projects by matching their sequences against known motifs, thereby predicting potential functions or classifications. This approach allows researchers to automatically classify proteins into specific families or detect critical functional sites, such as active sites or binding regions, based on conserved sequence features.[1][5]
Created in 1988 as a tool for protein sequence analysis, PROSITE has become an integral part of the broader bioinformatics ecosystem hosted by ExPASy, supporting automated and manual analyses in functional genomics. It relies on two kinds of signature: patterns, which are regular expressions capturing short conserved motifs, and profiles, which are position-specific scoring matrices suited to more distant similarities. With these, the database enables reliable detection of evolutionary and functional relationships without requiring full sequence alignment.[5][1]
Key Components
PROSITE entries are structured around several core elements that enable the identification and annotation of protein motifs, domains, and functional sites. These include textual documentation providing biological context, patterns as regular expressions for detecting short conserved sequence motifs, profiles as position-specific scoring matrices (PSSMs) for recognizing longer protein domains, and ProRules as logical rules that combine multiple criteria to refine predictions of functional sites.[6][7]
Patterns in PROSITE represent short, highly conserved sequence motifs, typically 3 to 30 amino acids long, and are expressed using a specialized syntax that accounts for allowed residues, gaps, and exclusions. Positions are separated by hyphens; square brackets list alternative residues (e.g., [AC] for alanine or cysteine), x(n) stands for n consecutive positions of any residue, and curly braces list excluded residues (e.g., {P} for any residue except proline). A representative example is the N-glycosylation site pattern, N-{P}-[ST]-{P}: asparagine (N) is the modification site, followed by any residue except proline, then serine or threonine, then again any residue except proline. This motif is common in eukaryotic proteins, where it marks sites for attaching N-linked glycans.[6][8] Patterns are designed for high specificity and sensitivity, and are often validated against known protein sequences to minimize false positives.[6]
Profiles serve as more flexible signatures for extended protein regions, functioning as PSSMs derived from alignments of related sequences to score potential matches based on position-specific residue frequencies and evolutionary conservation. Unlike rigid patterns, profiles allow fuzzy matching similar to hidden Markov models (HMMs), accommodating variations in domain length and sequence divergence through logarithmic scoring and threshold cut-offs (e.g., a positive threshold for reliable hits and a noisy threshold for exploratory matches). For instance, a profile for the HSP20 family domain uses a matrix to evaluate alignments across heat shock protein sequences, enabling detection of distant homologs.[6][9] Profiles are particularly useful for globular domains where sequence conservation is moderate.[7]
ProRules extend the utility of patterns and profiles by incorporating logical conditions, such as requiring specific residue contexts or structural features, to predict functional sites more accurately; for example, a ProRule might combine a phosphorylation pattern with proximity to a catalytic residue. Written in the UniRule format, these rules are triggered only when underlying signatures match, enhancing annotation precision in automated pipelines.[10][6]
Supporting these signatures are associated data structures that provide context and reliability assessments. They include the taxonomic scope, which indicates the applicable organism groups (e.g., restricted to eukaryotes via a qualifier like /TAXO-RANGE=??E??); cross-references to databases such as UniProtKB for validated examples and PDB for structural data; and evidence levels denoting the strength of annotations (e.g., experimental confirmation versus computational prediction, often quantified by positive hit counts in curated sequences).[6] These elements ensure that PROSITE signatures are biologically grounded and interoperable with other resources.[7]
History and Development
Origins and Creation
PROSITE was created in 1988 by Amos Bairoch, a bioinformatician at the Department of Medical Biochemistry, University of Geneva, as an early tool for protein sequence analysis in the emerging field of bioinformatics.[11][12] Bairoch, who had previously initiated the Swiss-Prot protein sequence database in 1986, recognized the limitations of basic sequence comparisons and sought to develop a specialized resource for detecting biologically significant features in proteins.[12]
The primary motivation for PROSITE arose from the rapid accumulation of protein sequence data in the late 1980s, particularly through databases like Swiss-Prot, which by 1988 contained thousands of entries but lacked efficient methods for identifying shared functional motifs across related proteins.[12] Bairoch aimed to address this by compiling patterns derived from Swiss-Prot annotations, enabling the systematic detection of protein families, domains, and functional sites to facilitate annotation and discovery of new members in uncharacterized sequences.[12] This approach was essential in an era when manual curation was predominant, and automated tools for motif recognition were scarce.
PROSITE's first release occurred in March 1988, distributed via the PC/Gene software package from IntelliGenetics, and included a modest collection of 58 manually curated patterns extracted from the scientific literature, each accompanied by a descriptive abstract outlining the associated protein family or domain.[12] Early development faced significant challenges, including the constraints of limited computational power on personal computers of the time, which restricted the scope and sophistication of pattern searches.[12] Moreover, the initial patterns depended on exact sequence matching, making them vulnerable to false negatives from sequence variations, errors, or evolutionary divergence, a limitation that persisted until the later adoption of profile-based methods.[12]
Evolution and Milestones
In the 1990s, PROSITE underwent significant expansions to address limitations in its initial pattern-based approach, particularly for detecting variable protein domains. In 1994, generalized profiles were introduced by Philipp Bucher, enabling the representation of more flexible motifs through position-specific scoring matrices (PSSMs) that captured sequence conservation and variability more effectively than rigid patterns.[12] This innovation allowed PROSITE to handle diverse family alignments, improving sensitivity for distant homologs. From its inception, integration with the Swiss-Prot (now UniProtKB/Swiss-Prot) database has facilitated automated annotation, where PROSITE signatures were cross-referenced to annotate protein functions directly during Swiss-Prot curation, enhancing the database's utility for large-scale sequence analysis.[12]
The 2000s marked further milestones in diversifying PROSITE's methodology and scale. In 2005, the ProRule system was added, providing rule-based predictions that generate precise functional annotations based on profile or pattern matches, such as specifying post-translational modifications or active sites.[13] By that year, PROSITE had surpassed 1,000 documentation entries, reflecting steady growth in curated motifs.[14] The database celebrated its 20-year anniversary in 2008, at which point release 20.19 covered 53% of UniProtKB/Swiss-Prot entries, demonstrating its expanding impact on protein annotation.[12]
From the 2010s to the 2020s, PROSITE refined its signature detection through methodological advancements and broader interoperability. Around 2008, PROSITE's development was formally placed under the stewardship of the SIB Swiss Institute of Bioinformatics, enhancing its integration with other SIB resources like UniProtKB.[15] Alignment with InterPro since the late 1990s has enabled hierarchical organization of domains within protein families, allowing PROSITE signatures to contribute to integrated views of evolutionary relationships across multiple databases.[16] As of release 2025_04 in October 2025, PROSITE comprises over 1,900 documentation entries, underscoring its ongoing evolution.[1]
These developments represent a shift from a pattern-only system to a hybrid framework combining patterns, profiles, and rules, which has broadened PROSITE's applicability in functional annotation. Open access via the ExPASy server, established in the 1990s, has supported global usage and continuous updates.[1]
Database Content
Entry Formats
PROSITE entries follow a standardized structure designed for clarity and machine readability. Each begins with an identifier line (ID) that gives the entry name and type, for example the name PROTEIN_KINASE_DOM with type MATRIX for a profile entry; pattern entries carry the type PATTERN.[6] This is followed by the accession number (AC), a unique identifier like PS50011; a description line (DE) summarizing the motif or domain, e.g., "Protein kinase domain profile"; and the signature itself, in PA lines for patterns (one-letter amino acid codes, with x standing for any residue) or MA lines for profile matrices.[6] Additional lines include PR for associated ProRules (e.g., PRU00159), NR for numerical performance results from scans against UniProtKB, and DR for cross-references to external databases like UniProt entries.[17] Other lines hold comments (CC), including taxonomic scope via /TAXO-RANGE qualifiers, and a pointer (DO) to the matching PDOC documentation entry; each entry ends with "//".[6]
The primary distribution format is a flat text file, human-readable and structured with fixed two-character line codes followed by content, limited to 78 characters per line except for matrix data.[6] This format, contained in files like prosite.dat, prosite.doc for documentation, and profile.dat for matrices, enables bulk downloads from the ExPASy FTP server (ftp.expasy.org/databases/prosite/) and supports parsing by bioinformatics tools.[6] Since the 2000s, PROSITE data has also been accessible in XML for structured programmatic querying, particularly through integrations like UniProt's XML exports that embed PROSITE annotations, and in RDF for semantic web applications via projects like Bio2RDF.
A representative example is the entry for the protein kinase domain (PS50011), shown here in abridged form to illustrate the format's organization (as of release 2025_04):
ID PROTEIN_KINASE_DOM MATRIX; PRF; 259 aa; matrix.
AC PS50011;
DE Protein kinase domain profile.
DO PDOC00100;
CC -!- MATRIX_TYPE: protein_domain;
CC -!- TAXO-RANGE: Archaea; Bacteria; Eukaryota; Eukaryotic viruses.
CC -!- AUTHOR: P.Bucher
PR PRU00159;
NR True positives: 4504 (4438 sequences); False positives: 11; False negatives: 243.
DR UNI; Q6GZV6; Q197B6; ... (4438 true positive sequences); B9DGY1; Q93Y08; ... (243 false negatives); P58551; Q9KVB9; ... (11 false positives).
MA /GENERAL_SPEC: ALPHABET='ABCDEFGHIKLMNPQRSTVWYZ*'; LENGTH=259;
MA [Excerpt of position-specific scores, e.g., row 1: A= 0 B=-5 ... Z=-5]
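Because every line starts with a two-character code and every entry ends with "//", the flat file can be read with very little code. The following Python sketch groups lines by code. The function name `parse_flat_file` and the sample text are illustrative assumptions, not part of any PROSITE tooling, and real files contain further line types and continuation conventions that this sketch does not handle.

```python
from collections import defaultdict


def parse_flat_file(text):
    """Split flat-file text into entries; each entry maps a line code to its lines."""
    entries = []
    current = defaultdict(list)
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line:
            continue
        if line.startswith("//"):           # entry terminator
            if current:
                entries.append(dict(current))
                current = defaultdict(list)
            continue
        # First two characters are the line code; the rest is the content.
        code, content = line[:2], line[2:].strip()
        current[code].append(content)
    if current:                             # tolerate a missing final terminator
        entries.append(dict(current))
    return entries


# Hypothetical sample modeled on the abridged entry shown above.
SAMPLE = """\
ID PROTEIN_KINASE_DOM MATRIX; PRF; 259 aa; matrix.
AC PS50011;
DE Protein kinase domain profile.
DO PDOC00100;
PR PRU00159;
//
"""

entries = parse_flat_file(SAMPLE)
print(entries[0]["AC"])  # ['PS50011;']
```

Keeping the content of each line as a raw string leaves any further interpretation (accession punctuation, qualifiers, matrix rows) to the caller.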
Pattern and Profile Types
PROSITE signatures are categorized into patterns and profiles, each designed to detect specific biological features in protein sequences with varying degrees of sensitivity and specificity. Patterns represent short, conserved sequence motifs using regular expression notation, while profiles employ position-specific scoring matrices (PSSMs) or more advanced generalized profiles to model entire protein domains or families. These signature types target distinct biological entities, such as functional sites, structural domains, evolutionary families, and tandem repeats, enabling the identification of protein functions and relationships.[6]
Patterns in PROSITE are qualitative descriptors: a sequence either matches or it does not. They are written with one-letter amino acid codes, 'x' for any residue, square brackets for alternatives, and curly braces for exclusions. They are particularly suited for highly conserved, short motifs, such as catalytic active sites or sites of post-translational modification, where high specificity is crucial to minimize false positives. For instance, a pattern for the cutinase active site is expressed as P-x-[STA]-x-[LIV]-[IVT]-x-[GS]-G-Y-S-[QL]-G, which detects the precise arrangement of residues essential for enzymatic activity. Specificity depends strongly on pattern length: a long pattern such as the cutinase signature rarely matches by chance, whereas a very short one such as [ST]-x-[RK], the protein kinase C phosphorylation site motif, occurs frequently in unrelated proteins, so its hits need supporting evidence before they are treated as biologically relevant. Pattern design therefore balances detection of true positives against the risk of over-matching in diverse protein contexts.[6][7]
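Since the pattern notation is close to regular expressions, a pattern can be translated mechanically. The Python sketch below covers only the elements described in this article (hyphen-separated positions, `x`, `[...]`, `{...}`, and repeat counts such as `x(3)` or `x(2,4)`); the function names are hypothetical, and other parts of the syntax, such as terminus anchors, are deliberately left out.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Separate the residue specification from an optional (n) or (n,m) repeat.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.groups()
        if core == "x":
            regex = "."                          # any residue
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"      # excluded residues
        else:
            regex = core                         # single residue or [alternatives]
        if low:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return "".join(parts)


def find_matches(pattern, sequence):
    """Return (1-based start, matched text) for every match, including overlapping ones."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


print(prosite_to_regex("N-{P}-[ST]-{P}"))               # N[^P][ST][^P]
print(find_matches("N-{P}-[ST]-{P}", "MNGSANPSTNKTA"))  # [(2, 'NGSA'), (10, 'NKTA')]
```

The lookahead wrapper is a design choice: it reports overlapping occurrences, which a plain `finditer` over the bare expression would skip. The test sequence is made up for illustration; the asparagine at position 6 is not reported because it is followed by proline.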
Profiles, in contrast, provide quantitative models for longer sequence regions, constructed from multiple sequence alignments (MSAs) of related proteins with the pftools package; the resulting signatures can then be scanned against sequences with programs such as ps_scan. Basic PSSMs assign position-specific scores based on observed residue frequencies and substitution matrices, such as BLOSUM, to evaluate sequence similarity across an entire domain; for example, the globin family profile spans approximately 140 positions and scores alignments to identify oxygen-binding domains with a cutoff threshold for significance. Generalized profiles, introduced by Bucher in 1994, extend this by incorporating hidden Markov model (HMM)-like features, including penalties for insertions and deletions, which enhance sensitivity for detecting distant homologs in less conserved families. These are built by aligning sequences (e.g., via ClustalW or T-Coffee), extending fragments to include flanking regions, and optimizing for subfamily specificity, as seen in profiles for animal peroxidases that capture evolutionary divergences while maintaining low false-positive rates. Profiles thus excel in annotating structural and functional domains where patterns alone lack sufficient power.[6][12][7]
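The core idea of position-specific scoring can be shown with a toy example. The Python sketch below builds log-odds scores from a small ungapped alignment and slides the matrix along a query. It is a deliberately simplified stand-in, not the generalized profile method: it assumes a uniform background frequency, uses a flat pseudocount instead of a substitution matrix, and has no insertion or deletion penalties. The alignment, query, and cutoff are invented for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Build per-column log2-odds scores from an ungapped alignment (uniform background)."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def scan(pssm, sequence, cutoff):
    """Score every window of the query; return (1-based start, score) at or above the cutoff."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Residues outside the 20-letter alphabet contribute a neutral score of 0.
        score = sum(column.get(aa, 0.0) for column, aa in zip(pssm, window))
        if score >= cutoff:
            hits.append((start + 1, round(score, 2)))
    return hits


pssm = build_pssm(["GKSA", "GKTA", "GKSG", "GRSA"])
hits = scan(pssm, "MMGKSAMM", cutoff=5.0)
print(hits)  # [(3, 7.27)]
```

Unlike a pattern, the matrix returns a graded score, so the cutoff decides how much divergence from the alignment is tolerated.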
Hybrid signatures combine patterns and profiles to optimize both sensitivity and specificity, often integrating ProRules for contextual validation. For example, a profile may detect a broad domain like an ATP-binding helicase, while an embedded pattern confirms critical catalytic residues, and rules promote weak profile matches if the pattern aligns. This synergy is vital for complex annotations, reducing errors in family assignment.[6][12]
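The promotion logic described here can be written as a small decision function. This Python sketch is only an illustration of the idea: the function name, the two cutoff parameters, and the returned labels are assumptions, and it does not reproduce how ProRules are actually encoded or evaluated.

```python
def classify_hit(profile_score, reliable_cutoff, weak_cutoff, pattern_matches):
    """Combine a profile score with a pattern check (hypothetical labels and cutoffs)."""
    if profile_score >= reliable_cutoff:
        return "reliable"                 # the profile alone is convincing
    if profile_score >= weak_cutoff:
        # A weak profile hit is promoted only if the embedded pattern also matches.
        return "promoted" if pattern_matches else "weak"
    return "none"


print(classify_hit(9.1, reliable_cutoff=8.5, weak_cutoff=6.5, pattern_matches=False))  # reliable
print(classify_hit(7.0, reliable_cutoff=8.5, weak_cutoff=6.5, pattern_matches=True))   # promoted
print(classify_hit(7.0, reliable_cutoff=8.5, weak_cutoff=6.5, pattern_matches=False))  # weak
```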
Biologically, PROSITE signatures classify protein features into domains as compact structural or functional units (e.g., zinc fingers detected by profiles), families as evolutionarily related groups (e.g., small GTPases via generalized profiles), repeats as tandemly arrayed motifs (e.g., EF-hand calcium-binding repeats with context-dependent scoring), and sites as localized functional elements (e.g., phosphorylation or metal-binding sites marked in patterns). These categories facilitate targeted detection, with domains and families relying more on profiles for comprehensive coverage, while sites and repeats favor patterns for precision.[6][7]

