Candidate key
from Wikipedia

A candidate key, or simply a key, of a relational database is any set of columns that have a unique combination of values in each row, with the additional constraint that removing any column could produce duplicate combinations of values.

A candidate key is a minimal superkey,[1] i.e., a superkey that does not contain a smaller one. Therefore, a relation can have multiple candidate keys, each with a different number of attributes.[2]

Specific candidate keys are sometimes called primary keys, secondary keys or alternate keys. The columns in a candidate key are called prime attributes,[3] and a column that does not occur in any candidate key is called a non-prime attribute.

Every relation without NULL values will have at least one candidate key: Since there cannot be duplicate rows, the set of all columns is a superkey, and if that is not minimal, some subset of that will be minimal.

There is a functional dependency from the candidate key to all the attributes in the relation.

The superkeys of a relation are all the possible ways we can identify a row. The candidate keys of a relation are its minimal superkeys and as such, they are an important concept for the design of database schemas.

Example


The definition of candidate keys can be illustrated with the following (abstract) example. Consider a relation variable (relvar) R with attributes (A, B, C, D) that has only the following two legal values r1 and r2:

r1
A B C D
a1 b1 c1 d1
a1 b2 c2 d1
a2 b1 c2 d1
r2
A B C D
a1 b1 c1 d1
a1 b2 c2 d1
a1 b1 c2 d2

Here r2 differs from r1 only in the A and D values of the last tuple.

For r1 the following sets have the uniqueness property, i.e., there are no two distinct tuples in the instance with the same attribute values in the set:

{A,B}, {A,C}, {B,C}, {A,B,C}, {A,B,D}, {A,C,D}, {B,C,D}, {A,B,C,D}

For r2 the uniqueness property holds for the following sets:

{B,C}, {B,D}, {C,D}, {A,B,C}, {A,B,D}, {A,C,D}, {B,C,D}, {A,B,C,D}

Since superkeys of a relvar are those sets of attributes that have the uniqueness property for all legal values of that relvar and because we assume that r1 and r2 are all the legal values that R can take, we can determine the set of superkeys of R by taking the intersection of the two lists:

{B,C}, {A,B,C}, {A,B,D}, {A,C,D}, {B,C,D}, {A,B,C,D}

Finally we need to select those sets for which there is no proper subset in the list, which are in this case:

{B,C}, {A,B,D}, {A,C,D}

These are indeed the candidate keys of relvar R.

We have to consider all the relations that might be assigned to a relvar to determine whether a certain set of attributes is a candidate key. For example, if we had considered only r1 then we would have concluded that {A,B} is a candidate key, which is incorrect. However, we might be able to conclude from such a relation that a certain set is not a candidate key, because that set does not have the uniqueness property (example {A,D} for r1). Note that the existence of a proper subset of a set that has the uniqueness property cannot in general be used as evidence that the superset is not a candidate key. In particular, note that in the case of an empty relation, every subset of the heading has the uniqueness property, including the empty set.
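The superkey and candidate-key computation walked through above can be reproduced by brute force over attribute subsets. The following sketch (in Python, for illustration) encodes the two legal values r1 and r2 from the example and recovers the same three candidate keys:

```python
from itertools import combinations

ATTRS = ("A", "B", "C", "D")

# The two legal values of relvar R from the example above.
r1 = [("a1", "b1", "c1", "d1"), ("a1", "b2", "c2", "d1"), ("a2", "b1", "c2", "d1")]
r2 = [("a1", "b1", "c1", "d1"), ("a1", "b2", "c2", "d1"), ("a1", "b1", "c2", "d2")]

def unique_in(subset, rows):
    """True if no two rows agree on all attributes in `subset`."""
    idx = [ATTRS.index(a) for a in subset]
    projections = [tuple(row[i] for i in idx) for row in rows]
    return len(set(projections)) == len(projections)

# Superkeys of R: sets with the uniqueness property in every legal value.
superkeys = [set(s)
             for n in range(1, len(ATTRS) + 1)
             for s in combinations(ATTRS, n)
             if unique_in(s, r1) and unique_in(s, r2)]

# Candidate keys: superkeys with no proper subset that is also a superkey.
candidate_keys = [tuple(sorted(k)) for k in superkeys
                  if not any(s < k for s in superkeys)]

print(candidate_keys)  # [('B', 'C'), ('A', 'B', 'D'), ('A', 'C', 'D')]
```

Checking only r1 would wrongly admit ('A', 'B'), which is why both legal values must be intersected, as noted above.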

Determining candidate keys


The set of all candidate keys can be computed, e.g., from the set of functional dependencies. To this end we need to define the attribute closure α⁺ for an attribute set α. The set α⁺ contains all attributes that are functionally implied by α.

It is quite simple to find a single candidate key. We start with a set K of attributes and try to remove each attribute successively. If after removing an attribute the attribute closure stays the same, then this attribute is not necessary and we can remove it permanently. We call the result minimize(K). If K is the set of all attributes, then minimize(K) is a candidate key.
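The closure and minimization procedure can be sketched directly in Python. The FD set below (A → B, B → C over a schema A, B, C, D) is an assumed toy example, not from the article:

```python
def closure(attrs, fds):
    """Attribute closure: all attributes functionally implied by `attrs`.
    FDs are given as (lhs, rhs) pairs of frozensets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def minimize(attrs, fds):
    """Drop attributes one by one; keep the drop if the closure is unchanged."""
    key = set(attrs)
    for a in sorted(attrs):
        if closure(key - {a}, fds) == closure(key, fds):
            key -= {a}
    return key

# Toy schema R(A, B, C, D) with A -> B and B -> C (assumed example).
FDS = [(frozenset("A"), frozenset("B")), (frozenset("B"), frozenset("C"))]
ALL = set("ABCD")

key = minimize(ALL, FDS)
print(sorted(key))  # ['A', 'D']
```

Starting from all of ABCD, B and C can be dropped (they are implied by A), while A and D cannot, so {A, D} is a candidate key.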

Actually we can detect every candidate key with this procedure by simply trying every possible order of removing attributes. However there are many more permutations of attributes (n!) than subsets (2^n). That is, many attribute orders will lead to the same candidate key.

There is a fundamental difficulty for efficient algorithms for candidate key computation: certain sets of functional dependencies lead to exponentially many candidate keys. Consider the 2n functional dependencies {A_i → B_i : 1 ≤ i ≤ n} ∪ {B_i → A_i : 1 ≤ i ≤ n}, which yield 2^n candidate keys: {A_1, B_1} × … × {A_n, B_n}. That is, the best we can expect is an algorithm that is efficient with respect to the number of candidate keys.
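The exponential blowup is easy to confirm by brute force for a small n. The sketch below builds an FD family of the kind described (each A_i and B_i implying one another) and counts the minimal superkeys:

```python
from itertools import combinations

def closure(attrs, fds):
    """All attributes functionally implied by `attrs` under the FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

n = 3
attrs = [f"A{i}" for i in range(n)] + [f"B{i}" for i in range(n)]
# A_i -> B_i and B_i -> A_i for each i.
fds = [(frozenset({f"A{i}"}), frozenset({f"B{i}"})) for i in range(n)] \
    + [(frozenset({f"B{i}"}), frozenset({f"A{i}"})) for i in range(n)]

superkeys = [frozenset(s)
             for k in range(1, len(attrs) + 1)
             for s in combinations(attrs, k)
             if closure(s, fds) == set(attrs)]
candidate_keys = [k for k in superkeys if not any(s < k for s in superkeys)]
print(len(candidate_keys))  # 8, i.e. 2**n
```

Each candidate key picks exactly one of A_i or B_i per pair, so there are 2^n of them, as claimed.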

The following algorithm actually runs in polynomial time in the number of candidate keys and functional dependencies:[4]

function find_candidate_keys(A, F)
    /* A is the set of all attributes and F is the set of functional dependencies */
    K[0] := minimize(A);
    n := 1; /* Number of Keys known so far */
    i := 0; /* Currently processed key */
    while i < n do
        for each α → β ∈ F do
            /* Build a new potential key from the previous known key and the current FD */
            S := α ∪ (K[i] − β);
            /* Search whether the new potential key is part of the already known keys */ 
            found := false;
            for j := 0 to n-1 do
                if K[j] ⊆ S then found := true;
            /* If not, add it */
            if not found then
                K[n] := minimize(S);
                n := n + 1;
        i := i + 1
    return K

The idea behind the algorithm is that given a candidate key K and a functional dependency α → β, the reverse application of the functional dependency yields the set α ∪ (K − β), which is a key, too. It may however be covered by other already known candidate keys. (The algorithm checks this case using the 'found' variable.) If not, then minimizing the new key yields a new candidate key. The key insight is that all candidate keys can be created this way.
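As a sketch, the pseudocode above can be rendered in runnable Python, with `minimize` removing attributes in a fixed order. The FD set used to exercise it (A ↔ B, C → D over ABCD) is an assumed example:

```python
def closure(attrs, fds):
    """All attributes functionally implied by `attrs` under the FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def minimize(attrs, fds):
    """Remove attributes whose removal leaves the closure unchanged."""
    key = set(attrs)
    for a in sorted(attrs):
        if closure(key - {a}, fds) == closure(key, fds):
            key.discard(a)
    return frozenset(key)

def find_candidate_keys(A, F):
    """A: set of all attributes; F: list of (alpha, beta) FDs."""
    K = [minimize(A, F)]
    i = 0
    while i < len(K):
        for alpha, beta in F:
            # Reverse application of the FD on the known key K[i].
            S = set(alpha) | (K[i] - set(beta))
            # Skip S if it is covered by an already known key.
            if not any(kj <= S for kj in K):
                K.append(minimize(S, F))
        i += 1
    return K

FDS = [(frozenset("A"), frozenset("B")), (frozenset("B"), frozenset("A")),
       (frozenset("C"), frozenset("D"))]
keys = find_candidate_keys(set("ABCD"), FDS)
print(sorted(sorted(k) for k in keys))  # [['A', 'C'], ['B', 'C']]
```

Starting from the minimized key {B, C}, reversing A → B produces {A, C}, which minimization confirms as the second candidate key.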


from Grokipedia
In relational database theory, a candidate key is defined as a minimal set of one or more attributes within a relation that uniquely identifies each tuple, such that no proper subset of these attributes can perform the same unique identification. This minimality ensures that the key contains no redundant attributes, distinguishing it from broader superkeys, which are any sets of attributes (including candidate keys and their supersets) that also guarantee tuple uniqueness but may include extraneous elements. Candidate keys play a foundational role in maintaining data integrity and enabling efficient querying in relational models, as they provide multiple potential options for uniquely referencing records without duplication. Typically, database designers select one candidate key to serve as the primary key, which is enforced by the system to prevent null values and duplicates, while the remaining candidate keys are termed alternate keys and may support secondary indexes or constraints. For instance, in a relation representing students, both a unique student ID and a combination of name and birthdate could qualify as candidate keys if they each uniquely identify records. Beyond identification, candidate keys are integral to database normalization, particularly in achieving higher normal forms like Boyce-Codd Normal Form (BCNF), where every determinant in a functional dependency must be a candidate key to eliminate anomalies in insertion, update, and deletion operations. This requirement helps minimize redundancy and preserve dependencies across decomposed relations, ensuring robust schema design. In practice, identifying all candidate keys during the logical design phase, often through analysis of functional dependencies, allows for flexible enforcement of referential integrity via foreign keys that reference these unique identifiers.

Definition and Basics

Formal Definition

In relational database theory, a candidate key is defined as a minimal set of attributes within a relation schema R that uniquely identifies each tuple, ensuring that no two tuples share the same combination of values for those attributes, and no proper subset of the set possesses this property. This minimality means that removing any single attribute from the candidate key would allow duplicate tuples or fail to distinguish between some pairs of tuples. Formally, let K be a subset of the attributes of relation R. Then K is a candidate key if it satisfies the functional dependency K → R (meaning the values of K determine all attributes in R), and for every proper subset K′ ⊂ K, K′ ↛ R (no smaller subset determines the entire relation). This notation captures the essence of uniqueness through functional dependencies, where K → R implies that the relation projected onto K has no duplicate tuples. The concept of candidate keys originated in E. F. Codd's foundational relational model, introduced in 1970, where relations could possess multiple nonredundant keys, each serving as a unique identifier. Candidate keys form the minimal basis for superkeys, which are possibly non-minimal sets of attributes that also uniquely identify tuples.

Relation to Superkeys

A superkey in a relational database is defined as any set of one or more attributes that uniquely identifies each tuple in a relation, potentially including extraneous attributes that do not contribute to uniqueness. Every candidate key is a superkey, but the converse does not hold; candidate keys are the minimal sets of attributes among all superkeys that maintain uniqueness. Superkeys can be generated by extending candidate keys with additional attributes, preserving the unique identification property without necessity. For instance, if {A, B} serves as a candidate key in a relation, then the expanded set {A, B, C} qualifies as a superkey, though it is not minimal due to the redundant inclusion of C. The primary distinction lies in minimality: superkeys permit redundant attributes while still achieving uniqueness, whereas candidate keys demand irreducibility, ensuring no attribute can be removed without violating uniqueness.

Properties

Uniqueness and Irreducibility

A candidate key in a relation enforces the uniqueness property, ensuring that no two distinct tuples share the same values for the attributes comprising the key, thereby providing a one-to-one mapping from key values to tuples. This property, foundational to the relational model, guarantees that each tuple can be distinctly identified, as articulated in Codd's original work, where primary keys (a type of candidate key) must uniquely identify each tuple in a nonredundant manner. The irreducibility property complements uniqueness by requiring that no proper subset of the key's attributes possesses the uniqueness property on its own; in other words, removing any single attribute from the key would result in a set that fails to uniquely identify all tuples. This minimality ensures the key is as concise as possible while maintaining full discriminatory power, and it is verified through analysis of functional dependencies within the relation schema. These properties collectively bolster data integrity by preventing the insertion of duplicate records that could otherwise lead to inconsistencies or redundant information in the database. Furthermore, they support referential integrity across related tables, as foreign keys can reliably reference candidate keys to establish valid links without risking mismatches due to non-unique or reducible identifiers. In SQL database management systems, the uniqueness property is enforced through UNIQUE constraints, which prevent duplicate values in the specified columns and typically create a unique index for efficient checking. However, since standard UNIQUE constraints permit NULL values (treating multiple NULLs as non-duplicate), candidate keys require additional NOT NULL constraints on all attributes to fully emulate the non-null requirement mandated by relational theory, distinguishing them from mere unique sets that might allow nulls.
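The UNIQUE-plus-NOT-NULL pattern can be demonstrated with a small sketch, here using Python's built-in sqlite3 module and a hypothetical student table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# UNIQUE alone still admits NULLs, so the candidate-key column also gets NOT NULL.
conn.execute("CREATE TABLE student (student_id INTEGER NOT NULL UNIQUE, name TEXT)")
conn.execute("INSERT INTO student VALUES (1, 'Ada')")

try:
    conn.execute("INSERT INTO student VALUES (1, 'Grace')")  # duplicate key value
except sqlite3.IntegrityError as e:
    print("rejected duplicate:", e)

try:
    conn.execute("INSERT INTO student VALUES (NULL, 'Edsger')")  # NULL key value
except sqlite3.IntegrityError as e:
    print("rejected NULL:", e)
```

Both offending inserts fail with an integrity error, so student_id behaves as a candidate key: unique and non-null.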

Composition and Cardinality

A candidate key may consist of a single attribute, referred to as a simple candidate key, when that attribute alone uniquely identifies each tuple in a relation. In the relational model, such a key draws on a domain whose values are nonredundant and sufficient for unique identification, as exemplified by a dedicated identifier field. When no individual attribute provides uniqueness, a candidate key becomes composite, formed by the combination of two or more attributes that together uniquely identify tuples. This structure ensures nonredundancy, meaning no subset of the attributes can be removed without losing the unique identification property. For instance, a composite candidate key might involve {LastName, FirstName, BirthDate} to distinguish records where single attributes overlap. The cardinality of a candidate key, defined as the number of attributes it comprises, varies across different relations and even within the same relation, where multiple candidate keys of differing lengths may exist. A relation might have one simple candidate key and another with higher cardinality, reflecting the diverse ways uniqueness can be achieved based on the attributes' functional dependencies. All such compositions presuppose uniqueness as a fundamental property. Larger candidate keys impose greater storage demands due to the increased size of index entries and key values, which in turn elevates indexing overhead and can degrade query performance, especially in operations involving joins or searches on primary keys derived from these candidates.

Identification Methods

Using Functional Dependencies

A functional dependency (FD) is a constraint on a relation that specifies, for sets of attributes X and Y, that if two tuples agree on all attributes in X, they must agree on all attributes in Y, denoted X → Y. The dependency holds if no two distinct tuples with the same X-values have differing Y-values. Functional dependencies provide the theoretical foundation for identifying candidate keys, where a candidate key is a minimal set of attributes X such that X → R, meaning X functionally determines all attributes in the relation schema R. Given a set of FDs F over R, the candidate keys are the minimal superkeys derivable from F, since superkeys in general are possibly non-minimal sets satisfying the same determination property. To determine whether a set X is a superkey, compute its attribute closure X⁺ with respect to F, defined as the set of all attributes functionally determined by X using the FDs in F. X is a superkey if X⁺ = R, and it is a candidate key if, additionally, no proper subset of X has a closure equal to R, confirming minimality. The closure X⁺ is derived by applying Armstrong's axioms repeatedly: reflexivity (if Y ⊆ X, then X → Y), augmentation (if X → Y, then XZ → YZ), and transitivity (if X → Y and Y → Z, then X → Z). These axioms are sound and complete, generating all implied FDs in the closure F⁺. Before computing closures for key identification, it is often useful to derive a canonical cover of F, a minimal equivalent set of FDs with no redundant attributes or dependencies. The process involves removing extraneous attributes from the left and right sides of each FD and eliminating redundant FDs, resulting in a simplified set that preserves the semantics of F for closure computations. This reduction aids efficient verification of key minimality without altering the implied dependencies.

Computational Algorithms

Computational algorithms for identifying candidate keys from a relational schema typically take the given set of functional dependencies (FDs) as input. These methods aim to systematically determine the minimal superkeys that uniquely identify tuples in the relation. The attribute closure algorithm forms the foundation for key discovery: it iteratively expands a starting set of attributes using the FDs until no further attributes can be added. To apply it to candidate keys, the closure of each possible subset of attributes is computed; a subset qualifies as a superkey if its closure encompasses all attributes in the schema, and it is minimal (a candidate key) if removing any attribute from it results in a non-superkey. This process, while straightforward, requires checking many subsets to ensure minimality. Minimal key finders build on closure computations through techniques that test attribute subsets for superkey status, incorporating optimizations to mitigate exponential growth in complexity. One such approach exploits the arrangement of attributes in FD graphs to identify essential attributes first (those not appearing on the right-hand side of any FD) and then builds keys by combining them with non-essential ones via graph connectivity analysis, avoiding exhaustive enumeration in many cases. For instance, starting from a superkey of all attributes and iteratively reducing subsets while verifying closures prunes redundant checks. Practical tools can automate the computation, often integrating FD mining with key identification. Schema analysis features in common database administration tools support visualizing dependencies and manually verifying keys, though full automation typically requires extensions. Open-source libraries, such as the Python-based FDTool, mine minimal FDs from tabular data and directly infer candidate keys using closure-based methods, providing outputs such as equivalent attribute sets for large datasets.
Regarding complexity, computing all candidate keys is NP-hard in the worst case due to the need to enumerate and verify minimal transversals over the FD set. However, for acyclic schemas, where the dependency graph has no cycles, the problem reduces to polynomial time via efficient traversal algorithms. Heuristics, such as level-wise lattice search with pruning in tools like FDTool, enable scalable application to large datasets by focusing on promising subsets early.

Examples

Single-Attribute Candidate Key

A single-attribute candidate key occurs in a relation where one attribute alone uniquely identifies each tuple, assuming it is unique and non-null throughout the relation. For instance, in an Employees relation, the EmployeeID attribute may serve as the sole candidate key, ensuring no two employees share the same identifier. This setup leverages the uniqueness property to prevent duplicates, allowing reliable entity identification without additional attributes. Consider the relation schema R(EmployeeID, Name, Department), where the functional dependency EmployeeID → {Name, Department} holds, meaning the value of EmployeeID determines the values of the other attributes. Here, EmployeeID functions as the minimal set required for unique identification, illustrating the simplicity of a single-attribute key in relational schemas. The use of a single-attribute candidate key offers several practical benefits in database design. It enables efficient indexing, as the database management system can create a compact index on just one attribute for fast lookups and retrievals. Additionally, it minimizes storage overhead by avoiding multi-attribute keys, reducing the size of indexes and join operations. Joins become straightforward, as referencing a single attribute simplifies queries across related tables without complex composite matching. In real-world applications, single-attribute candidate keys are commonly implemented as surrogate keys, such as auto-incrementing integer IDs in SQL tables, which are system-generated to provide a simple, artificial identifier. For example, in SQL Server this can be specified using the IDENTITY property, while MySQL employs AUTO_INCREMENT to automatically assign sequential values.

Composite Candidate Key

A composite candidate key arises when no single attribute in a relation can uniquely identify each tuple, necessitating the combination of multiple attributes to achieve uniqueness while maintaining minimality, meaning the removal of any attribute from the set would violate this property. This irreducibility ensures that the key is as concise as possible without redundancy. In relational theory, such keys are essential for scenarios where individual attributes alone are insufficient due to the inherent multiplicity in real-world data relationships. Consider a relation named Orders with attributes OrderDate, CustomerID, and Product. Here, {OrderDate, CustomerID} functions as a composite candidate key, as neither attribute alone uniquely identifies a tuple: a single customer can place multiple orders, and the same date can involve orders from various customers, but their combination ensures distinctness. The relevant functional dependency is {OrderDate, CustomerID} → {Product}, demonstrating full determination by the composite key, whereas OrderDate does not functionally determine CustomerID, and vice versa, confirming that no proper subset qualifies as a key. This structure highlights the complexity of modeling temporal and entity-based uniqueness in transactional data. Composite candidate keys introduce challenges such as elevated join costs in query execution, as the multi-attribute nature results in larger index sizes and more complex matching during table joins compared to single-attribute keys. Furthermore, they heighten the potential for partial dependencies in unnormalized relations, where a non-key attribute might depend on only one component of the key, fostering insertion, update, and deletion anomalies that complicate data integrity. In practice, composite candidate keys often appear in legacy database systems employing non-surrogate natural keys to leverage existing business data without artificial identifiers.
For instance, in a Books relation, {ISBN, Edition} may serve as a composite candidate key, accommodating cases where different editions of a book share core identifiers but require distinction for inventory and cataloging purposes. This approach preserves semantic richness but demands careful schema design to mitigate performance overheads.
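A composite candidate key can be enforced the same way as a simple one, with a multi-column UNIQUE constraint. The sketch below uses Python's sqlite3 module and an Orders table modeled on the example above (column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Composite candidate key {OrderDate, CustomerID}: unique together, not alone.
conn.execute("""CREATE TABLE Orders (
    OrderDate  TEXT    NOT NULL,
    CustomerID INTEGER NOT NULL,
    Product    TEXT,
    UNIQUE (OrderDate, CustomerID))""")

conn.execute("INSERT INTO Orders VALUES ('2024-01-05', 7, 'widget')")
conn.execute("INSERT INTO Orders VALUES ('2024-01-05', 8, 'gadget')")  # same date, ok
conn.execute("INSERT INTO Orders VALUES ('2024-01-06', 7, 'widget')")  # same customer, ok

try:
    conn.execute("INSERT INTO Orders VALUES ('2024-01-05', 7, 'gizmo')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)  # duplicate (OrderDate, CustomerID) pair
```

Rows repeating either column individually are accepted; only the repeated pair is rejected, which is exactly the composite-key behavior described above.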

Applications and Relations

Selection as Primary Key

In relational database design, selecting a primary key from multiple candidate keys involves evaluating attributes for simplicity, stability, and efficiency to ensure good performance and integrity. Simplicity favors single-attribute keys over composites, as they reduce complexity in joins and indexing, while compact numeric attributes are preferred for their small storage footprint and faster comparisons. Stability prioritizes keys with values that rarely change, minimizing update cascades across related tables. Efficiency considers indexing overhead, where smaller data types like integers outperform variable-length strings in query execution and storage. These criteria guide database administrators to designate one candidate key as primary, ensuring it uniquely identifies rows without nulls or duplicates. The remaining candidate keys are designated as alternate keys, which maintain uniqueness but do not serve as the primary identifier. These are enforced through UNIQUE constraints, allowing multiple unique identifiers per table while avoiding the stricter not-null requirement of primary keys. Alternate keys support flexible querying and indexing without impacting the clustered index structure tied to the primary key. In SQL implementations, the selection is formalized using the ALTER TABLE statement to add a PRIMARY KEY constraint on the chosen candidate key columns. In many database management systems this creates a unique (often clustered) index on those columns, optimizing row retrieval and enforcing integrity at the storage level. For instance, nonclustered primary keys can be specified if a separate clustered index is preferred, balancing access patterns. Trade-offs in selection often contrast natural keys, derived from business data such as existing identifiers, against surrogate keys, which are system-generated artificial values such as integers or UUIDs.
Natural keys leverage inherent domain logic but risk performance degradation from larger sizes or value changes, whereas surrogate keys enhance stability and scalability in distributed environments by decoupling from business volatility, though they require additional unique constraints on the natural alternates. In high-volume systems, surrogate keys reduce insert/delete costs in some cases but may increase overall index maintenance.

Role in Database Normalization

Candidate keys play a pivotal role in database normalization by serving as the foundational elements for identifying and eliminating partial, transitive, and other dependencies that lead to data redundancies and update anomalies. In the normalization process, these keys define the minimal sets of attributes that uniquely identify tuples, enabling designers to decompose relations into smaller, dependency-free components while preserving the relational structure. This ensures that non-key attributes depend solely on entire candidate keys, rather than on subsets or intermediaries, thereby maintaining data integrity across operations like insertion, deletion, and modification. In second normal form (2NF) and third normal form (3NF), candidate keys are essential for detecting partial and transitive dependencies. A relation achieves 2NF when it is in first normal form and no non-prime attribute depends on a proper subset of any candidate key; if such a partial dependency exists, decomposition is required to isolate the dependent attributes into a separate relation projected over the partial determinant and its dependents. For 3NF, the process extends to transitive dependencies, where non-prime attributes must depend directly on candidate keys rather than on other non-prime attributes; violations prompt decomposition to ensure that all determinants involving non-prime attributes are tied to full candidate keys. This key-centric approach in 2NF and 3NF guarantees that updates to non-key data do not propagate redundantly across tuples. Boyce-Codd Normal Form (BCNF) further refines this by mandating that every determinant in a non-trivial functional dependency must be a superkey, addressing cases where 3NF relations still harbor anomalies due to non-key determinants. If a dependency α → β holds where α is not a superkey (and thus not a candidate key), the relation violates BCNF, necessitating decomposition into projections over α ∪ β and α ∪ (R − β), with the original dependencies projected accordingly to maintain equivalence. This stricter reliance on superkeys eliminates more subtle redundancies, such as those arising from overlapping keys, though it may not always preserve all dependencies without loss.
For higher normal forms like fourth normal form (4NF) and fifth normal form (5NF), candidate keys aid in detecting multivalued dependencies (MVDs) and join dependencies that exceed functional dependency constraints. In 4NF, a relation is normalized if, for every non-trivial MVD α →→ β, α is a superkey, meaning candidate keys help identify and decompose relations where independent multi-valued facts about a key lead to spurious tuples. Similarly, 5NF requires that every join dependency is implied solely by the candidate keys of the relation, prompting decomposition into cyclic projections to resolve complex interdependencies without redundancy. These key-based checks ensure that normalization handles advanced anomaly scenarios in relations with multiple independent associations. The benefits of leveraging candidate keys in normalization include ensuring lossless decomposition, where the join of projected relations reconstructs the original without spurious tuples, as verified by the keys' role in the chase algorithm or dependency closure. Additionally, key-centered decompositions promote dependency preservation, particularly in 3NF, allowing local enforcement of functional dependencies without global recomputation, which enhances query efficiency and data consistency in relational databases.
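The BCNF check described above (find an FD whose determinant is not a superkey, then decompose over α ∪ β and α ∪ (R − β)) can be sketched in Python. The Student/Course/Teacher schema is an assumed textbook-style example:

```python
def closure(X, fds):
    """All attributes functionally implied by X under the FDs."""
    result = set(X)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def bcnf_violation(R, fds):
    """Return a non-trivial FD alpha -> beta whose determinant is not a
    superkey, or None if R is in BCNF under fds."""
    for alpha, beta in fds:
        if not beta <= alpha and closure(alpha, fds) != R:
            return alpha, beta
    return None

# Assumed schema R(Student, Course, Teacher) with
# {Student, Course} -> Teacher and Teacher -> Course.
R = {"Student", "Course", "Teacher"}
F = [(frozenset({"Student", "Course"}), frozenset({"Teacher"})),
     (frozenset({"Teacher"}), frozenset({"Course"}))]

alpha, beta = bcnf_violation(R, F)  # Teacher -> Course violates BCNF
# Decompose into alpha ∪ beta and alpha ∪ (R − beta).
print(sorted(alpha | beta))        # ['Course', 'Teacher']
print(sorted(alpha | (R - beta)))  # ['Student', 'Teacher']
```

Teacher → Course is the violating dependency (Teacher is not a superkey), and the decomposition yields the familiar (Teacher, Course) and (Student, Teacher) projections.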
