Primary key
In the relational model of databases, a primary key is a designated set of attributes (column(s)) that can reliably identify and distinguish between each individual record in a table. The database creator can choose an existing unique attribute or combination of attributes from the table (a natural key) to act as its primary key, or create a new attribute containing a unique ID that exists solely for this purpose (a surrogate key).
Examples of natural keys that could be suitable primary keys include data that is already by definition unique to all items in the table such as a national identification number attribute for person records, or the combination of a very precise timestamp attribute with a very precise location attribute for event records.
More formally, a primary key is a specific choice of a minimal set of attributes that uniquely specify a tuple (row) in a relation (table).[a][1] A primary key is one candidate key (a minimal superkey) selected by the designer; any other candidate key is an alternate key.
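The idea of a minimal set of attributes that uniquely specifies a row can be illustrated on sample data. The following is a minimal sketch, not a definitive method: the rows, the attribute names, and the helper functions `is_unique` and `minimal_unique_sets` are all hypothetical. It lists the attribute sets that are unique within the sample and have no unique proper subset.

```python
from itertools import combinations


def is_unique(rows, attrs):
    """Return True if no two rows agree on every attribute in attrs."""
    seen = set()
    for row in rows:
        key = tuple(row[a] for a in attrs)
        if key in seen:
            return False
        seen.add(key)
    return True


def minimal_unique_sets(rows, attributes):
    """Return attribute sets that are unique in rows and have no unique proper subset."""
    found = []
    for size in range(1, len(attributes) + 1):
        for attrs in combinations(attributes, size):
            # A superset of an already found set is a superkey, not a minimal one.
            if any(set(k) <= set(attrs) for k in found):
                continue
            if is_unique(rows, attrs):
                found.append(attrs)
    return found


# Hypothetical sample data, invented for illustration only.
rows = [
    {"student_id": 1, "email": "ana@example.org", "name": "Ana", "year": 1},
    {"student_id": 2, "email": "ben@example.org", "name": "Ben", "year": 1},
    {"student_id": 3, "email": "ana2@example.org", "name": "Ana", "year": 2},
]
keys = minimal_unique_sets(rows, ["student_id", "email", "name", "year"])
print(keys)
```

On this sample the combination of name and year also comes out as unique, which shows the limit of the approach: uniqueness in one set of rows only suggests a candidate key. Whether it holds for every possible row is a design decision about the data.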
Design
In relational database terms, a primary key does not differ in form or function from a key that isn't primary. In practice, various motivations may determine the choice of any one key as primary over another. The designation of a primary key may indicate the "preferred" identifier for data in the table, that the key is to be used for foreign key references from other tables, or some other technical rather than semantic feature of the table. Some languages and software have special syntax features that can be used to identify a primary key as such (e.g. the PRIMARY KEY constraint in SQL).
The relational model, as expressed through relational calculus and relational algebra, does not distinguish between primary keys and other kinds of keys. Primary keys were added to the SQL standard mainly as a convenience to the application programmer.
Primary key values can be an incrementing integer, a universally unique identifier (UUID), or a value generated with the Hi/Lo algorithm.
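As an illustration of the Hi/Lo approach, the sketch below assumes the common description of the algorithm: a client fetches a "hi" value from a shared source, then derives a block of identifiers locally by combining it with a "lo" counter. The class `HiLoGenerator`, the block size, and the in-memory counter standing in for a database sequence are all hypothetical.

```python
import itertools


class HiLoGenerator:
    """Hands out integer IDs in blocks, fetching a new 'hi' value only
    when the current block is used up."""

    def __init__(self, next_hi, block_size=100):
        self._next_hi = next_hi        # callable returning the next shared 'hi' value
        self._block_size = block_size  # number of IDs covered by one 'hi' value
        self._hi = None
        self._lo = block_size          # forces a fetch on first use

    def next_id(self):
        if self._lo >= self._block_size:
            self._hi = self._next_hi()  # one round trip to the shared source per block
            self._lo = 0
        value = self._hi * self._block_size + self._lo
        self._lo += 1
        return value


# Stand-in for a database sequence shared by all clients (hypothetical).
shared_counter = itertools.count(0)


def next_hi():
    return next(shared_counter)


client_a = HiLoGenerator(next_hi, block_size=3)
client_b = HiLoGenerator(next_hi, block_size=3)

ids_a = [client_a.next_id() for _ in range(4)]  # uses block 0, then starts block 1
ids_b = [client_b.next_id() for _ in range(2)]  # starts block 2
print(ids_a, ids_b)
```

Because each "hi" value is handed out only once, the two clients never produce the same identifier, at the cost of gaps when a block is not fully used.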
Defining primary keys in SQL
Primary keys are defined in the ISO SQL Standard, through the PRIMARY KEY constraint. The syntax to add such a constraint to an existing table is defined in SQL:2003 like this:
ALTER TABLE <table identifier>
    ADD [ CONSTRAINT <constraint identifier> ]
    PRIMARY KEY ( <column name> [ {, <column name> }... ] )
The primary key can also be specified directly during table creation. In the SQL Standard, primary keys may consist of one or multiple columns. Each column participating in the primary key is implicitly defined as NOT NULL. Some RDBMS, however, require primary key columns to be marked explicitly as NOT NULL.
CREATE TABLE table_name (
    ...
)
If the primary key consists only of a single column, the column can be marked as such using the following syntax:
CREATE TABLE table_name (
    id_col INT PRIMARY KEY,
    col2 CHARACTER VARYING(20),
    ...
)
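The effect of such a declaration can be tried out with SQLite through Python's standard sqlite3 module. This is a minimal sketch under stated assumptions: the enrollment table and its columns are hypothetical, and SQLite is chosen only because it runs without a server. NOT NULL is written out explicitly because SQLite, for historical reasons, does not imply it for primary key columns of ordinary tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE enrollment (
        student_id  INTEGER NOT NULL,
        course_code TEXT    NOT NULL,
        grade       TEXT,
        PRIMARY KEY (student_id, course_code)  -- composite primary key
    )
    """
)
conn.execute("INSERT INTO enrollment VALUES (1, 'DB101', 'A')")
# Same student, different course: the combination is still unique.
conn.execute("INSERT INTO enrollment VALUES (1, 'OS201', 'B')")

try:
    # Repeats an existing (student_id, course_code) pair.
    conn.execute("INSERT INTO enrollment VALUES (1, 'DB101', 'C')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM enrollment").fetchone()[0]
print(duplicate_rejected, row_count)
```

The third insert fails because the key as a whole is duplicated, even though each column value on its own may repeat.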
Surrogate keys
In some circumstances the natural key that uniquely identifies a tuple in a relation may be cumbersome to use for software development. For example, it may involve multiple columns or large text fields. In such cases, a surrogate key can be used instead as the primary key. In other situations there may be more than one candidate key for a relation, and no candidate key is obviously preferred. A surrogate key may be used as the primary key to avoid giving one candidate key artificial primacy over the others.
Since primary keys exist primarily as a convenience to the programmer, surrogate primary keys are often used, in many cases exclusively, in database application design.
Due to the popularity of surrogate primary keys, many developers and in some cases even theoreticians have come to regard surrogate primary keys as an inalienable part of the relational data model. This is largely due to a migration of principles from the object-oriented programming model to the relational model, creating the hybrid object–relational model. In object–relational mapping (ORM) patterns such as active record, these additional restrictions are placed on primary keys:
- Primary keys should be immutable, that is, never changed or re-used; they should be deleted along with the associated record.
- Primary keys should be anonymous integer or numeric identifiers.
However, neither of these restrictions is part of the relational model or any SQL standard. Due diligence should be applied when deciding on the immutability of primary key values during database and application design. Some database systems even imply that values in primary key columns cannot be changed using the SQL UPDATE statement.
Alternate key
Typically, one candidate key is chosen as the primary key. Other candidate keys become alternate keys, each of which may have a UNIQUE constraint assigned to it in order to prevent duplicates (a duplicate entry is not valid in a unique column).[2]
Alternate keys may be used like the primary key when doing a single-table SELECT or when filtering in a WHERE clause, but are not typically used to join multiple tables.
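A short sketch of this arrangement, again using SQLite through Python's sqlite3 module as an assumption made for runnability. The person table, its columns, and the sample values are hypothetical. The surrogate column is the primary key, and the natural identifier is kept as an alternate key through a UNIQUE constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE person (
        person_id   INTEGER PRIMARY KEY,    -- surrogate primary key
        national_id TEXT NOT NULL UNIQUE,   -- alternate key (natural)
        full_name   TEXT NOT NULL
    )
    """
)
conn.execute("INSERT INTO person (national_id, full_name) VALUES ('X-100', 'Ana Silva')")
conn.execute("INSERT INTO person (national_id, full_name) VALUES ('X-200', 'Ben Okoro')")

# Single-table lookup filtered on the alternate key.
found = conn.execute(
    "SELECT person_id, full_name FROM person WHERE national_id = ?", ("X-200",)
).fetchone()

try:
    # Reuses an existing national_id, so the UNIQUE constraint rejects it.
    conn.execute("INSERT INTO person (national_id, full_name) VALUES ('X-100', 'Someone Else')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(found, duplicate_rejected)
```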
Notes
- [a] Corresponding terms are respectively theoretical (attribute, tuple, relation) and concrete (column, row, table).
References
- [1] "Add or change a table's primary key in Access". Microsoft. Retrieved January 20, 2020. "A primary key is a field or set of fields with values that are unique throughout a table."
- [2] Alternate key – Oracle FAQ
Primary key
Fundamentals
Definition and Purpose
A primary key is one or more columns in a relational database table that uniquely identifies each row, or tuple, ensuring entity integrity by guaranteeing that no two rows share the same key value.[2] This uniqueness prevents duplicate records and ambiguous references within the table, forming a foundational mechanism for maintaining data consistency in relational systems.[5]
In the relational model, the primary key supports referential integrity by serving as the target for foreign keys in other tables, which enforce valid relationships between entities and prevent orphaned records.[6] It also enables efficient joins between tables, allowing queries to combine data across relations based on matching key values, thus facilitating complex data retrieval without redundancy.[7] These functions were central to Edgar F. Codd's 1970 relational model, where primary keys provide logical identifiers for tuples, replacing physical pointers to promote data independence and integrity.[8]
As a unique identifier for entities in data modeling, the primary key underpins one-to-many relationships, where a single primary key value in one table can link to multiple foreign key instances in another.[7] It is essential for normalization processes, such as achieving first normal form (1NF) by ensuring row uniqueness and second normal form (2NF) by requiring non-key attributes to depend fully on the entire primary key rather than subsets.[9] This role helps eliminate anomalies and supports scalable, maintainable database designs.[10]
Key Properties
A primary key in a relational database must ensure uniqueness, meaning that every value (or combination of values in the case of a composite key) in the primary key column or columns is distinct across all rows in the table, preventing duplicates and allowing each row to be reliably identified. This property is fundamental to the relational model, as originally defined by E. F. Codd, where a primary key is a domain or combination of domains that uniquely identifies each tuple in a relation.[8] Modern database management systems (DBMS) enforce this through automatic creation of a unique index on the primary key columns.[2]
Primary keys also require non-nullability, prohibiting NULL values in the designated columns, since NULLs would undermine uniqueness and the ability to identify rows definitively. Every column in a primary key is treated as NOT NULL: PostgreSQL applies the constraint implicitly when a primary key is declared, and SQL Server does the same for key columns whose nullability is not specified.[2][11] This ensures entity integrity, guaranteeing that no row lacks a valid identifier.
The immutability of primary key values is a design principle that preserves referential stability, particularly in tables linked by foreign keys. Changes to primary key values are discouraged and, if necessary, are often handled by deleting and re-inserting the affected rows to avoid cascading updates. While DBMS do not strictly enforce immutability, updating a primary key can complicate relationships and data consistency, as noted in SQL Server documentation on key modifications.[12]
A table permits at most one primary key, though it may consist of multiple columns forming a composite key, providing flexibility while maintaining a single unique identifier per relation. This restriction aligns with relational theory, where one nonredundant key is selected as primary from potentially multiple candidates.[8][2][11]
Minimalism dictates that the primary key include only the essential columns needed to achieve uniqueness, avoiding superfluous attributes to keep the key as simple and efficient as possible. Codd emphasized nonredundancy, meaning that no participating domain is superfluous for uniquely identifying each tuple.[8]
Finally, enforcement occurs at the row level by the DBMS, which validates inserts, updates, and deletes against the primary key constraints to uphold data integrity, while automatically indexing the key for efficient lookups and joins. In systems like PostgreSQL, this involves creating a unique B-tree index, and in SQL Server, a clustered index by default unless specified otherwise.[2][11]
Design Considerations
Natural Keys
Natural keys are primary keys derived from attributes that exist in the real-world data and inherently uniquely identify entities within a database relation, such as a Social Security Number (SSN) for individuals or an International Standard Book Number (ISBN) for books.[13][14] These keys leverage domain-specific data that holds logical meaning, distinguishing them from artificial identifiers.[15]
One key advantage of natural keys is their semantic value, making them human-readable and intuitive for business users, as they directly reflect the entity's characteristics without requiring additional lookup.[16] They also impose no extra storage overhead beyond the existing data and can enforce business rules intrinsically, such as uniqueness mandated by external standards like ISBN allocation.[13] In stable domains, natural keys promote efficiency in querying by reducing the need for joins when relationships rely on meaningful attributes.[17]
However, natural keys carry significant disadvantages, including potential instability due to real-world changes, such as updates to an employee's name or address, which can necessitate cascading modifications across related tables.[14] They may introduce scalability challenges in large datasets, where composite natural keys (e.g., combining multiple fields) lead to wider indexes and slower joins compared to compact identifiers.[17] Privacy and security risks are particularly acute, as natural keys often comprise personally identifiable information (PII) like SSNs or email addresses, exposing sensitive data and complicating compliance with regulations such as the EU's General Data Protection Regulation (GDPR), which emphasizes the "right to be forgotten" and minimization of personal data processing.[18] Non-uniqueness can also arise from real-world errors, such as duplicate entries due to data entry mistakes.[15]
Selection criteria for natural keys focus on domains where the attributes are stable, guaranteed unique by external rules, and non-null, such as a Vehicle Identification Number (VIN) for automobiles or a product Stock Keeping Unit (SKU) in inventory systems.[19] They are suitable for employee records using a stable employee ID assigned by HR policies, provided the data remains immutable and verifiable.[17] Natural keys should be avoided in scenarios prone to frequent changes or high privacy sensitivity, where surrogate keys offer greater abstraction and stability.[18]
In relational database normalization, natural keys support higher normal forms by capitalizing on functional dependencies inherent to the business domain, ensuring that each non-key attribute depends on the key without redundancy.[20] This alignment with real-world semantics aids in achieving third normal form (3NF) or beyond, as the keys naturally enforce determinacy in attribute relationships.[14]
Common pitfalls include over-reliance on composite natural keys, which complicate queries and maintenance due to their multi-column nature, and failing to account for privacy implications, such as using personal IDs in publicly accessible systems, potentially leading to GDPR violations through unintended data exposure.[18]
Surrogate Keys
Surrogate keys are artificial identifiers generated by the database system, typically as numeric or string values with no inherent business meaning, employed as primary keys when natural keys prove unstable, composite, or otherwise unsuitable for uniquely identifying records.[21][22]
These keys offer several advantages, including guaranteed uniqueness without reliance on changing business data, which ensures stability even if underlying attributes like customer emails or product codes are updated.[23] They simplify implementation by automating value assignment, facilitate efficient indexing and join operations due to their compact, sequential nature, and mitigate privacy risks by avoiding exposure of sensitive natural identifiers in queries or APIs.[21][24]
However, surrogate keys introduce drawbacks such as increased storage requirements from an additional column per table, reduced user interpretability that necessitates maintaining separate natural keys for reporting and auditing, and risks of collisions or coordination challenges in distributed systems where centralized generation may bottleneck scalability.[23][22]
Common generation methods for surrogate keys include auto-increment mechanisms like IDENTITY columns in SQL Server, which produce sequential integers ideal for single-node databases but prone to gaps or exhaustion in high-volume scenarios.[21] Database sequences, as used in Oracle and PostgreSQL, provide reusable integer generators that support custom incrementing for better control, though they require central coordination that can hinder performance in distributed environments.[21] For distributed systems, universally unique identifiers (UUIDs) or GUIDs offer decentralized generation without coordination, enabling offline or multi-node inserts, but their larger size (128 bits) increases storage and indexing overhead compared to integers.[22][25]
Surrogate keys find application in environments with frequent data changes, such as user account tables where identifiers like emails may change, or in dataset merging across sources where natural keys overlap or lack stability.[21] They are particularly valuable in distributed and cloud databases, such as YugabyteDB or BigQuery, where UUID generation supports scalable, partition-tolerant inserts without central bottlenecks, accommodating composite or unstable natural keys in IoT or multi-system integrations.[22][25]
Best practices recommend employing 64-bit integers (e.g., BIGINT) for surrogate keys to ensure scalability up to billions of records without overflow, while avoiding their exposure in external APIs to prevent enumeration attacks or unintended data leakage.[21] In distributed setups, prefer UUID variants like v4 for randomness or v7 for time-ordering to balance uniqueness with query efficiency.[22]
Implementation
Defining in SQL
In standard SQL, a primary key is defined during table creation using the CREATE TABLE statement, either inline within a column definition for single-column keys or as a table constraint for single or composite keys. For a single-column primary key inline, the syntax is column_name data_type PRIMARY KEY, ensuring the column uniquely identifies each row and implicitly enforces NOT NULL.[26] For example:
CREATE TABLE employees (
    id INT PRIMARY KEY,
    name VARCHAR(50)
);
As a table constraint, the syntax is [CONSTRAINT constraint_name] PRIMARY KEY (column1, column2, ...), placed after all column definitions. This allows the combination of columns to serve as the unique identifier. An example in an e-commerce schema for an orders table might be:
CREATE TABLE orders (
    order_id INT,
    customer_id INT,
    order_date DATE,
    PRIMARY KEY (order_id, customer_id)
);
A primary key can be added to an existing table with the ALTER TABLE statement, using the syntax ALTER TABLE table_name ADD [CONSTRAINT constraint_name] PRIMARY KEY (column1 [, column2, ...]). For instance, adding a primary key to an existing employees table:
ALTER TABLE employees
    ADD CONSTRAINT pk_employees PRIMARY KEY (id);
A primary key can be removed with ALTER TABLE table_name DROP CONSTRAINT constraint_name (or DROP PRIMARY KEY in some systems without a named constraint), which removes the uniqueness enforcement but leaves the data intact.[29]
Database management systems (DBMS) largely follow ANSI SQL but differ in how they declare auto-incrementing primary keys. The SQL standard provides the GENERATED [ALWAYS | BY DEFAULT] AS IDENTITY clause (introduced in SQL:2003 and still present in SQL:2023) to define auto-incrementing columns that can serve as primary keys, which aids portability across compliant DBMS. For example:
CREATE TABLE products (
    product_id INTEGER GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    name VARCHAR(100)
);
MySQL uses the AUTO_INCREMENT attribute instead, written as AUTO_INCREMENT PRIMARY KEY, as in:
CREATE TABLE products (
    product_id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100)
);
In PostgreSQL, the SERIAL pseudo-type creates an auto-incrementing integer column backed by a sequence. It is declared as column_name SERIAL PRIMARY KEY, which is shorthand for an INTEGER NOT NULL column with DEFAULT nextval('sequence_name') and a PRIMARY KEY constraint. For example:
CREATE TABLE departments (
    dept_id SERIAL PRIMARY KEY,
    dept_name VARCHAR(50)
);
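Automatic key assignment can be observed with SQLite through Python's sqlite3 module. This is a minimal sketch under stated assumptions: SQLite uses neither the IDENTITY, AUTO_INCREMENT, nor SERIAL syntax shown above, but assigns a value to an INTEGER PRIMARY KEY column when none is supplied. The products table and its rows are hypothetical, and the numbering behaviour shown is SQLite's own; other systems may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

# No product_id supplied: SQLite assigns one automatically.
first = conn.execute("INSERT INTO products (name) VALUES ('Widget')").lastrowid
second = conn.execute("INSERT INTO products (name) VALUES ('Gadget')").lastrowid

# An explicit value is accepted as long as it is not already in use.
explicit = conn.execute(
    "INSERT INTO products (product_id, name) VALUES (10, 'Gizmo')"
).lastrowid

# The next automatic value continues after the largest key in the table.
after_explicit = conn.execute("INSERT INTO products (name) VALUES ('Doohickey')").lastrowid

print(first, second, explicit, after_explicit)
```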
Constraints and Enforcement
Database management systems (DBMS) enforce primary key constraints to maintain data integrity by validating that primary key values are unique and non-null during INSERT and UPDATE operations. This validation occurs automatically at the row level, preventing the insertion or modification of data that would violate these rules. For tables with foreign key relationships, changes to primary key values can trigger cascading actions defined in the foreign key constraints, such as updates or deletions propagating to dependent tables to preserve referential integrity.[11][2][32]
Primary keys automatically generate unique indexes to support efficient data access, typically using B-tree structures that enable logarithmic-time lookups and range queries, though hash indexes may be used in specific equality-only scenarios for constant-time access. In systems like SQL Server, the primary key index defaults to clustered, organizing the table data physically around the key for optimal retrieval. Oracle and PostgreSQL create a unique B-tree index if none exists, ensuring enforcement without additional manual configuration.[11][2][32][33]
These indexes accelerate query performance through faster lookups and joins but introduce overhead on write operations, as each INSERT, UPDATE, or DELETE requires index maintenance, potentially leading to fragmentation in high-update environments. Fragmentation occurs when data pages split or become sparse, increasing I/O and slowing scans; regular reorganization or rebuilding mitigates this by compacting pages and updating statistics. In modern DBMS like Oracle 19c, automated index optimization features further reduce maintenance needs during heavy workloads.[34][35]
Violations of primary key constraints, such as duplicate values, raise an error and cause the offending statement to be rejected, so that invalid data is never stored; for example, PostgreSQL returns SQLSTATE 23505 for uniqueness breaches. Oracle raises ORA-00001 for unique constraint violations and allows exceptions to be logged into tables for analysis without full rollback. SQL Server similarly aborts the operation with error messages indicating the failed constraint.[11][32]
Maintenance of primary key indexes involves periodic rebuilding to address fragmentation or corruption, using commands like REINDEX in PostgreSQL and ALTER INDEX REBUILD in Oracle and SQL Server. Changing a primary key value is rare and often handled by deleting and reinserting the row, as direct updates may fail due to index dependencies and foreign key references. In Oracle 19c, advanced features like automatic index creation and real-time statistics enhance ongoing optimization.[36][37][34]
From a security perspective, primary keys serve as critical access points in multi-user environments, where role-based access control (RBAC) enforces privileges for creating, altering, or enforcing constraints, preventing unauthorized modifications. DBMS roles restrict who can insert or update primary key values, integrating with broader security models to protect data uniqueness and integrity.[32][38][39]
Related Concepts
Candidate Keys
A candidate key is a minimal set of attributes in a relational database table that uniquely identifies each tuple, satisfying the uniqueness and non-null properties required of a primary key, with the potential for multiple candidate keys per table.[8] Unlike a primary key, which is a single designated candidate key, all candidate keys ensure that no two rows share the same values in those attributes, and none can be omitted without losing uniqueness.[40]
Candidate keys are identified through analysis of functional dependencies (FDs) in entity-relationship modeling, where an attribute set is a candidate key if its closure includes all attributes in the relation and no proper subset does so.[41] For instance, in a student table with attributes {StudentID, Email, Birthdate, Name}, if FDs include StudentID → {Email, Birthdate, Name} and {Email, Birthdate} → {StudentID, Name}, then both {StudentID} and {Email, Birthdate} qualify as candidate keys, as each minimally determines all other attributes.[42]
From these candidate keys, database designers select one to serve as the primary key based on criteria such as stability (minimal change over time), simplicity (preferably a single attribute), and frequency of use in queries or relationships.[43] The chosen primary key supports efficient indexing and referential integrity, while the remaining candidates become alternate keys for additional uniqueness constraints.[44]
In database design, candidate keys play a crucial role in normalization; for example, second normal form (2NF) requires that every non-prime attribute be fully functionally dependent on each candidate key, eliminating partial dependencies on any subset of a composite candidate key.[45] Entity-relationship diagrams and tools like attribute closure algorithms help enumerate candidate keys during schema design to ensure relational integrity.[41]
Consider a bank account table with attributes {AccountNumber, CustomerID, BranchCode, Balance}. If FDs are AccountNumber → {CustomerID, BranchCode, Balance} and {CustomerID, BranchCode} → {AccountNumber, Balance}, then candidate keys include {AccountNumber} and {CustomerID, BranchCode}, each uniquely identifying an account without redundancy.[46]
Candidate keys differ from superkeys in that they are minimal: a superkey uniquely identifies tuples but may include extraneous attributes, whereas a candidate key has no proper subset that preserves uniqueness.[40] This minimality links candidate keys to dependency theory, where they form the basis for deriving all functional dependencies in a relation.[47]
Alternate Keys
An alternate key is a candidate key in a relational database that is not selected as the primary key, serving as an additional unique identifier for records.[48] It maintains the same uniqueness property as a candidate key but is designated for secondary identification purposes rather than primary referencing.[48]
In SQL implementations, alternate keys are enforced through UNIQUE constraints, which can be defined on single or multiple columns to prevent duplicate values.[49] For instance, on an existing table, the syntax is:
ALTER TABLE users ADD CONSTRAINT AK_email UNIQUE (email);
In a users table whose primary key is user_id (an auto-incrementing integer), an alternate key on username ensures unique user handles for login purposes:
CREATE TABLE users (
    user_id INT PRIMARY KEY AUTO_INCREMENT,
    username VARCHAR(50),
    email VARCHAR(100),
    CONSTRAINT AK_username UNIQUE (username)
);
Alternate keys can also be composite, such as (last_name, first_name, date_of_birth) for uniquely identifying individuals where no single field suffices.[49]
Although SQL standards allow foreign keys to reference columns under UNIQUE constraints (implementing alternate keys), strict relational models prefer referencing primary keys to uphold a clear hierarchical structure and avoid ambiguity in referential integrity.[52]
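A foreign key that references an alternate key rather than the primary key can be sketched with SQLite through Python's sqlite3 module. The setup is an assumption made for illustration: the users and posts tables are hypothetical, and SQLite needs foreign key checking switched on per connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves foreign key checks off by default
conn.execute(
    """
    CREATE TABLE users (
        user_id  INTEGER PRIMARY KEY,
        username TEXT NOT NULL UNIQUE     -- alternate key
    )
    """
)
conn.execute(
    """
    CREATE TABLE posts (
        post_id INTEGER PRIMARY KEY,
        author  TEXT NOT NULL REFERENCES users (username),  -- targets the UNIQUE column
        title   TEXT NOT NULL
    )
    """
)
conn.execute("INSERT INTO users (username) VALUES ('ana')")
conn.execute("INSERT INTO posts (author, title) VALUES ('ana', 'Hello')")

try:
    # No user has this username, so the reference cannot be satisfied.
    conn.execute("INSERT INTO posts (author, title) VALUES ('nobody', 'Orphan')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

post_count = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
print(orphan_rejected, post_count)
```

Referencing the username keeps the child table readable without a join, but a later change of username then has to be propagated to every referencing row, which is one practical reason for preferring the primary key as the target.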