Relational database
from Wikipedia

A relational database (RDB[1]) is a database based on the relational model of data, as proposed by E. F. Codd in 1970.[2]

A Relational Database Management System (RDBMS) is a type of database management system that stores data in a structured format using rows and columns.

Many relational database systems are equipped with the option of using SQL (Structured Query Language) for querying and updating the database.[3]

History

The concept of a relational database was defined by E. F. Codd at IBM in 1970. Codd introduced the term relational in his research paper "A Relational Model of Data for Large Shared Data Banks".[2] In this paper and later papers, he defined what he meant by relation. One well-known definition of what constitutes a relational database system is given by Codd's 12 rules.

However, no commercial implementations of the relational model conform to all of Codd's rules,[4] so the term has gradually come to describe a broader class of database systems, which at a minimum:

  1. Present the data to the user as relations (a presentation in tabular form, i.e. as a collection of tables with each table consisting of a set of rows and columns);
  2. Provide relational operators to manipulate the data in tabular form.

In 1974, IBM began developing System R, a research project to develop a prototype RDBMS.[5][6] The first system sold as an RDBMS was Multics Relational Data Store (June 1976).[7][8][citation needed] Oracle was released in 1979 by Relational Software, now Oracle Corporation.[9] Ingres and IBM BS12 followed. Other examples of an RDBMS include IBM Db2, SAP Sybase ASE, and Informix. In 1984, the first RDBMS for Macintosh began being developed, code-named Silver Surfer, and was released in 1987 as 4th Dimension and known today as 4D.[10]

The first systems that were relatively faithful implementations of the relational model were from:

  • University of Michigan – Micro DBMS (1969)[11]
  • Massachusetts Institute of Technology (1971)[12]
  • IBM UK Scientific Centre at Peterlee – IS1 (1970–72),[13] and its successor, PRTV (1973–79).[14]

The most common definition of an RDBMS is a product that presents a view of data as a collection of rows and columns, even if it is not based strictly upon relational theory. By this definition, RDBMS products typically implement some but not all of Codd's 12 rules.

A second school of thought argues that if a database does not implement all of Codd's rules (or the current understanding of the relational model, as expressed by Christopher J. Date, Hugh Darwen, and others), it is not relational. This view, shared by many theorists and other strict adherents to Codd's principles, would disqualify most DBMSs as not relational. For clarification, they often refer to some RDBMSs as truly-relational database management systems (TRDBMS), naming others pseudo-relational database management systems (PRDBMS).[citation needed]

As of 2009, most commercial relational DBMSs employ SQL as their query language.[15]

Alternative query languages have been proposed and implemented, notably the pre-1996 implementation of Ingres QUEL.

Relational model

A relational model organizes data into one or more tables (or "relations") of columns and rows, with a unique key identifying each row. Rows are also called records or tuples.[16] Columns are also called attributes. Generally, each table/relation represents one "entity type" (such as customer or product). The rows represent instances of that type of entity (such as "Lee" or "chair") and the columns represent values attributed to that instance (such as address or price).

For example, each row of a class table corresponds to a class, and a class corresponds to multiple students, so the relationship between the class table and the student table is "one-to-many".[17]
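
The one-to-many relationship above can be sketched with Python's built-in sqlite3 module (table and column names here are illustrative, not from the source):

```python
import sqlite3

# In-memory database for illustration only.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE classes (
        class_id INTEGER PRIMARY KEY,
        name     TEXT NOT NULL
    );
    CREATE TABLE students (
        student_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        class_id   INTEGER REFERENCES classes(class_id)  -- many students point at one class
    );
    INSERT INTO classes  VALUES (1, 'Algebra');
    INSERT INTO students VALUES (10, 'Lee', 1), (11, 'Kim', 1);
""")

# Each row in classes corresponds to many rows in students.
rows = con.execute("""
    SELECT c.name, COUNT(*)
    FROM classes c
    JOIN students s ON s.class_id = c.class_id
    GROUP BY c.class_id
""").fetchall()
print(rows)  # [('Algebra', 2)]
```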

Keys

Each row in a table has its own unique key. Rows in a table can be linked to rows in other tables by adding a column for the unique key of the linked row (such columns are known as foreign keys). Codd showed that data relationships of arbitrary complexity can be represented by a simple set of concepts.[2]

Part of this processing involves consistently being able to select or modify one and only one row in a table. Therefore, most physical implementations have a unique primary key (PK) for each row in a table. When a new row is written to the table, a new unique value for the primary key is generated; this is the key that the system uses primarily for accessing the table. System performance is optimized for PKs. Other, more natural keys may also be identified and defined as alternate keys (AK). Often several columns are needed to form an AK (this is one reason why a single integer column is usually made the PK). Both PKs and AKs have the ability to uniquely identify a row within a table. Additional technology may be applied to ensure a unique ID across the world, a globally unique identifier, when there are broader system requirements.

The primary keys within a database are used to define the relationships among the tables. When a PK migrates to another table, it becomes a foreign key (FK) in the other table. When each cell can contain only one value and the PK migrates into a regular entity table, this design pattern can represent either a one-to-one or one-to-many relationship. Most relational database designs resolve many-to-many relationships by creating an additional table that contains the PKs from both of the other entity tables – the relationship becomes an entity; the resolution table is then named appropriately and the two FKs are combined to form a PK. The migration of PKs to other tables is the second major reason why system-assigned integers are used normally as PKs; there is usually neither efficiency nor clarity in migrating a bunch of other types of columns.
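
A minimal sketch of resolving a many-to-many relationship with a resolution (junction) table, using sqlite3; the student/course schema is a hypothetical example, not taken from the source:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE students (student_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE courses  (course_id  INTEGER PRIMARY KEY, title TEXT);
    -- Resolution table: the two FKs are combined to form its PK.
    CREATE TABLE enrollments (
        student_id INTEGER REFERENCES students(student_id),
        course_id  INTEGER REFERENCES courses(course_id),
        PRIMARY KEY (student_id, course_id)
    );
    INSERT INTO students    VALUES (1, 'Lee'), (2, 'Kim');
    INSERT INTO courses     VALUES (100, 'Databases');
    INSERT INTO enrollments VALUES (1, 100), (2, 100);
""")

# The relationship itself has become an entity we can query.
enrolled = con.execute("""
    SELECT s.name
    FROM students s
    JOIN enrollments e ON e.student_id = s.student_id
    WHERE e.course_id = 100
    ORDER BY s.name
""").fetchall()
print(enrolled)  # [('Kim',), ('Lee',)]
```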

Relationships

Relationships are a logical connection between different tables (entities), established on the basis of interaction among these tables. These relationships can be modelled as an entity-relationship model.

Transactions

In order for a database management system (DBMS) to operate efficiently and accurately, it must use ACID transactions.[18][19][20]

Stored procedures

Part of the programming within an RDBMS is accomplished using stored procedures (SPs). Often procedures can be used to greatly reduce the amount of information transferred within and outside of a system. For increased security, the system design may grant access only to the stored procedures and not directly to the tables. Fundamental stored procedures contain the logic needed to insert new data and update existing data. More complex procedures may be written to implement additional rules and logic related to processing or selecting the data.

Terminology

The relational database was first defined in June 1970 by Edgar Codd, of IBM's San Jose Research Laboratory.[2] Codd's view of what qualifies as an RDBMS is summarized in Codd's 12 rules. A relational database has become the predominant type of database. Other models besides the relational model include the hierarchical database model and the network model.

The table below summarizes some of the most important relational database terms and the corresponding SQL term:

SQL term           | Relational database term | Description
Row                | Tuple or record          | A data set representing a single item
Column             | Attribute or field       | A labeled element of a tuple, e.g. "Address" or "Date of birth"
Table              | Relation or base relvar  | A set of tuples sharing the same attributes; a set of columns and rows
View or result set | Derived relvar           | Any set of tuples; a data report from the RDBMS in response to a query

Relations or tables

In a relational database, a relation is a set of tuples that have the same attributes. A tuple usually represents an object and information about that object. Objects are typically physical objects or concepts. A relation is usually described as a table, which is organized into rows and columns. All the data referenced by an attribute are in the same domain and conform to the same constraints.

The relational model specifies that the tuples of a relation have no specific order and that the tuples, in turn, impose no order on the attributes. Applications access data by specifying queries, which use operations such as select to identify tuples, project to identify attributes, and join to combine relations. Relations can be modified using the insert, delete, and update operators. New tuples can supply explicit values or be derived from a query. Similarly, queries identify tuples for updating or deleting.

Tuples by definition are unique. If the tuple contains a candidate or primary key then obviously it is unique; however, a primary key need not be defined for a row or record to be a tuple. The definition of a tuple requires that it be unique, but does not require a primary key to be defined. Because a tuple is unique, its attributes by definition constitute a superkey.

Base and derived relations

All data are stored and accessed via relations. Relations that store data are called "base relations", and in implementations are called "tables". Other relations do not store data, but are computed by applying relational operations to other relations. These relations are sometimes called "derived relations". In implementations these are called "views" or "queries". Derived relations are convenient in that they act as a single relation, even though they may draw information from several relations. Also, derived relations can be used as an abstraction layer.

Domain

A domain describes the set of possible values for a given attribute, and can be considered a constraint on the value of the attribute. Mathematically, attaching a domain to an attribute means that any value for the attribute must be an element of the specified set. The character string "ABC", for instance, is not in the integer domain, but the integer value 123 is. Another example of a domain describes the possible values for the field "CoinFace" as ("Heads", "Tails"). So, the field "CoinFace" will not accept input values like (0, 1) or (H, T).

Constraints

Constraints make it possible to further restrict the domain of an attribute. For instance, a constraint can restrict a given integer attribute to values between 1 and 10. Constraints provide one method of implementing business rules in the database and support subsequent data use within the application layer. SQL implements constraint functionality in the form of check constraints. Constraints restrict the data that can be stored in relations. They are usually defined using expressions that result in a Boolean value, indicating whether or not the data satisfies the constraint. Constraints can apply to single attributes, to a tuple (restricting combinations of attributes), or to an entire relation. Since every attribute has an associated domain, constraints on the values an attribute may take are known as domain constraints. The two principal rules for the relational model are known as entity integrity and referential integrity.
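
Both kinds of restriction described above can be sketched as SQL CHECK constraints via sqlite3 (the table and column names are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE ratings (
        item  TEXT,
        score INTEGER CHECK (score BETWEEN 1 AND 10),    -- range constraint
        face  TEXT    CHECK (face IN ('Heads', 'Tails')) -- domain-style constraint
    )
""")
con.execute("INSERT INTO ratings VALUES ('chair', 7, 'Heads')")  # satisfies both checks

try:
    con.execute("INSERT INTO ratings VALUES ('chair', 42, 'Heads')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True  # the range check rejects 42
print(rejected)  # True
```

The Boolean expression inside each CHECK is evaluated per row; a false result makes the insert fail, so only the valid row is stored.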

Primary key

Every relation/table has a primary key, this being a consequence of a relation being a set.[21] A primary key uniquely specifies a tuple within a table. While natural attributes (attributes used to describe the data being entered) are sometimes good primary keys, surrogate keys are often used instead. A surrogate key is an artificial attribute assigned to an object which uniquely identifies it (for instance, in a table of information about students at a school they might all be assigned a student ID in order to differentiate them). The surrogate key has no intrinsic (inherent) meaning, but rather is useful through its ability to uniquely identify a tuple. Another common occurrence, especially with regard to N:M cardinality, is the composite key. A composite key is a key made up of two or more attributes within a table that (together) uniquely identify a record.[22]
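
A surrogate key in action, assuming SQLite's behavior that an INTEGER PRIMARY KEY is system-assigned when omitted (schema is a hypothetical sketch):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE students (
        student_id INTEGER PRIMARY KEY,  -- surrogate key, assigned by the system
        name TEXT NOT NULL
    )
""")

# Two students with identical natural attributes still get distinct identities.
first_id  = con.execute("INSERT INTO students (name) VALUES ('Lee')").lastrowid
second_id = con.execute("INSERT INTO students (name) VALUES ('Lee')").lastrowid
print(first_id, second_id)  # 1 2
```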

Foreign key

Foreign key refers to a field in a relational table that matches the primary key column of another table. It relates the two keys. Foreign keys need not have unique values in the referencing relation. A foreign key can be used to cross-reference tables, and it effectively uses the values of attributes in the referenced relation to restrict the domain of one or more attributes in the referencing relation. The concept is described formally as: "For all tuples in the referencing relation projected over the referencing attributes, there must exist a tuple in the referenced relation projected over those same attributes such that the values in each of the referencing attributes match the corresponding values in the referenced attributes."
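
The restriction the formal statement describes can be observed directly with sqlite3, assuming SQLite's opt-in foreign-key enforcement (the schema is illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
con.executescript("""
    CREATE TABLE classes (class_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE students (
        student_id INTEGER PRIMARY KEY,
        class_id   INTEGER REFERENCES classes(class_id)
    );
    INSERT INTO classes VALUES (1, 'Algebra');
""")

con.execute("INSERT INTO students VALUES (10, 1)")  # OK: class 1 exists
try:
    con.execute("INSERT INTO students VALUES (11, 999)")  # no class 999
    violated = False
except sqlite3.IntegrityError:
    violated = True  # referencing value must exist in the referenced relation
print(violated)  # True
```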

Stored procedures

A stored procedure is executable code that is associated with, and generally stored in, the database. Stored procedures usually collect and customize common operations, like inserting a tuple into a relation, gathering statistical information about usage patterns, or encapsulating complex business logic and calculations. Frequently they are used as an application programming interface (API) for security or simplicity. Implementations of stored procedures on SQL RDBMSs often allow developers to take advantage of procedural extensions (often vendor-specific) to the standard declarative SQL syntax. Stored procedures are not part of the relational database model, but all commercial implementations include them.

Index

An index is one way of providing quicker access to data. Indices can be created on any combination of attributes on a relation. Queries that filter using those attributes can find matching tuples directly using the index (similar to Hash table lookup), without having to check each tuple in turn. This is analogous to using the index of a book to go directly to the page on which the information you are looking for is found, so that you do not have to read the entire book to find what you are looking for. Relational databases typically supply multiple indexing techniques, each of which is optimal for some combination of data distribution, relation size, and typical access pattern. Indices are usually implemented via B+ trees, R-trees, and bitmaps. Indices are usually not considered part of the database, as they are considered an implementation detail, though indices are usually maintained by the same group that maintains the other parts of the database. The use of efficient indexes on both primary and foreign keys can dramatically improve query performance. This is because B-tree indexes result in query times proportional to log(n) where n is the number of rows in a table and hash indexes result in constant time queries (no size dependency as long as the relevant part of the index fits into memory).
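
The effect of adding an index can be seen in SQLite's query plan, assuming `EXPLAIN QUERY PLAN` output (whose exact wording varies by SQLite version); the table is a hypothetical example:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, city TEXT)")
con.executemany("INSERT INTO people (city) VALUES (?)",
                [("Paris",), ("Oslo",), ("Paris",)])

# Without an index, filtering on city must check each tuple in turn.
plan_before = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM people WHERE city = 'Oslo'").fetchall()

# With an index, the same query can find matching tuples directly.
con.execute("CREATE INDEX idx_people_city ON people (city)")
plan_after = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM people WHERE city = 'Oslo'").fetchall()

print(plan_before[0][3])  # e.g. 'SCAN people'
print(plan_after[0][3])   # e.g. 'SEARCH people USING INDEX idx_people_city (city=?)'
```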

Relational operations

Queries made against the relational database, and the derived relvars in the database, are expressed in a relational calculus or a relational algebra. In his original relational algebra, Codd introduced eight relational operators in two groups of four operators each. The first four operators were based on the traditional mathematical set operations:

  • The union operator (∪) combines the tuples of two relations and removes all duplicate tuples from the result. The relational union operator is equivalent to the SQL UNION operator.
  • The intersection operator (∩) produces the set of tuples that two relations share in common. Intersection is implemented in SQL in the form of the INTERSECT operator.
  • The set difference operator (-) acts on two relations and produces the set of tuples from the first relation that do not exist in the second relation. Difference is implemented in SQL in the form of the EXCEPT or MINUS operator.
  • The cartesian product (X) of two relations is a join that is not restricted by any criteria, resulting in every tuple of the first relation being matched with every tuple of the second relation. The cartesian product is implemented in SQL as the Cross join operator.
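
The four set-based operators above map directly onto SQL, as this sqlite3 sketch with two illustrative single-column relations shows:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE r (x INTEGER);
    CREATE TABLE s (x INTEGER);
    INSERT INTO r VALUES (1), (2), (3);
    INSERT INTO s VALUES (2), (3), (4);
""")

def q(sql):
    return [row[0] for row in con.execute(sql)]

union        = q("SELECT x FROM r UNION SELECT x FROM s ORDER BY x")      # ∪, duplicates removed
intersection = q("SELECT x FROM r INTERSECT SELECT x FROM s ORDER BY x")  # ∩
difference   = q("SELECT x FROM r EXCEPT SELECT x FROM s ORDER BY x")     # set difference
product_size = q("SELECT count(*) FROM r CROSS JOIN s")                   # cartesian product: 3 × 3 tuples

print(union, intersection, difference, product_size)
# [1, 2, 3, 4] [2, 3] [1] [9]
```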

The remaining operators proposed by Codd involve special operations specific to relational databases:

  • The selection, or restriction, operation (σ) retrieves tuples from a relation, limiting the results to only those that meet a specific criterion, i.e. a subset in terms of set theory. The SQL equivalent of selection is the SELECT query statement with a WHERE clause.
  • The projection operation (π) extracts only the specified attributes from a tuple or set of tuples.
  • The join operation defined for relational databases is often referred to as a natural join (⋈). In this type of join, two relations are connected by their common attributes. MySQL's approximation of a natural join is the INNER JOIN operator. In SQL, an INNER JOIN prevents a cartesian product from occurring when there are two tables in a query. For each table added to an SQL query, one additional INNER JOIN is added to prevent a cartesian product. Thus, for N tables in an SQL query, there must be N−1 INNER JOINs to prevent a cartesian product.
  • The relational division (÷) operation is a slightly more complex operation and essentially involves using the tuples of one relation (the dividend) to partition a second relation (the divisor). The relational division operator is effectively the opposite of the cartesian product operator (hence the name).
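
Selection (σ), projection (π), and join (⋈) combine naturally in a single SQL query, sketched here with sqlite3 over two illustrative tables:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE products (product_id INTEGER PRIMARY KEY, name TEXT, price REAL);
    CREATE TABLE stock    (product_id INTEGER, qty INTEGER);
    INSERT INTO products VALUES (1, 'chair', 40.0), (2, 'desk', 120.0);
    INSERT INTO stock    VALUES (1, 5), (2, 0);
""")

rows = con.execute("""
    SELECT p.name, s.qty                           -- projection: keep only these attributes
    FROM products p
    JOIN stock s ON s.product_id = p.product_id    -- join on the common attribute
    WHERE p.price < 100                            -- selection: restrict tuples
""").fetchall()
print(rows)  # [('chair', 5)]
```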

Other operators have been introduced or proposed since Codd's introduction of the original eight including relational comparison operators and extensions that offer support for nesting and hierarchical data, among others.

Normalization

Normalization was first proposed by Codd as an integral part of the relational model. It encompasses a set of procedures designed to eliminate non-simple domains (non-atomic values) and the redundancy (duplication) of data, which in turn prevents data manipulation anomalies and loss of data integrity. The most common forms of normalization applied to databases are called the normal forms.
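
One way to see the anomaly that normalization prevents, sketched with sqlite3 (the unnormalized and normalized schemas are hypothetical examples):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Unnormalized: the customer's city is repeated on every order, so a
# change of address must touch many rows (an update anomaly waiting to happen).
con.executescript("""
    CREATE TABLE orders_flat (order_id INTEGER, customer TEXT, city TEXT);
    INSERT INTO orders_flat VALUES (1, 'Lee', 'Oslo'), (2, 'Lee', 'Oslo');
""")

# Normalized: the repeated fact moves to its own relation, referenced by key.
con.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE orders    (order_id INTEGER PRIMARY KEY,
                            customer_id INTEGER REFERENCES customers(customer_id));
    INSERT INTO customers VALUES (1, 'Lee', 'Oslo');
    INSERT INTO orders    VALUES (1, 1), (2, 1);
""")

# The city now lives in exactly one row, so one UPDATE keeps all orders consistent.
con.execute("UPDATE customers SET city = 'Bergen' WHERE customer_id = 1")
cities = con.execute("""
    SELECT DISTINCT c.city
    FROM orders o JOIN customers c ON c.customer_id = o.customer_id
""").fetchall()
print(cities)  # [('Bergen',)]
```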

RDBMS

[Figure: The general structure of a relational database]

Connolly and Begg define database management system (DBMS) as a "software system that enables users to define, create, maintain and control access to the database".[23] RDBMS is an extension of that initialism that is sometimes used when the underlying database is relational.

An alternative definition for a relational database management system is a database management system (DBMS) based on the relational model. Most databases in widespread use today are based on this model.[24]

RDBMSs have been a common option for the storage of information in databases used for financial records, manufacturing and logistical information, personnel data, and other applications since the 1980s. Relational databases have often replaced legacy hierarchical and network databases because RDBMSs were easier to implement and administer. Nonetheless, relational databases faced continued, largely unsuccessful challenges from object database management systems in the 1980s and 1990s (which were introduced in an attempt to address the so-called object–relational impedance mismatch between relational databases and object-oriented application programs), as well as from XML database management systems in the 1990s.[25] However, due to the rise of technologies such as horizontal scaling of computer clusters, NoSQL databases have more recently become popular as an alternative to RDBMS databases.[26]

Distributed relational databases

Distributed Relational Database Architecture (DRDA) was designed by a workgroup within IBM in the period 1988 to 1994. DRDA enables network connected relational databases to cooperate to fulfill SQL requests.[27][28] The messages, protocols, and structural components of DRDA are defined by the Distributed Data Management Architecture.

List of database engines

According to DB-Engines, in December 2024 the most popular relational database systems were:[29]

  1. Oracle Database
  2. MySQL
  3. Microsoft SQL Server
  4. PostgreSQL
  5. Snowflake
  6. IBM Db2
  7. SQLite
  8. Microsoft Access
  9. Databricks
  10. MariaDB

According to research company Gartner, in 2011, the five leading proprietary software relational database vendors by revenue were Oracle (48.8%), IBM (20.2%), Microsoft (17.0%), SAP including Sybase (4.6%), and Teradata (3.7%).[30]

from Grokipedia
A relational database is a type of database management system that organizes data into structured tables composed of rows and columns, where each row represents a record (or tuple) and each column represents an attribute, enabling the establishment of relationships between data points across tables through keys. This model ensures data integrity, consistency, and efficient retrieval by adhering to principles such as normalization to minimize redundancy and anomalies. The relational model was first proposed by IBM researcher E. F. Codd in his seminal 1970 paper, "A Relational Model of Data for Large Shared Data Banks", which introduced the concept of representing data as mathematical relations to simplify querying and maintenance in large-scale systems. Codd's framework shifted away from earlier hierarchical and network models, emphasizing declarative querying over procedural navigation, which laid the groundwork for modern database technology. Key features of relational databases include the use of primary keys to uniquely identify rows within a table and foreign keys to link tables, enforcing referential integrity and supporting complex joins for data retrieval. They are typically managed by a relational database management system (RDBMS), such as MySQL, Oracle Database, Microsoft SQL Server, and PostgreSQL, or by managed cloud services like Amazon RDS (AWS) and Google Cloud SQL, which provide tools for data definition, manipulation, and control. The adoption of relational databases accelerated in the late 1970s and 1980s, with IBM's System R serving as an influential prototype that demonstrated practical implementation, leading to the first commercial RDBMS releases, including Oracle in 1979. Today, relational databases remain foundational for applications requiring ACID (Atomicity, Consistency, Isolation, Durability) compliance, such as in finance, e-commerce, and enterprise resource planning, handling vast datasets while supporting scalability through features like indexing and partitioning.

Overview

Definition and Principles

A relational database is a type of database management system that organizes data into relations, which are tabular structures consisting of rows (tuples) and columns (attributes), adhering to the relational model introduced by E. F. Codd in 1970. This model represents data as sets of relations where each relation captures entities and their associations through shared attributes, enabling efficient storage, retrieval, and manipulation without reliance on physical storage details or navigational paths. The foundational principles of relational databases emphasize data independence, data integrity, and declarative querying. Logical data independence ensures that changes to the logical schema, such as adding new relations, do not affect application programs, while physical data independence shields users from alterations in storage structures or access methods. Data integrity is enforced through constraints like primary keys, which uniquely identify each tuple in a relation, and referential integrity rules that maintain consistency across relations. Querying relies on set-based operations, such as selection, projection, and join, which treat data as mathematical sets, allowing users to specify what data is needed without detailing how to retrieve it. In contrast to earlier hierarchical and network models, which organize data in tree-like or graph structures requiring explicit navigation along predefined paths, the relational model uses a flat, tabular format that promotes simplicity and flexibility. Hierarchical models limit relationships to parent-child hierarchies, while network models (such as CODASYL) allow more complex linkages but often lead to access dependencies and redundancy; the relational approach avoids these by providing a declarative, high-level interface that abstracts away implementation details.

Applications and Advantages

Relational databases find extensive application across diverse sectors due to their ability to manage structured data efficiently. In business environments, they underpin customer relationship management (CRM) and enterprise resource planning (ERP) systems, which use them to handle structured processes like customer interactions, inventory tracking, and operations. Financial services leverage relational databases for secure transaction processing, account management, and compliance reporting in banking and payment systems, where data integrity is paramount. Web applications, including e-commerce platforms, rely on them to store and retrieve user profiles, product catalogs, and order histories through relational links. Additionally, managed cloud services such as Amazon RDS and Google Cloud SQL offer relational database capabilities with simplified administration and support for engines including MySQL and PostgreSQL. In the no-code/low-code space, platforms like Airtable provide relational capabilities through linked records and a spreadsheet-like interface, enabling users to manage relational data without traditional programming. In scientific research, relational databases organize experimental results, patient records, and metadata, as seen in healthcare studies where they enable consistent querying and analysis of structured datasets. A primary advantage of relational databases is their adherence to ACID properties (atomicity, consistency, isolation, and durability), which ensure reliable and predictable transaction handling, minimizing errors in critical operations like financial transfers. The use of standardized SQL provides portability, allowing queries and schemas to transfer between systems with relatively little change, while supporting complex operations such as joins to relate data across tables and aggregations for analytical insights. For structured data, they offer scalability through techniques like indexing, sharding, and vertical scaling, enabling growth in enterprise settings without compromising performance.
Despite these strengths, relational databases have limitations when applied to certain data types; they excel with structured information but are less suitable for unstructured data, such as free-form documents or semi-structured formats, due to rigid schemas that require predefined structures. Similarly, in high-velocity environments involving real-time data streams, their emphasis on ACID compliance can introduce overhead, potentially slowing ingestion compared to more flexible alternatives. As of 2025, relational databases power the majority of enterprise applications, with the global market projected to reach $82.95 billion, reflecting their enduring dominance in structured data environments. Systems like MySQL and PostgreSQL alone account for over 57% of developer usage in surveys, highlighting their widespread adoption.

History

Origins and Theoretical Foundations

The relational model originated from the work of E. F. Codd, a mathematician at IBM's San Jose Research Laboratory, who began developing the concept in 1969 while addressing challenges in managing large-scale shared data systems. In June 1970, Codd published his seminal paper, "A Relational Model of Data for Large Shared Data Banks," in Communications of the ACM, introducing a data organization based on mathematical relations to enable shared access to extensive formatted data banks without exposing users to underlying storage details. This work was motivated by the limitations of prevailing navigational database systems, such as hierarchical (tree-structured) and network models like IBM's IMS, which enforced rigid physical linkages, ordering, and access paths that made retrieval inflexible and dependent on specific program knowledge of the data structure. Codd argued that these systems led to program failures when data organization changed, as applications had to navigate predefined paths, resulting in high maintenance costs and inefficiency for large shared environments. Codd's model sought to overcome these issues through key theoretical motivations, including the elimination of redundancy to prevent inconsistencies and wasted storage, the assurance of data independence so that changes in physical representation did not affect user queries or applications, and the representation of data using predicate logic for precise, declarative querying. By treating data as relations, that is, n-ary sets of tuples, he proposed a "universal data sublanguage" grounded in first-order predicate calculus, allowing users to specify what data they needed rather than how to retrieve it, thus insulating applications from structural modifications. This approach addressed redundancy by defining normal forms in which relations minimized derivable projections, ensuring that data could be reconstructed without duplication while maintaining logical consistency.
The theoretical foundations drew heavily from mathematical disciplines, particularly set theory for modeling relations as mathematical sets and predicate logic for query formulation and integrity enforcement. Codd's innovations at IBM built on these principles to create a framework that prioritized user protection from data organization details, stating that "future users of large data banks must be insulated from any changes in the structure of data which are made possible by improvements in base hardware and software technology." To further refine the criteria for true relational systems, Codd later proposed his 12 rules (often counted as 13, including Rule 0, the foundation rule) in 1985, outlining essential properties for a relational database management system, such as guaranteed access via logical addressing and support for view updating.

Commercial Development and Adoption

The development of commercial relational database management systems (RDBMS) began in the late 1970s, transitioning from research prototypes to market-ready products. IBM's System R, initiated in 1974 as an internal research project, served as a key prototype that demonstrated the feasibility of relational databases using SQL as the query language, influencing subsequent commercial efforts. In 1979, Relational Software (now Oracle Corporation) released the first commercially available RDBMS, initially known as Oracle Version 2, which marked a pivotal shift toward enterprise adoption by offering structured query capabilities for business applications. IBM followed with SQL/DS in 1981, targeted at mainframe environments, and later DB2 in 1983, which became a cornerstone for large-scale data processing in corporations. Standardization efforts solidified the relational model's commercial viability. In 1986, the American National Standards Institute (ANSI) published the first SQL standard (SQL-86, or ANSI X3.135), establishing a common syntax for querying relational databases and promoting interoperability across vendors. This was adopted internationally by the ISO in 1987 as SQL-87, with subsequent revisions, such as SQL-92 for enhanced integrity and SQL:1999 for object-relational features, evolving the standard to address growing complexities in data management, culminating in SQL:2023, which includes support for JSON and property graphs. The 1980s saw an enterprise boom in RDBMS adoption, driven by the need for reliable data handling in sectors like finance and manufacturing, with products from Oracle, IBM, and others powering mission-critical systems. In the 1990s, open-source alternatives accelerated widespread use: MySQL was first released in 1995, gaining popularity for web applications due to its simplicity and performance, while PostgreSQL emerged in 1996 from the academic Postgres project, offering advanced features like extensibility for complex queries.
The 2000s integrated relational databases with cloud computing, enabling scalable deployments through services like Amazon RDS (launched in 2009), which facilitated on-demand access and reduced infrastructure costs for businesses. By 2025, relational databases maintain dominance in the DBMS market, accounting for approximately 64% of revenue in 2023 according to industry analyses, underscoring their enduring role in handling structured data amid the rise of hybrid environments.

Relational Model

Core Concepts and Terminology

In the relational model, a relation is defined as a set of ordered n-tuples, where each tuple consists of values drawn from specified domains, making the relation mathematically a subset of the Cartesian product of those n domains. This structure is typically represented as a table, with no inherent ordering among the tuples or among the attributes, ensuring that the relation remains a set without duplicates. A tuple corresponds to a single row in the relation, forming an n-tuple in which the i-th component belongs to the i-th domain. Each attribute represents a column in the relation, named and associated with a specific domain that defines the allowable values for that position across all tuples. The degree of a relation is the number of attributes (n), while its cardinality is the number of distinct tuples it contains. The relational model distinguishes between the schema, which defines the logical structure including relations, attributes, and their domains, and the instance, which is the actual collection of tuples at any given time. Data storage relies on value-based representation, where relationships between data are established solely through shared attribute values rather than physical pointers, ordering, or hierarchical links, promoting data independence. To handle missing or inapplicable information, E.F. Codd later extended the model to include null values, which represent either "value at present unknown" or "property inapplicable", distinct from empty strings or zeros, and integrated into a three-valued logic for queries.

Mathematical Basis

The relational model is grounded in set theory and first-order predicate logic, providing a formal framework for data representation and manipulation. At its core, a domain $D_i$ is defined as a set of atomic values that can be assigned to attributes, ensuring type consistency across the database. A relation $R$ of degree $n$ is formally a subset of the Cartesian product of $n$ domains, expressed as $R \subseteq D_1 \times D_2 \times \cdots \times D_n$, where each element of $R$ corresponds to a valid combination of values from these domains. A tuple in this model is a finite, ordered list of $n$ values, with the $i$-th value drawn from domain $D_i$, representing a single record or fact. The relation $R$ itself is a set of such tuples, inherently enforcing uniqueness since sets do not permit duplicates, which eliminates redundant tuples at the mathematical level. Formally, a relation comprises a heading and a body: the heading specifies the attribute names paired with their respective domains (e.g., $R(A_1 : D_1, A_2 : D_2, \dots, A_n : D_n)$), defining the structure, while the body is the finite set of tuples populating that structure at any given time. The foundation in predicate logic enables queries to be formulated as logical predicates applied over relations, allowing declarative expressions of retrieval and manipulation in terms of truth-valued statements. This approach, rooted in an applied predicate calculus, supports relational completeness: any query expressible in the relational calculus can be represented within the model.

Data Organization

Relations, Tuples, and Attributes

In the relational model, a relation is conceptualized as a table that organizes data into rows and columns, where the columns represent attributes defining the characteristics of the stored entities, and the rows, known as tuples, capture instances or occurrences of those entities. This structure ensures that data is stored in a declarative manner, independent of physical implementation details, allowing users to interact with it through logical representations. To illustrate, consider a simple employee relation named Employees with three attributes: EmployeeID (an integer identifier), Name (a character string for the employee's full name), and Department (a string indicating the employee's department). Each tuple in this relation would consist of a unique combination of values for these attributes, such as (101, "Alice Johnson", "Engineering"), representing one employee's details without implying any order among the tuples. This tabular format facilitates straightforward comprehension and manipulation of data relationships. Relations are categorized into base relations, which store the actual persistent data in the database (often called base tables in SQL implementations), and derived relations, such as views, which are virtual and computed dynamically from queries on base relations or other views without storing data separately. Base relations form the foundational storage, while derived ones provide flexible, on-demand perspectives of the data. Each attribute in a relation must hold only atomic (indivisible, simple) values from its defined domain, prohibiting nested structures like lists or sets within a single cell to maintain the model's simplicity and ensure first normal form compliance. This atomicity requirement, where domains specify the allowable value types (e.g., integers or strings), supports efficient querying and integrity.
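The Employees relation described above can be sketched as a SQL base table; the declaration below is illustrative rather than drawn from any particular system.

```sql
-- Illustrative base relation from the example above.
CREATE TABLE Employees (
    EmployeeID INTEGER PRIMARY KEY,   -- unique identifier (atomic integer domain)
    Name       VARCHAR(100) NOT NULL, -- full name (character-string domain)
    Department VARCHAR(50)            -- department name (character-string domain)
);

-- The tuple (101, 'Alice Johnson', 'Engineering') becomes one row:
INSERT INTO Employees (EmployeeID, Name, Department)
VALUES (101, 'Alice Johnson', 'Engineering');
```

A derived relation over the same data would be defined with CREATE VIEW rather than CREATE TABLE, storing only the defining query.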

Domains and Schemas

In the relational model, a domain represents the set of permissible atomic values from which the values of a specific attribute are drawn, ensuring consistency and validity across relations. This concept, introduced by E.F. Codd, defines domains as finite or infinite sets of values, such as the domain of integers for numeric attributes or strings for textual ones, preventing invalid entries like non-numeric values in an age field. For instance, the domain for an employee's age attribute might be restricted to integers between 18 and 65, limiting values to that range while excluding extraneous values like negative numbers or decimals. A schema in a relational database outlines the structural blueprint, comprising relation schemas that specify the attributes of each table along with their associated domains, and the overall database schema as the integrated collection of these relation schemas, including definitions for views, indexes, and constraints where applicable. Relation schemas thus serve as the foundational descriptors, naming the table and mapping each attribute to its domain, while the database schema provides a holistic view of inter-table organization without delving into instances. This separation allows for abstract design independent of physical storage, facilitating maintenance and scalability in large systems. Modern relational database management systems (RDBMS) implement domains through type systems, offering built-in data types such as INTEGER for whole numbers, VARCHAR for variable-length strings, and DATE for temporal values, which align with the abstract domains of the relational model by enforcing value ranges and formats at the storage level. Users can extend these with user-defined domains, created via SQL statements like CREATE DOMAIN, which base a new type on an existing one while adding custom constraints, such as CHECK conditions to validate specific rules beyond standard types. For example, a user-defined domain for monetary values might build on NUMERIC with a precision of two decimal places and a non-negative constraint, promoting reusability across attributes.
Schema evolution addresses the need to modify these structures over time in response to changing requirements, involving operations like adding or dropping attributes, altering domains, or renaming relations, often managed through versioning to track historical states and automate migrations. In practice, tools and protocols enable forward and backward compatibility, allowing applications to query evolving schemas without breaking, as demonstrated in industrial case studies where schema changes were applied incrementally to minimize downtime in production environments. This process underscores the relational model's flexibility, though it requires careful planning to preserve data integrity during transitions.
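The monetary-value domain described above can be sketched with the standard CREATE DOMAIN statement (implemented, for example, in PostgreSQL); the domain name money_amount and the table are hypothetical.

```sql
-- Hypothetical domain: non-negative monetary values, two decimal places.
CREATE DOMAIN money_amount AS NUMERIC(12, 2)
    CHECK (VALUE >= 0);

-- The domain is then reusable across attributes:
CREATE TABLE invoices (
    invoice_id INTEGER PRIMARY KEY,
    total      money_amount          -- inherits the type and CHECK rule
);
```

Any insert or update violating the domain's CHECK condition, such as a negative total, is rejected at the storage level.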

Integrity Mechanisms

Keys and Relationships

In the relational model, keys are essential attributes or sets of attributes that ensure uniqueness within a relation and facilitate connections between relations. A superkey is any set of one or more attributes that uniquely identifies each tuple in a relation, allowing no two tuples to share the same values for that set. A candidate key is a minimal superkey, meaning no proper subset of its attributes is itself a superkey; multiple candidate keys may exist for a given relation, such as both employee ID and a combination of name and birthdate uniquely identifying an employee. The primary key is the candidate key selected to serve as the unique identifier for tuples in the relation, with the choice often guided by factors like simplicity and stability; for instance, in an employee relation, employee ID might be chosen as the primary key over a composite of name and address. A foreign key is an attribute or set of attributes in one relation that matches the primary key (or a candidate key) of another relation, establishing a link between them without duplicating data. For example, in a department relation with department ID as the primary key, an employee relation might include department ID as a foreign key to indicate which department each employee belongs to. Foreign keys enable the relational model to represent associations between entities while preserving data integrity through referential constraints, ensuring that referenced values exist in the target relation. Keys define the types of relationships between relations, which describe how tuples in one relation correspond to those in another. A one-to-one relationship occurs when each tuple in one relation is associated with at most one tuple in another, and vice versa; this can be implemented by placing the primary key of one relation as a foreign key in the other, often with mutual foreign keys or by merging relations if appropriate. For instance, a person relation might have a one-to-one link to a passport relation, where passport number serves as both primary and foreign key.
A one-to-many relationship exists when each tuple in one relation (the "one" side) can be associated with zero, one, or multiple tuples in another (the "many" side), but each tuple on the "many" side links to at most one tuple on the "one" side; this is typically realized by placing a foreign key in the "many" relation that references the primary key of the "one" relation. In a classic example, a department relation (one side) relates to an employee relation (many side), where employees' department IDs as foreign keys point to the department's primary key, allowing one department to have multiple employees but each employee to belong to only one department. A many-to-many relationship arises when tuples in one relation can associate with multiple tuples in another, and vice versa; direct implementation is avoided to prevent redundancy, instead using a junction (or associative) relation that contains foreign keys referencing the primary keys of both original relations, effectively decomposing the many-to-many into two one-to-many relationships. For example, a student relation and a course relation might connect via an enrollment junction relation with student ID and course ID as foreign keys, capturing multiple enrollments per student and multiple students per course. This structure supports efficient querying and updates while maintaining normalization principles.
Key Type | Definition | Example in Employee Relation
Superkey | Set of attributes uniquely identifying tuples (may include extras) | {EmployeeID, Name, Address}
Candidate Key | Minimal superkey (no proper subset is a superkey) | {EmployeeID}, {SSN}
Primary Key | Candidate key selected for unique identification | EmployeeID
Foreign Key | References the primary key of another relation | DepartmentID (referencing Departments table)
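The department/employee (one-to-many) and student/course (many-to-many) relationships described above can be sketched in SQL DDL; all table and column names are illustrative.

```sql
CREATE TABLE Departments (
    DepartmentID INTEGER PRIMARY KEY,   -- primary key of the "one" side
    DeptName     VARCHAR(50) NOT NULL
);

CREATE TABLE Employees (
    EmployeeID   INTEGER PRIMARY KEY,
    Name         VARCHAR(100) NOT NULL,
    -- foreign key on the "many" side; each employee maps to one department
    DepartmentID INTEGER REFERENCES Departments (DepartmentID)
);

-- Junction relation decomposing a many-to-many into two one-to-many links:
CREATE TABLE Enrollment (
    StudentID INTEGER REFERENCES Students (StudentID),
    CourseID  INTEGER REFERENCES Courses (CourseID),
    PRIMARY KEY (StudentID, CourseID)   -- composite key: one row per enrollment
);
```

The Enrollment table assumes Students and Courses tables with the referenced primary keys already exist.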

Constraints and Data Integrity

In relational databases, constraints are rules enforced on data to maintain accuracy, consistency, and validity across relations. These mechanisms prevent invalid states by restricting operations that would violate predefined conditions, such as insertions, updates, or deletions that introduce inconsistencies. Entity integrity is a fundamental constraint ensuring that the primary key of every tuple in a relation is neither null nor duplicated, thereby guaranteeing that each entity can be uniquely identified without ambiguity. This rule applies specifically to primary key attributes, prohibiting nulls to uphold the relational model's requirement for identifiable records. Referential integrity maintains consistency between related relations by requiring that the value of a foreign key in one relation either matches an existing key value in the referenced relation or is null, thus avoiding orphaned records or invalid references. Violations of this constraint occur during operations like deleting a referenced tuple or updating a key to an unmatched value. To handle such violations, database systems support actions including RESTRICT, which blocks the operation if it would break the reference; CASCADE, which propagates the delete or update to dependent tuples; SET NULL, which sets the foreign key to null; or SET DEFAULT, which assigns a default value, depending on the system's implementation. Check constraints enforce custom business rules on attribute values within a relation, such as ensuring an employee's age is greater than 18 or a salary exceeds a minimum threshold, by evaluating a Boolean condition during data modification. These constraints are declarative, specified at the table level, and apply to single or multiple columns, rejecting operations that fail the condition to preserve semantic correctness.
Unique constraints extend beyond primary keys by ensuring that values in one or more columns are distinct across all tuples in a relation, allowing null values (unlike primary keys) to support alternate unique identifiers, such as email addresses in a user table. This prevents duplicates in non-primary attributes while permitting flexibility for optional requirements.
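The constraint types above can be sketched together in one DDL fragment; table and column names are illustrative, and the exact referential-action syntax varies slightly by RDBMS.

```sql
CREATE TABLE users (
    user_id INTEGER PRIMARY KEY,        -- entity integrity: non-null, no duplicates
    email   VARCHAR(255) UNIQUE,        -- unique constraint; NULLs still permitted
    age     INTEGER CHECK (age > 18)    -- check constraint enforcing a business rule
);

CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    user_id  INTEGER REFERENCES users (user_id)  -- referential integrity
        ON DELETE CASCADE               -- deleting a user removes their orders
        ON UPDATE RESTRICT              -- block key updates that would orphan rows
);
```

An INSERT into orders with a user_id absent from users, or an INSERT into users with age 15, would be rejected outright.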

Querying and Manipulation

Relational Algebra Operations

Relational algebra provides a procedural framework for querying and manipulating relations in the relational model, where each operation takes one or more relations as input and yields a new relation as output. Introduced by E. F. Codd in 1970, it emphasizes set-theoretic foundations to ensure precise and structured manipulation. The operations are designed to be composable, forming a closed system that maintains relational integrity throughout computations. The fundamental operations, often termed primitive, encompass selection, projection, union, set difference, Cartesian product, and rename. These primitives enable the expression of basic filtering and combination tasks. Selection, symbolized as $\sigma$, filters tuples from a relation $R$ that satisfy a predicate $P$, defined formally as $\sigma_P(R) = \{ t \mid t \in R \land P(t) \}$, where $P$ involves comparisons like equality or inequality and logical connectives. Projection, denoted $\Pi$, extracts specified attributes from $R$ while eliminating duplicates to ensure the result remains a relation, expressed as $\Pi_{A_1, A_2, \dots, A_k}(R)$, with $A_1$ to $A_k$ as the chosen attributes. Union, indicated by $\cup$, merges tuples from two type-compatible relations $R$ and $S$ (same arity and corresponding domains), yielding $R \cup S = \{ t \mid t \in R \lor t \in S \}$, with duplicates removed. Set difference, using $-$, identifies tuples unique to $R$ relative to $S$: $R - S = \{ t \mid t \in R \land t \notin S \}$, applicable only to compatible relations. Cartesian product, $\times$, generates all possible pairings of tuples from $R$ and $S$: $R \times S = \{ tq \mid t \in R \land q \in S \}$, assuming attribute names are distinct or renamed if overlapping. Rename, $\rho$, reassigns names to relations or attributes, such as $\rho_T(R)$ to designate $R$ as $T$, facilitating composition without name conflicts. Derived operations build upon the primitives to handle common relational tasks more directly, including join, intersection, and division.
Natural join, $\bowtie$, links $R$ and $S$ on matching values of shared attributes, equivalent to a theta join (generalized condition) restricted to equality, and formally $R \bowtie S = \Pi_X(\sigma_P(R \times S))$, where $P$ enforces equality on the common attributes and $X$ selects the output attributes. Theta join extends this to arbitrary conditions in $P$, such as inequalities. Intersection, $\cap$, retrieves shared tuples: $R \cap S = \{ t \mid t \in R \land t \in S \}$, derivable as $R - (R - S)$ and requiring compatibility. Division, $\div$, identifies attribute values in the projection of $R$ (excluding $S$'s attributes) that associate with every tuple in $S$: $R \div S = \{ t \mid t \in \Pi_{R-S}(R) \land \forall u \in S\,(tu \in R) \}$, useful for queries like "all parts supplied by every supplier." These operations exhibit closure: any composition results in a valid relation, enabling the construction of arbitrary query expressions through nesting and sequencing. This expressiveness allows relational algebra to represent all information-retrieval requests expressible in the model, serving as the theoretical core for languages like SQL.
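As a small worked composition over an Employees relation (attribute names illustrative), the names of employees in the Engineering department are obtained by nesting a selection inside a projection:

```latex
\Pi_{\mathit{Name}}\bigl(\sigma_{\mathit{Department} = \text{'Engineering'}}(\mathit{Employees})\bigr)
```

Because every operation returns a relation, the projection consumes the selection's output directly; the expression corresponds to the SQL query SELECT DISTINCT Name FROM Employees WHERE Department = 'Engineering'.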

SQL as the Standard Language

SQL, or Structured Query Language, emerged as the declarative language for managing and querying relational databases, providing a standardized interface that translates relational-model concepts into practical syntax for data operations. Developed in 1974 by Donald Chamberlin and Raymond Boyce as SEQUEL (Structured English QUEry Language) for IBM's System R research project, it was designed to demonstrate the viability of Edgar F. Codd's relational model in a prototype database system. The language was later shortened to SQL due to trademark issues and evolved through System R's phases, unifying data definition, manipulation, and view mechanisms by 1976. This foundation enabled SQL to become the industry standard, influencing commercial systems like IBM's SQL/DS and DB2. SQL is categorized into sublanguages that handle distinct aspects of database interaction. Data Definition Language (DDL) includes commands like CREATE and ALTER to define and modify database structures such as tables and schemas, while DROP removes them. Data Manipulation Language (DML) encompasses SELECT for querying data, INSERT for adding rows, UPDATE for modifying existing data, and DELETE for removing rows. Data Control Language (DCL) manages access with GRANT to assign privileges and REVOKE to withdraw them, ensuring security over database objects. At the heart of SQL lies the SELECT statement, which retrieves data from one or more tables using a structured clause syntax that supports complex filtering and aggregation. The basic form is SELECT column_list FROM table_list [WHERE condition] [GROUP BY columns] [HAVING condition] [ORDER BY columns];, where FROM specifies the source tables, WHERE filters rows before grouping, GROUP BY aggregates rows into groups, HAVING applies conditions to groups, and ORDER BY sorts the results. Joins, such as INNER JOIN or LEFT JOIN, combine rows from multiple tables based on related columns, while subqueries, which are nested SELECT statements, allow embedding queries within clauses like WHERE or FROM for advanced filtering, such as selecting employees with salaries above the departmental average.
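The departmental-average example just mentioned can be sketched with a join and a correlated subquery; all table and column names (employees, departments, dept_id, salary) are illustrative.

```sql
SELECT e.name, e.salary, d.dept_name
FROM employees AS e
INNER JOIN departments AS d
        ON e.dept_id = d.dept_id       -- join on the related column
WHERE e.salary > (SELECT AVG(e2.salary)  -- correlated subquery:
                  FROM employees AS e2   -- average within the same department
                  WHERE e2.dept_id = e.dept_id)
ORDER BY d.dept_name, e.salary DESC;
```

The subquery is re-evaluated per outer row (conceptually), comparing each employee against their own department's average.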
SQL's standardization began with ANSI's adoption as X3.135 in 1986, followed by ISO as 9075 in 1987, establishing core syntax and semantics across implementations. The ISO/IEC 9075 standard, now in its 2023 edition, comprises nine parts, including SQL/Foundation for core language elements and optional modules like SQL/XML for document handling; it defines conformance in terms of Core SQL (mandatory features) plus optional feature packages beyond it. Over time, SQL has evolved from SQL-86's basic relational operations to SQL:1999's introduction of Common Table Expressions (CTEs) for readable subquery reuse and SQL:2003's window functions for analytics like ROW_NUMBER() over ordered partitions without collapsing rows. SQL:2016 added support for storing and querying JSON data, while SQL:2023 enhances this with a native JSON type, scalar functions, and simplified accessors like dot notation for nested objects, alongside improvements to recursive CTEs for handling cycles in hierarchical data. These advancements support modern analytics workloads while maintaining backward compatibility.

Database Design

Normalization Process

The normalization process in relational databases involves systematically decomposing relations into smaller, well-structured components to eliminate data redundancies and dependency anomalies while preserving the information content of the original database. This step-by-step refinement ensures that the database schema adheres to progressively stricter normal forms, based on constraints known as functional dependencies. The goal is to design a schema that minimizes update, insertion, and deletion anomalies, thereby improving data integrity and consistency. Functional dependencies form the foundational constraints in this process. A functional dependency (FD) exists in a relation R when one set of attributes X functionally determines another set Y, denoted as X → Y, meaning that for any two tuples in R that agree on X, they must also agree on Y. This concept was introduced as part of the relational model to capture semantic relationships between attributes. FDs help identify potential redundancies, such as when non-key attributes depend on only part of a candidate key, leading to anomalies during data modifications. To infer all implied FDs from a given set, Armstrong's axioms provide a complete set of inference rules. These axioms, developed by William W. Armstrong, include three primary rules: reflexivity, augmentation, and transitivity. Reflexivity states that if Y is a subset of X, then X → Y holds trivially. Augmentation asserts that if X → Y, then for any Z, XZ → YZ. Transitivity implies that if X → Y and Y → Z, then X → Z. Additional derived rules, such as union and decomposition, can be proven from these basics, ensuring soundness and completeness for FD inference. The normalization process progresses through a series of normal forms, each building on the previous to address specific types of dependencies. First Normal Form (1NF) requires that all attributes in a relation contain atomic (indivisible) values, eliminating repeating groups or multivalued attributes within tuples.
This ensures the relation resembles a mathematical table with no nested structures. Second Normal Form (2NF) extends 1NF by requiring that no non-prime attribute (one not part of any candidate key) is partially dependent on any candidate key. In other words, every non-key attribute must depend on the entire candidate key, not just a portion of it. This eliminates partial dependencies, which can cause update anomalies in relations with composite keys. Third Normal Form (3NF) further refines 2NF by prohibiting transitive dependencies, where a non-prime attribute depends on another non-prime attribute rather than directly on a candidate key. A relation is in 3NF if, for every FD X → Y, either X is a superkey or each attribute in Y - X is prime. These forms were formalized to free relations from insertion, update, and deletion dependencies. Boyce-Codd Normal Form (BCNF) imposes a stricter condition than 3NF: for every non-trivial FD X → Y in the relation, X must be a superkey. This addresses cases where 3NF allows determinants that are not superkeys, potentially leading to anomalies in relations with overlapping candidate keys. BCNF ensures every determinant is a superkey, making it particularly useful for eliminating certain redundancy issues not resolved by 3NF. Higher normal forms target more complex dependencies. Fourth Normal Form (4NF) deals with multivalued dependencies (MVDs), where an attribute set is independent of another but both depend on a common key. A relation is in 4NF if it is in BCNF and has no non-trivial MVDs other than those implied by FDs. This prevents redundancy from independent multivalued facts, such as multiple hobbies per employee unrelated to skills. MVDs generalize FDs and were defined to capture such scenarios. Fifth Normal Form (5NF), also known as Project-Join Normal Form (PJ/NF), addresses join dependencies, where a relation can be decomposed into projections that can be rejoined without spurious tuples. A relation is in 5NF if it is in 4NF and every join dependency is implied by the candidate keys.
This form eliminates anomalies from cyclic dependencies across multiple relations. In practice, the normalization process begins by identifying all relevant FDs (and higher dependencies for advanced forms) using domain knowledge and Armstrong's axioms to compute closures. The schema is then decomposed iteratively: for violations of a target normal form, select an offending FD X → Y, project the relation into R1 = (X ∪ Y) and R2 = (R − Y) ∪ X, and replace the original with these projections. Decompositions must be lossless, meaning the natural join of the projections equals the original relation without spurious tuples, to preserve data. Losslessness is guaranteed when the common attributes of the two projections form a superkey of at least one of them. The process continues until the schema satisfies the desired normal form, balancing integrity with query efficiency.
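As an illustrative decomposition step with hypothetical names: an Orders relation carrying CustomerCity exhibits the transitive dependency OrderID → CustomerID → CustomerCity, violating 3NF. Projecting on the offending FD CustomerID → CustomerCity yields two relations whose shared attribute, CustomerID, is the key of the new Customers relation, so the natural join losslessly restores the original.

```sql
-- After decomposition: CustomerCity is stored once per customer.
CREATE TABLE Customers (
    CustomerID   INTEGER PRIMARY KEY,  -- key of this projection (lossless join)
    CustomerCity VARCHAR(50)
);

CREATE TABLE Orders (
    OrderID    INTEGER PRIMARY KEY,
    ProductID  INTEGER,
    CustomerID INTEGER REFERENCES Customers (CustomerID)
);
```

Updating a customer's city now touches one row instead of every order for that customer, removing the update anomaly.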

Denormalization and Performance Considerations

Denormalization involves intentionally introducing redundancy into a relational database that has been normalized to higher normal forms, such as Third Normal Form (3NF), to enhance query performance at the expense of storage efficiency and data consistency maintenance. This technique counters the strict elimination of redundancy in normalization by selectively duplicating data, thereby reducing the computational overhead of joins and aggregations during read operations. Common denormalization strategies include creating pre-joined tables, where data from multiple normalized tables is combined into a single table to eliminate runtime joins for frequently queried combinations. For example, in an order-processing system, customer and order details might be merged into one table to speed up retrieval of order histories. Another approach is storing precomputed aggregates, which hold the results of common aggregation functions like sums or averages, avoiding repeated calculations on large datasets. This is particularly useful for reporting queries involving totals, such as monthly sales figures stored directly in a denormalized table. Clustering, as a related technique, groups related records physically or logically within tables to minimize data scattering, facilitating faster scans and range queries without relying solely on indexes. The primary trade-offs of denormalization center on improved read performance versus increased risks of update anomalies and higher storage costs. By duplicating data, queries can execute faster, often reducing response times by orders of magnitude for join-heavy operations, but updates require propagating changes across redundant copies, potentially leading to inconsistencies if not managed carefully. Storage overhead rises due to redundancy, which can be significant in large-scale systems, though this is offset in read-intensive environments where query speed is paramount.
Denormalization is most appropriate for high-read workloads, such as analytical reporting or online analytical processing (OLAP) systems, where complex queries dominate over the frequent updates typical of online transaction processing (OLTP). In OLAP scenarios, denormalized schemas support multidimensional analysis by flattening hierarchies, enabling sub-second responses on terabyte-scale data. Conversely, OLTP environments, focused on concurrent transactions, generally avoid extensive denormalization to preserve consistency during writes. Modern relational database management systems (RDBMS) provide materialized views as a controlled mechanism for denormalization, storing precomputed query results that can be refreshed periodically or incrementally. These views act as persisted denormalized tables, combining the benefits of precomputation for fast reads with automated maintenance to mitigate update anomalies. For instance, Oracle's materialized views support equi-joins and aggregations optimized for data warehousing, reducing query times while integrating with the underlying normalized schema. This approach, rooted in incremental view maintenance techniques, balances performance gains with consistency in production environments.
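A materialized view for the monthly-sales aggregate discussed above might be sketched as follows (PostgreSQL-style syntax; table and column names are illustrative):

```sql
-- Precomputed aggregate stored as a denormalized table.
CREATE MATERIALIZED VIEW monthly_sales AS
SELECT date_trunc('month', order_date) AS month,
       SUM(amount)                     AS total_sales
FROM orders
GROUP BY date_trunc('month', order_date);

-- Periodic refresh keeps the stored results consistent with the base table:
REFRESH MATERIALIZED VIEW monthly_sales;
```

Reporting queries then read monthly_sales directly instead of re-aggregating the orders table on every request.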

Advanced Features

Transactions and ACID Properties

In relational databases, a transaction is defined as a logical unit of work consisting of a sequence of operations, such as reads and writes, that are executed as a single, indivisible entity to maintain data consistency. Transactions typically begin with a BEGIN statement, proceed through a series of database operations, and conclude with either a COMMIT to permanently apply the changes or a ROLLBACK to undo them entirely, ensuring that partial failures do not leave the database in an inconsistent state. This mechanism allows complex operations, like transferring funds between accounts, to be treated atomically, preventing issues such as overdrafts if one step fails. The reliability of transactions in relational databases is ensured through the ACID properties (atomicity, consistency, isolation, durability), a set of guarantees that ensure reliable transaction processing; the acronym was coined by Theo Härder and Andreas Reuter in 1983. Atomicity requires that a transaction is executed completely or not at all; if any operation fails, the entire transaction is rolled back, restoring the database to its pre-transaction state. Consistency mandates that a transaction brings the database from one valid state to another, preserving integrity constraints such as primary keys, foreign keys, and check constraints after completion. Isolation ensures that concurrent transactions do not interfere with each other, making each transaction appear to execute in isolation even when running in parallel. Durability guarantees that once a transaction is committed, its effects are permanently stored, surviving subsequent system failures through techniques like write-ahead logging. To balance isolation with performance in multi-user environments, relational databases implement varying isolation levels as defined by the ANSI SQL standard, which specify the degree to which concurrent transactions are shielded from each other's effects.
The read uncommitted level allows a transaction to read data modified by another uncommitted transaction, potentially leading to dirty reads but maximizing concurrency. Read committed prevents dirty reads by ensuring reads only from committed data, though it permits non-repeatable reads where the same query may yield different results within a transaction. Repeatable read avoids non-repeatable reads by locking read data until the transaction ends, but it may still allow phantom reads from new insertions by other transactions. The strictest, serializable, fully emulates sequential execution, preventing all anomalies including phantoms through techniques like locking or timestamping, at the cost of reduced concurrency. For distributed relational databases spanning multiple nodes, the two-phase commit (2PC) protocol coordinates transactions to achieve atomicity and consistency across sites. In the first phase, a coordinator polls participants to prepare the transaction; each votes yes if it can commit locally or no if it cannot, with all logging their intent durably. If all vote yes, the second phase issues a global commit, propagating the decision; otherwise, an abort is sent, and all roll back. This ensures that either all sites commit or none do, though it can block if the coordinator fails, requiring recovery mechanisms.
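The fund-transfer scenario described earlier can be sketched in standard SQL; the accounts table and account IDs are illustrative.

```sql
-- Optionally choose the strictest ANSI isolation level first:
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

BEGIN;                                                -- start the transaction
UPDATE accounts SET balance = balance - 100
WHERE account_id = 1;                                 -- debit one account
UPDATE accounts SET balance = balance + 100
WHERE account_id = 2;                                 -- credit the other
-- If either update fails, ROLLBACK undoes both (atomicity);
-- otherwise COMMIT makes the changes permanent (durability):
COMMIT;
```

Between BEGIN and COMMIT, other transactions see either both updates or neither, never a state where the money has left one account but not arrived in the other.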

Stored Procedures, Triggers, and Views

Stored procedures are pre-compiled blocks of SQL code stored in the database that can be invoked repeatedly to perform complex operations, such as data manipulation or business logic execution, often with input and output parameters for flexibility. They originated as an extension to SQL in commercial RDBMS implementations, with Oracle introducing stored procedures in Oracle7 in 1992 to enhance reusability and reduce network traffic by executing code server-side. Stored procedures support error handling through exception blocks and can include conditional logic, making them suitable for encapsulating database-side programming. A basic example of creating a stored procedure in PL/SQL, Oracle's procedural extension to SQL, is as follows:

sql

CREATE OR REPLACE PROCEDURE update_employee_salary(
    emp_id    IN NUMBER,
    raise_pct IN NUMBER
) IS
BEGIN
    UPDATE employees
    SET salary = salary * (1 + raise_pct / 100)
    WHERE employee_id = emp_id;
    IF SQL%ROWCOUNT = 0 THEN
        RAISE_APPLICATION_ERROR(-20001, 'Employee not found');
    END IF;
    COMMIT;
EXCEPTION
    WHEN OTHERS THEN
        ROLLBACK;
        RAISE;
END update_employee_salary;

This procedure updates an employee's salary by a percentage and includes error handling if no rows are affected. In Microsoft SQL Server, Transact-SQL (T-SQL) provides similar functionality, allowing procedures to accept parameters and manage transactions internally.

Triggers are special types of stored procedures that automatically execute in response to specific database events, such as INSERT, UPDATE, or DELETE operations on a table or view, enabling automation of tasks like integrity enforcement or auditing. They were introduced alongside stored procedures in early RDBMS to enforce rules implicitly without application-level code, with Oracle supporting them since version 7. DML triggers, the most common type, fire for each affected row (row-level) or once per statement (statement-level), and can access special variables like OLD and NEW to reference pre- and post-event data. For instance, a T-SQL trigger in SQL Server for audit logging on an UPDATE event might look like this:

```sql
CREATE TRIGGER tr_employees_audit ON employees
AFTER UPDATE AS
BEGIN
    -- Log one audit row per updated employee row,
    -- using the "inserted" pseudo-table of post-update rows.
    INSERT INTO audit_log (table_name, operation, changed_at)
    SELECT 'employees', 'UPDATE', GETDATE()
    FROM inserted;
END;
```

This trigger logs updates to an audit table automatically after the operation completes. Triggers promote data integrity by responding immediately to changes, though they require careful design to avoid recursive firing or performance issues.

Views serve as virtual tables derived from one or more base tables via a stored query, providing a simplified or restricted perspective of the underlying data without storing it physically, which aids in abstraction and security. Introduced in the original SQL standard (ANSI X3.135-1986), views hide complex joins or sensitive columns, enabling row-level security by limiting access to subsets of data based on user privileges. They can be updatable if based on a single table with no aggregates, allowing modifications that propagate to the base tables. An example of creating a view in standard SQL, compatible with systems like PostgreSQL, is:

```sql
CREATE VIEW active_employees AS
SELECT employee_id, first_name, last_name, department
FROM employees
WHERE status = 'active';
```

Querying this view (SELECT * FROM active_employees) returns only current employees, abstracting the full table and enforcing access controls. Views thus facilitate modular database design by decoupling applications from physical schema changes.
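The view above can be exercised end to end with Python's built-in sqlite3 module. The schema and sample rows below are illustrative, mirroring the hypothetical employees table used in this section.

```python
# Runnable sketch of the active_employees view using sqlite3 (stdlib).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (
        employee_id INTEGER PRIMARY KEY,
        first_name  TEXT,
        last_name   TEXT,
        department  TEXT,
        status      TEXT
    );
    INSERT INTO employees VALUES
        (1, 'Ada',  'Lovelace', 'Engineering', 'active'),
        (2, 'Alan', 'Turing',   'Research',    'inactive');
    CREATE VIEW active_employees AS
        SELECT employee_id, first_name, last_name, department
        FROM employees
        WHERE status = 'active';
""")

# The view filters out inactive rows and hides the status column.
rows = conn.execute("SELECT first_name FROM active_employees").fetchall()
print(rows)  # [('Ada',)]
```

Applications query `active_employees` like a table; the filtering predicate is applied by the database on every read, so the view always reflects the current base-table contents.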

Implementation

RDBMS Architecture

The ANSI/SPARC three-schema architecture provides a foundational framework for relational database management systems (RDBMS), dividing the database into three abstraction levels to promote data independence and modularity. The external level consists of multiple user views, each tailored to specific applications or end-users, presenting only relevant portions of the data in a customized format without exposing the underlying structure. The conceptual level defines the conceptual schema, encompassing the entire database's entities, attributes, relationships, and constraints in a community-wide model independent of physical storage details. The internal level handles the physical schema, specifying how data is stored on disk, including file organizations and access paths optimized for efficiency.

Mappings between these levels ensure data independence, allowing modifications at one level without impacting others. The external/conceptual mapping translates user views to the conceptual schema, supporting logical data independence by enabling view changes without altering the conceptual model. Similarly, the conceptual/internal mapping converts the conceptual schema to physical storage, providing physical data independence so storage optimizations can occur without affecting higher levels. This separation enhances system flexibility, as changes in user requirements or hardware can be isolated.

Core RDBMS components operate across these levels to manage queries and storage. The query processor handles SQL statement processing, comprising a parser that validates syntax and semantics, an optimizer that generates efficient execution plans using techniques like cost-based selection, and an executor that runs the plan via iterator-based operators to retrieve and manipulate data. The storage manager oversees persistence and access, including subcomponents like the buffer manager, which controls transfers between disk and main memory using a shared buffer pool with replacement policies such as LRU-2 to minimize I/O operations. It also incorporates a transaction manager to enforce ACID properties through locking protocols such as two-phase locking, together with logging, for concurrency control and recovery.

At the internal level, relations are stored using specific file structures to balance access efficiency and storage overhead. Heap files organize records in insertion order without sorting, suiting full scans or append-heavy workloads by allowing fast inserts at the file end. Sorted files maintain records in key order, facilitating range queries and equality searches through binary search, though inserts require costly maintenance to preserve ordering. Hashed files employ a hash function on a key to distribute records across buckets, enabling O(1) average-case lookups for equality selections at the expense of range query support.
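The buffer manager's eviction behavior can be sketched with a plain LRU policy. This is a simplified illustration: real buffer pools use variants such as LRU-2 and track pin counts and dirty pages, all omitted here, and the page identifiers and capacity are invented for the example.

```python
# Sketch of an LRU buffer pool, simulating disk reads on cache misses.
from collections import OrderedDict

class BufferPool:
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()   # page_id -> page data, oldest first
        self.io_count = 0            # number of simulated disk reads

    def get_page(self, page_id):
        if page_id in self.pages:
            # Hit: refresh recency so this page is evicted last.
            self.pages.move_to_end(page_id)
        else:
            # Miss: simulate a disk read, evicting the LRU page if full.
            self.io_count += 1
            if len(self.pages) >= self.capacity:
                self.pages.popitem(last=False)
            self.pages[page_id] = f"data-for-{page_id}"
        return self.pages[page_id]


pool = BufferPool(capacity=2)
pool.get_page("A")           # miss -> I/O 1
pool.get_page("B")           # miss -> I/O 2
pool.get_page("A")           # hit, no I/O; A becomes most recent
pool.get_page("C")           # miss -> I/O 3, evicts B (least recent)
print(pool.io_count)         # 3
print("B" in pool.pages)     # False
```

The working-set effect described above falls out directly: repeated access to "A" costs no additional I/O, while the cold page "B" is sacrificed once capacity is exceeded.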

Indexing and Optimization Techniques

Indexing in relational database management systems (RDBMS) enhances query performance by providing efficient access structures, reducing the need for full table scans on large datasets. Indexes organize data in a way that allows the query executor to locate and retrieve specific rows quickly, often at logarithmic cost. Common index types include B-trees, hash indexes, and bitmap indexes, each suited to different query patterns and data characteristics.

B-tree indexes, introduced as a balanced tree structure for maintaining ordered data, serve as the default for most equality and range queries in RDBMS. They consist of internal nodes pointing to child nodes or leaf nodes containing key-value pairs, ensuring balanced height for O(log n) search, insertion, and deletion operations. B-trees excel in scenarios requiring sorted access, such as ORDER BY clauses or range conditions like salary BETWEEN 50000 AND 80000.

Hash indexes, designed for exact-match lookups, use a hash function to map keys to buckets in an array-like structure, enabling constant-time O(1) average-case access for equality predicates. They are particularly effective for point queries but less suitable for range scans due to the unordered nature of hash buckets. Extendible hashing variants address bucket overflows dynamically, making them adaptable to varying data volumes in relational systems.

Bitmap indexes are optimized for attributes with low cardinality, where the number of distinct values is small relative to the row count, such as gender or status flags. Each distinct value is represented by a bitmap, a bit vector whose length equals the table's row count, in which a '1' indicates the presence of that value in a row. This structure supports fast bitwise operations for conjunctive and disjunctive queries, reducing I/O for selective predicates on low-cardinality columns.

Query optimization in RDBMS involves selecting the most efficient execution strategy from multiple possible plans, balancing factors like CPU, I/O, and memory costs.
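The bitmap-index evaluation described above reduces a conjunctive predicate to a single bitwise AND. In this sketch, bitmaps are plain Python integers used as bit vectors (bit i corresponds to row i), and the two columns and their values are invented for the example.

```python
# Sketch of bitmap-index construction and conjunctive-query evaluation.

def build_bitmaps(column):
    """One bitmap per distinct value: bit i is set if row i holds that value."""
    bitmaps = {}
    for row, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row)
    return bitmaps


status = ["active", "inactive", "active", "active"]
dept   = ["sales",  "sales",    "hr",     "sales"]

status_idx = build_bitmaps(status)   # {'active': 0b1101, 'inactive': 0b0010}
dept_idx   = build_bitmaps(dept)     # {'sales': 0b1011, 'hr': 0b0100}

# Conjunctive predicate: status = 'active' AND dept = 'sales'
match = status_idx["active"] & dept_idx["sales"]
rows = [i for i in range(len(status)) if (match >> i) & 1]
print(rows)  # [0, 3]
```

A disjunction (OR) would use `|` instead of `&`; either way the predicate is answered from the bitmaps alone, without touching the base rows until the final fetch.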
Cost-based optimization, pioneered in System R, estimates plan costs using statistics on table sizes, index selectivity, and cardinalities, building dynamic programming tables to evaluate join methods and access paths. This approach outperforms rule-based methods by adapting to data distribution, though it requires accurate statistics for reliable estimates. Heuristic rules complement cost-based techniques by applying predefined transformations early, such as pushing selections before joins or projecting only needed columns to minimize intermediate result sizes. These rules, like performing restrictions as early as possible, ensure efficient plan generation even for complex queries, reducing optimization time from exponential to polynomial in many cases.

Execution plans detail the sequence of operations for query processing, including access methods and join strategies. Index scans traverse only relevant index portions for selective queries, contrasting with table scans that read entire tables, which are preferable for low-selectivity conditions where index overhead exceeds benefits. Join order determination, often via dynamic programming, minimizes intermediate result sizes by joining smaller relations first, with bushy trees allowing parallel evaluation in modern optimizers.

Caching mechanisms, such as buffer pools, mitigate disk I/O by holding frequently accessed pages in memory, managed via least-recently-used (LRU) replacement policies to prioritize "hot" data. Buffer pools allocate fixed memory regions to cache table and index pages, enabling sub-millisecond access for repeated queries on working sets smaller than available RAM. Constraints like primary keys can influence index usage in caching, as they enforce uniqueness and accelerate lookups.

Recent advancements incorporate AI-driven auto-tuning for query optimization, leveraging machine learning to refine execution plans and parameters dynamically. In research prototypes such as Balsa and LEON, learned models analyze historical query patterns to suggest index configurations and join strategies, achieving up to 2-5x speedups on workloads with variable selectivity without manual intervention. These techniques represent a shift toward adaptive, self-optimizing RDBMS.
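The cost-based choice between an index scan and a table scan can be illustrated with a toy estimate in the spirit of the System R approach above. The cost formulas and statistics here are deliberately simplified assumptions, not a real optimizer: a table scan reads every page sequentially, while an index scan pays a B-tree descent plus, in the worst case, one random page fetch per matching row.

```python
# Toy cost-based access-path selection using page-I/O estimates.

def choose_access_path(table_pages, table_rows, selectivity, index_height=3):
    """Return the cheaper plan under a simplified page-count cost model."""
    # Sequential scan: every data page is read exactly once.
    table_scan_cost = table_pages
    # Index scan: tree descent, then one page fetch per qualifying row
    # (pessimistic: assumes each matching row lands on a different page).
    index_scan_cost = index_height + selectivity * table_rows
    return "index scan" if index_scan_cost < table_scan_cost else "table scan"


# Highly selective predicate: the index wins (3 + 100 vs 1000 page reads).
print(choose_access_path(table_pages=1000, table_rows=10000, selectivity=0.01))
# Unselective predicate: random fetches swamp a sequential scan (3 + 5000 vs 1000).
print(choose_access_path(table_pages=1000, table_rows=10000, selectivity=0.5))
```

This mirrors the trade-off stated above: below some selectivity threshold the index pays off, and beyond it the optimizer correctly falls back to a full scan.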

Modern Extensions

Distributed Relational Databases

Distributed relational databases extend traditional relational database management systems (RDBMS) by partitioning and replicating data across multiple nodes to handle increased load, ensure availability, and manage large-scale data volumes. This approach maintains relational integrity and SQL compatibility while addressing limitations of single-node systems. Key mechanisms include sharding for data distribution and replication for redundancy, often combined to balance performance and reliability.

Horizontal sharding, also known as horizontal partitioning, divides a database table into smaller subsets called shards, typically based on a shard key such as a user ID or geographic region, with each shard stored on a separate node. This strategy enables parallel processing of queries and scales write throughput by localizing operations to specific nodes. For instance, range-based sharding assigns contiguous key ranges to shards, while hash-based sharding distributes keys evenly using a hash function to minimize hotspots.

Replication complements sharding by maintaining multiple copies of data across nodes to enhance read performance and fault tolerance. In master-slave replication, a single master node handles all writes, propagating changes asynchronously or synchronously to slave nodes that serve read queries, reducing load on the master and providing failover options. Multi-master replication allows writes on multiple nodes, synchronizing changes among them, which supports higher write availability but introduces complexity in conflict resolution, often using last-write-wins or versioning schemes. MySQL, for example, supports both via its replication framework, where master-slave setups are common for read scaling, and group replication enables multi-master configurations.

Consistency models in distributed relational databases trade off between strong consistency, where all nodes reflect the latest committed data, and eventual consistency, where updates propagate over time.
The CAP theorem, formalized by Gilbert and Lynch, posits that in the presence of network partitions (P), a system can prioritize either consistency (C) or availability (A), but not both. Relational databases traditionally favor CP systems, ensuring consistent transactions across nodes via protocols like two-phase commit (2PC), but this may sacrifice availability during partitions. Some modern implementations relax consistency to AP models for better availability, accepting temporary inconsistencies resolved through reconciliation. Brewer's original conjecture highlighted these trade-offs in distributed systems.

Distributed joins pose significant challenges due to data locality, requiring data movement across nodes via techniques like broadcast, redistribution, or semi-joins, which incur high network and CPU overhead. For example, joining tables sharded on different keys may necessitate shipping entire partitions, exacerbating latency in large clusters. Optimization strategies include co-partitioning related tables on the same key to localize joins.

The two-phase commit protocol ensures atomicity in distributed transactions by coordinating a prepare phase and a commit phase across nodes, but its overhead, from multiple message rounds and blocking, can bottleneck performance, especially under high contention or failures. Extensions like presumed abort reduce this by assuming aborts on timeouts, minimizing coordinator involvement. Distributed transactions extend single-node ACID properties using such protocols, though at increased cost.

Middleware solutions like Vitess address these issues for MySQL-based systems by providing transparent sharding, query routing, and connection pooling across shards, allowing applications to interact with a unified database view while handling resharding without downtime. Vitess uses a keyspace-shard model, where a VSchema defines sharding rules, enabling efficient distributed operations.
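The hash-based sharding strategy described above amounts to a stable mapping from shard key to node. The sketch below uses MD5 purely for deterministic key distribution, not security; the node names are illustrative, and real systems (Vitess included) layer resharding and replica awareness on top of such a mapping.

```python
# Sketch of hash-based shard routing: stable hash of the key picks the node.
import hashlib

NODES = ["node-0", "node-1", "node-2"]

def shard_for(key, nodes=NODES):
    """Route a shard key to a node via a deterministic hash."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


# The same key always routes to the same node, so reads and writes
# for one user localize to a single shard.
assert shard_for("user:42") == shard_for("user:42")
placements = {k: shard_for(k) for k in ["user:1", "user:2", "user:3"]}
print(placements)
```

Note the classic weakness of modulo placement: changing `len(nodes)` remaps most keys, which is why production systems prefer consistent hashing or explicit shard-range metadata when resharding.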

Cloud-Native and NewSQL Systems

Cloud-native relational database management systems (RDBMS) are designed specifically for cloud environments, leveraging elastic infrastructure, managed services, and automation to provide seamless scaling and management without exposing underlying infrastructure concerns. Managed services for traditional RDBMS, such as Amazon RDS (including Aurora) and Google Cloud SQL, enable automatic resource provisioning and replication across distributed cloud regions. Amazon Aurora, a fully managed service compatible with MySQL and PostgreSQL, employs a cloud-native storage layer that separates compute from storage, allowing serverless scaling where capacity adjusts dynamically based on workload demands, achieving up to five times the throughput of standard MySQL instances. Similarly, Google Cloud SQL offers managed instances for MySQL, PostgreSQL, and SQL Server, with built-in automation for backups, patching, and replication, ensuring 99.99% availability through multi-zone deployments and pay-per-use pricing models.

NewSQL databases extend relational principles to distributed environments while preserving ACID compliance, addressing scalability limitations of traditional RDBMS in cloud settings. CockroachDB, a PostgreSQL-compatible system, distributes data across clusters using a key-value store foundation, supporting horizontal scaling and geo-partitioning for global applications without sacrificing transactional consistency. TiDB, MySQL-compatible, employs a hybrid architecture combining SQL processing with a distributed key-value backend, enabling elastic scaling to petabyte levels while maintaining consistency via Raft consensus. These systems build on distributed strategies to handle massive concurrency, making them suitable for cloud-native workloads such as high-concurrency transaction processing and real-time analytics.

Recent developments from 2024 to 2025 have integrated machine learning into relational databases for enhanced automation, particularly in query optimization. Oracle Autonomous Database incorporates Select AI, allowing natural-language prompts to generate and explain SQL queries, while machine learning algorithms automatically tune performance by adjusting indexes and resource allocation in real time.
Additionally, hybrid SQL/vector features have emerged in cloud systems, with vector search capabilities in several managed relational services enabling unified handling of structured relational data alongside unstructured elements for AI-driven applications. Key advantages of cloud-native and NewSQL systems include auto-scaling to match demand, reducing over-provisioning, and pay-per-use billing that aligns costs with actual usage, potentially lowering expenses by up to 50% compared to on-premises setups. Market adoption has accelerated, with cloud database services projected to reach $23.84 billion in 2025; about 30% of organizations now operate in fully cloud-native modes, driving relational database deployments toward greater cloud reliance.

References
