Relational database
A relational database (RDB[1]) is a database based on the relational model of data, as proposed by E. F. Codd in 1970.[2]
A Relational Database Management System (RDBMS) is a type of database management system that stores data in a structured format using rows and columns.
Many relational database systems are equipped with the option of using SQL (Structured Query Language) for querying and updating the database.[3]
History
The concept of the relational database was defined by E. F. Codd at IBM in 1970. Codd introduced the term relational in his research paper "A Relational Model of Data for Large Shared Data Banks".[2] In this paper and later papers, he defined what he meant by relation. One well-known definition of what constitutes a relational database system is composed of Codd's 12 rules.
However, no commercial implementations of the relational model conform to all of Codd's rules,[4] so the term has gradually come to describe a broader class of database systems, which at a minimum:
- Present the data to the user as relations (a presentation in tabular form, i.e. as a collection of tables with each table consisting of a set of rows and columns);
- Provide relational operators to manipulate the data in tabular form.
In 1974, IBM began developing System R, a research project to develop a prototype RDBMS.[5][6] The first system sold as an RDBMS was Multics Relational Data Store (June 1976).[7][8][citation needed] Oracle was released in 1979 by Relational Software, now Oracle Corporation.[9] Ingres and IBM BS12 followed. Other examples of an RDBMS include IBM Db2, SAP Sybase ASE, and Informix. In 1984, development of the first RDBMS for the Macintosh began under the code name Silver Surfer; it was released in 1987 as 4th Dimension and is known today as 4D.[10]
The first systems that were relatively faithful implementations of the relational model were from:
- University of Michigan – Micro DBMS (1969)[11]
- Massachusetts Institute of Technology (1971)[12]
- IBM UK Scientific Centre at Peterlee – IS1 (1970–72),[13] and its successor, PRTV (1973–79).[14]
The most common definition of an RDBMS is a product that presents a view of data as a collection of rows and columns, even if it is not based strictly upon relational theory. By this definition, RDBMS products typically implement some but not all of Codd's 12 rules.
A second school of thought argues that if a database does not implement all of Codd's rules (or the current understanding of the relational model, as expressed by Christopher J. Date, Hugh Darwen and others), it is not relational. This view, shared by many theorists and other strict adherents to Codd's principles, would disqualify most DBMSs as not relational. For clarification, they often refer to some RDBMSs as truly-relational database management systems (TRDBMS), naming others pseudo-relational database management systems (PRDBMS).[citation needed]
As of 2009, most commercial relational DBMSs employ SQL as their query language.[15]
Alternative query languages have been proposed and implemented, notably the pre-1996 implementation of Ingres QUEL.
Relational model
A relational model organizes data into one or more tables (or "relations") of columns and rows, with a unique key identifying each row. Rows are also called records or tuples.[16] Columns are also called attributes. Generally, each table/relation represents one "entity type" (such as customer or product). The rows represent instances of that type of entity (such as "Lee" or "chair") and the columns represent values attributed to that instance (such as address or price).
For example, each row of a class table corresponds to a class, and a class corresponds to multiple students, so the relationship between the class table and the student table is "one to many".[17]
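The class/student relationship above can be sketched in SQL. The following is a minimal, illustrative example using SQLite via Python's sqlite3 module; the table and column names (class, student, class_id) are assumptions for illustration, not part of the original text.

```python
import sqlite3

# One-to-many sketch: each student row carries the unique key of its class.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE class (
        class_id INTEGER PRIMARY KEY,
        name     TEXT NOT NULL
    );
    CREATE TABLE student (
        student_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        class_id   INTEGER REFERENCES class(class_id)  -- foreign key: many students per class
    );
    INSERT INTO class VALUES (1, 'Algebra');
    INSERT INTO student VALUES (10, 'Lee', 1), (11, 'Kim', 1);
""")
# Each class row corresponds to many student rows.
rows = conn.execute("""
    SELECT class.name, COUNT(*) FROM class
    JOIN student ON student.class_id = class.class_id
    GROUP BY class.class_id
""").fetchall()
print(rows)  # [('Algebra', 2)]
```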
Keys
Each row in a table has its own unique key. Rows in a table can be linked to rows in other tables by adding a column for the unique key of the linked row (such columns are known as foreign keys). Codd showed that data relationships of arbitrary complexity can be represented by a simple set of concepts.[2]
Part of this processing involves consistently being able to select or modify one and only one row in a table. Therefore, most physical implementations have a unique primary key (PK) for each row in a table. When a new row is written to the table, a new unique value for the primary key is generated; this is the key that the system uses primarily for accessing the table. System performance is optimized for PKs. Other, more natural keys may also be identified and defined as alternate keys (AK). Often several columns are needed to form an AK (this is one reason why a single integer column is usually made the PK). Both PKs and AKs have the ability to uniquely identify a row within a table. Additional technology may be applied to ensure a unique ID across the world, a globally unique identifier, when there are broader system requirements.
The primary keys within a database are used to define the relationships among the tables. When a PK migrates to another table, it becomes a foreign key (FK) in the other table. When each cell can contain only one value and the PK migrates into a regular entity table, this design pattern can represent either a one-to-one or one-to-many relationship. Most relational database designs resolve many-to-many relationships by creating an additional table that contains the PKs from both of the other entity tables – the relationship becomes an entity; the resolution table is then named appropriately and the two FKs are combined to form a PK. The migration of PKs to other tables is the second major reason why system-assigned integers are used normally as PKs; there is usually neither efficiency nor clarity in migrating a bunch of other types of columns.
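The resolution of a many-to-many relationship described above can be sketched as a junction table whose primary key combines the two migrated foreign keys. This is an illustrative SQLite sketch; the names (student, course, enrollment) are assumptions, not from the original text.

```python
import sqlite3

# Many-to-many resolved via a resolution (junction) table: the two FKs
# together form its composite primary key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE course  (course_id  INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE enrollment (
        student_id INTEGER REFERENCES student(student_id),
        course_id  INTEGER REFERENCES course(course_id),
        PRIMARY KEY (student_id, course_id)   -- the two FKs combine to form the PK
    );
    INSERT INTO student VALUES (1, 'Lee'), (2, 'Kim');
    INSERT INTO course  VALUES (100, 'Databases');
    INSERT INTO enrollment VALUES (1, 100), (2, 100);
""")
# All students enrolled in course 100:
names = [r[0] for r in conn.execute("""
    SELECT s.name FROM student AS s
    JOIN enrollment AS e ON e.student_id = s.student_id
    WHERE e.course_id = 100 ORDER BY s.name
""")]
print(names)  # ['Kim', 'Lee']
```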
Relationships
Relationships are logical connections between different tables (entities), established on the basis of interaction among these tables. These relationships can be modelled as an entity-relationship model.
Transactions
For a database management system (DBMS) to operate efficiently and accurately, it must use ACID transactions.[18][19][20]
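Atomicity, the "A" in ACID, can be sketched with a transfer that must either fully commit or fully roll back. This is a minimal illustration using SQLite via Python's sqlite3; the account schema is an assumption for the example.

```python
import sqlite3

# Atomicity sketch: both legs of a transfer commit together, or neither does.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER CHECK (balance >= 0))")
conn.execute("INSERT INTO account VALUES (1, 100), (2, 0)")
conn.commit()
try:
    with conn:  # the with-block is one transaction: commit on success, rollback on error
        conn.execute("UPDATE account SET balance = balance - 150 WHERE id = 1")  # violates CHECK
        conn.execute("UPDATE account SET balance = balance + 150 WHERE id = 2")
except sqlite3.IntegrityError:
    pass  # the failed transfer was rolled back as a unit
balances = conn.execute("SELECT balance FROM account ORDER BY id").fetchall()
print(balances)  # [(100,), (0,)] -- both rows unchanged
```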
Stored procedures
Part of the programming within an RDBMS is accomplished using stored procedures (SPs). Often procedures can be used to greatly reduce the amount of information transferred within and outside of a system. For increased security, the system design may grant access only to the stored procedures and not directly to the tables. Fundamental stored procedures contain the logic needed to insert new and update existing data. More complex procedures may be written to implement additional rules and logic related to processing or selecting the data.
Terminology
The relational database was first defined in June 1970 by Edgar Codd, of IBM's San Jose Research Laboratory.[2] Codd's view of what qualifies as an RDBMS is summarized in Codd's 12 rules. The relational database has become the predominant type of database. Other models besides the relational model include the hierarchical database model and the network model.
The table below summarizes some of the most important relational database terms and the corresponding SQL term:
| SQL term | Relational database term | Description |
|---|---|---|
| Row | Tuple or record | A data set representing a single item |
| Column | Attribute or field | A labeled element of a tuple, e.g. "Address" or "Date of birth" |
| Table | Relation or Base relvar | A set of tuples sharing the same attributes; a set of columns and rows |
| View or result set | Derived relvar | Any set of tuples; a data report from the RDBMS in response to a query |
Relations or tables
In a relational database, a relation is a set of tuples that have the same attributes. A tuple usually represents an object and information about that object. Objects are typically physical objects or concepts. A relation is usually described as a table, which is organized into rows and columns. All the data referenced by an attribute are in the same domain and conform to the same constraints.
The relational model specifies that the tuples of a relation have no specific order and that the tuples, in turn, impose no order on the attributes. Applications access data by specifying queries, which use operations such as select to identify tuples, project to identify attributes, and join to combine relations. Relations can be modified using the insert, delete, and update operators. New tuples can supply explicit values or be derived from a query. Similarly, queries identify tuples for updating or deleting.
Tuples are by definition unique. If a tuple contains a candidate or primary key then it is obviously unique; however, a row or record does not need a declared primary key to be a tuple, since the definition of a tuple requires uniqueness but not a defined key. Because a tuple is unique, its attributes by definition constitute a superkey.
Base and derived relations
All data are stored and accessed via relations. Relations that store data are called "base relations", and in implementations are called "tables". Other relations do not store data, but are computed by applying relational operations to other relations. These relations are sometimes called "derived relations". In implementations these are called "views" or "queries". Derived relations are convenient in that they act as a single relation, even though they may draw information from several relations. Also, derived relations can be used as an abstraction layer.
Domain
A domain describes the set of possible values for a given attribute, and can be considered a constraint on the value of the attribute. Mathematically, attaching a domain to an attribute means that any value for the attribute must be an element of the specified set. The character string "ABC", for instance, is not in the integer domain, but the integer value 123 is. Another example of a domain describes the possible values for the field "CoinFace" as ("Heads","Tails"). So, the field "CoinFace" will not accept input values like (0,1) or (H,T).
Constraints
Constraints make it possible to further restrict the domain of an attribute. For instance, a constraint can restrict a given integer attribute to values between 1 and 10. Constraints provide one method of implementing business rules in the database and support subsequent data use within the application layer. SQL implements constraint functionality in the form of check constraints. Constraints restrict the data that can be stored in relations. These are usually defined using expressions that result in a Boolean value, indicating whether or not the data satisfies the constraint. Constraints can apply to single attributes, to a tuple (restricting combinations of attributes) or to an entire relation. Since every attribute has an associated domain, the domain itself imposes a constraint (a domain constraint). The two principal rules for the relational model are known as entity integrity and referential integrity.
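The two constraints discussed above, a 1-to-10 range on an integer attribute and the "CoinFace" domain, can be sketched as SQL CHECK constraints. This is an illustrative SQLite example; the table name "toss" and column names are assumptions.

```python
import sqlite3

# CHECK constraints: a numeric range and a domain-style value restriction.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE toss (
        rating   INTEGER CHECK (rating BETWEEN 1 AND 10),
        coinface TEXT    CHECK (coinface IN ('Heads', 'Tails'))
    )
""")
conn.execute("INSERT INTO toss VALUES (7, 'Heads')")   # satisfies both constraints
rejected = []
for row in [(11, 'Heads'), (5, 'H')]:                  # out of range, out of domain
    try:
        conn.execute("INSERT INTO toss VALUES (?, ?)", row)
    except sqlite3.IntegrityError:
        rejected.append(row)
print(rejected)  # [(11, 'Heads'), (5, 'H')] -- both violating rows were refused
```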
Primary key
Every relation/table has a primary key, this being a consequence of a relation being a set.[21] A primary key uniquely specifies a tuple within a table. While natural attributes (attributes used to describe the data being entered) are sometimes good primary keys, surrogate keys are often used instead. A surrogate key is an artificial attribute assigned to an object which uniquely identifies it (for instance, in a table of information about students at a school they might all be assigned a student ID in order to differentiate them). The surrogate key has no intrinsic (inherent) meaning, but rather is useful through its ability to uniquely identify a tuple. Another common occurrence, especially in regard to N:M cardinality, is the composite key. A composite key is a key made up of two or more attributes within a table that (together) uniquely identify a record.[22]
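A surrogate key can be sketched as a system-assigned integer. In SQLite, an INTEGER PRIMARY KEY column is assigned automatically when no value is supplied; the student table below is an illustrative assumption, not from the original text.

```python
import sqlite3

# Surrogate key sketch: the system assigns the identifying value, so two
# rows with identical natural data remain distinguishable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
cur = conn.execute("INSERT INTO student (name) VALUES ('Lee')")
first_id = cur.lastrowid   # system-assigned surrogate key
cur = conn.execute("INSERT INTO student (name) VALUES ('Lee')")  # same natural data
second_id = cur.lastrowid
print(first_id, second_id)  # 1 2 -- two students named 'Lee', uniquely identified
```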
Foreign key
A foreign key is a field in a relational table that matches the primary key column of another table, relating the two tables. Foreign keys need not have unique values in the referencing relation. A foreign key can be used to cross-reference tables, and it effectively uses the values of attributes in the referenced relation to restrict the domain of one or more attributes in the referencing relation. The concept is described formally as: "For all tuples in the referencing relation projected over the referencing attributes, there must exist a tuple in the referenced relation projected over those same attributes such that the values in each of the referencing attributes match the corresponding values in the referenced attributes."
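Referential integrity, as formalized above, can be sketched with a foreign key that rejects values absent from the referenced table. This is an illustrative SQLite example (note that SQLite only enforces foreign keys after a PRAGMA is set); the department/employee schema is an assumption.

```python
import sqlite3

# Foreign key sketch: employee.dept_id must match an existing department.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforcement is opt-in
conn.executescript("""
    CREATE TABLE department (dept_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employee (
        emp_id  INTEGER PRIMARY KEY,
        dept_id INTEGER REFERENCES department(dept_id)
    );
    INSERT INTO department VALUES (1, 'Sales');
""")
conn.execute("INSERT INTO employee VALUES (100, 1)")       # OK: department 1 exists
try:
    conn.execute("INSERT INTO employee VALUES (101, 99)")  # no department 99
    violated = False
except sqlite3.IntegrityError:
    violated = True
print(violated)  # True -- the dangling reference was rejected
```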
Stored procedures
A stored procedure is executable code that is associated with, and generally stored in, the database. Stored procedures usually collect and customize common operations, like inserting a tuple into a relation, gathering statistical information about usage patterns, or encapsulating complex business logic and calculations. Frequently they are used as an application programming interface (API) for security or simplicity. Implementations of stored procedures on SQL RDBMSs often allow developers to take advantage of procedural extensions (often vendor-specific) to the standard declarative SQL syntax. Stored procedures are not part of the relational database model, but all commercial implementations include them.
Index
An index is one way of providing quicker access to data. Indices can be created on any combination of attributes on a relation. Queries that filter using those attributes can find matching tuples directly using the index (similar to a hash table lookup), without having to check each tuple in turn. This is analogous to using the index of a book to go directly to the page on which the information you are looking for is found, so that you do not have to read the entire book to find what you are looking for. Relational databases typically supply multiple indexing techniques, each of which is optimal for some combination of data distribution, relation size, and typical access pattern.

Indices are usually implemented via B+ trees, R-trees, and bitmaps. Indices are usually not considered part of the database, as they are considered an implementation detail, though indices are usually maintained by the same group that maintains the other parts of the database. The use of efficient indexes on both primary and foreign keys can dramatically improve query performance. This is because B-tree indexes result in query times proportional to log(n), where n is the number of rows in a table, and hash indexes result in constant-time queries (no size dependency as long as the relevant part of the index fits into memory).
Relational operations
Queries made against the relational database, and the derived relvars in the database, are expressed in a relational calculus or a relational algebra. In his original relational algebra, Codd introduced eight relational operators in two groups of four operators each. The first four operators were based on the traditional mathematical set operations:
- The union operator (∪) combines the tuples of two relations and removes all duplicate tuples from the result. The relational union operator is equivalent to the SQL UNION operator.
- The intersection operator (∩) produces the set of tuples that two relations share in common. Intersection is implemented in SQL in the form of the INTERSECT operator.
- The set difference operator (-) acts on two relations and produces the set of tuples from the first relation that do not exist in the second relation. Difference is implemented in SQL in the form of the EXCEPT or MINUS operator.
- The cartesian product (×) of two relations is a join that is not restricted by any criteria, resulting in every tuple of the first relation being matched with every tuple of the second relation. The cartesian product is implemented in SQL as the CROSS JOIN operator.
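The four set-based operators above can be sketched directly in SQL. This is an illustrative SQLite example (SQLite spells set difference EXCEPT; MINUS is the Oracle spelling); the relations r and s are assumptions for the example.

```python
import sqlite3

# Set operators over two single-column relations r = {1,2,3}, s = {2,3,4}.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE r (x INTEGER);
    CREATE TABLE s (x INTEGER);
    INSERT INTO r VALUES (1), (2), (3);
    INSERT INTO s VALUES (2), (3), (4);
""")
q = lambda sql: [row[0] for row in conn.execute(sql)]
union_rs = q("SELECT x FROM r UNION SELECT x FROM s ORDER BY x")      # [1, 2, 3, 4]
inter_rs = q("SELECT x FROM r INTERSECT SELECT x FROM s ORDER BY x")  # [2, 3]
diff_rs  = q("SELECT x FROM r EXCEPT SELECT x FROM s ORDER BY x")     # [1]
cross_n  = conn.execute("SELECT COUNT(*) FROM r CROSS JOIN s").fetchone()[0]
print(union_rs, inter_rs, diff_rs, cross_n)  # every r-tuple pairs with every s-tuple: 3 * 3 = 9
```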
The remaining operators proposed by Codd involve special operations specific to relational databases:
- The selection, or restriction, operation (σ) retrieves tuples from a relation, limiting the results to only those that meet a specific criterion, i.e. a subset in terms of set theory. The SQL equivalent of selection is the SELECT query statement with a WHERE clause.
- The projection operation (π) extracts only the specified attributes from a tuple or set of tuples.
- The join operation defined for relational databases is often referred to as a natural join (⋈). In this type of join, two relations are connected by their common attributes. In SQL, this is approximated by the INNER JOIN operator, which prevents a cartesian product from occurring between the two tables in a query. For each table added to an SQL query, one additional INNER JOIN is added to prevent a cartesian product; thus, for N tables in an SQL query, there must be N−1 inner joins.
- The relational division (÷) operation is a slightly more complex operation and essentially involves using the tuples of one relation (the dividend) to partition a second relation (the divisor). The relational division operator is effectively the opposite of the cartesian product operator (hence the name).
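The selection, projection, and join operators above map directly onto SQL clauses: selection is a WHERE clause, projection is the column list, and a natural-style join is an INNER JOIN on the shared attribute. The following SQLite sketch illustrates this; the product/stock schema is an assumption for the example.

```python
import sqlite3

# sigma (selection), pi (projection), and join expressed in SQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT, price INTEGER);
    CREATE TABLE stock   (product_id INTEGER, quantity INTEGER);
    INSERT INTO product VALUES (1, 'chair', 40), (2, 'desk', 150);
    INSERT INTO stock   VALUES (1, 12), (2, 3);
""")
selection  = conn.execute("SELECT * FROM product WHERE price < 100").fetchall()  # sigma: filter tuples
projection = conn.execute("SELECT name FROM product ORDER BY name").fetchall()   # pi: keep named attributes
joined     = conn.execute("""
    SELECT product.name, stock.quantity
    FROM product INNER JOIN stock ON product.product_id = stock.product_id
    ORDER BY product.name
""").fetchall()                                                                  # join on the common attribute
print(selection)   # [(1, 'chair', 40)]
print(projection)  # [('chair',), ('desk',)]
print(joined)      # [('chair', 12), ('desk', 3)]
```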
Other operators have been introduced or proposed since Codd's introduction of the original eight including relational comparison operators and extensions that offer support for nesting and hierarchical data, among others.
Normalization
Normalization was first proposed by Codd as an integral part of the relational model. It encompasses a set of procedures designed to eliminate non-simple domains (non-atomic values) and the redundancy (duplication) of data, which in turn prevents data manipulation anomalies and loss of data integrity. The most common forms of normalization applied to databases are called the normal forms.
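The redundancy elimination described above can be sketched by splitting a table that repeats customer details on every order into two relations linked by a foreign key. This is an illustrative SQLite example under assumed names (order_flat, customer, order); it is a sketch of the idea, not a full normal-form treatment.

```python
import sqlite3

# Normalization sketch: store each customer fact once, then recover the
# flat view with a join (so no information is lost by the split).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Unnormalized: the customer's city is repeated on every order row.
    CREATE TABLE order_flat (order_id INTEGER, customer TEXT, city TEXT, item TEXT);
    INSERT INTO order_flat VALUES
        (1, 'Lee', 'Austin', 'chair'),
        (2, 'Lee', 'Austin', 'desk');
    -- Normalized: customer facts are stored exactly once.
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE "order"  (order_id INTEGER PRIMARY KEY,
                           customer_id INTEGER REFERENCES customer(customer_id),
                           item TEXT);
    INSERT INTO customer VALUES (1, 'Lee', 'Austin');
    INSERT INTO "order"  VALUES (1, 1, 'chair'), (2, 1, 'desk');
""")
rows = conn.execute("""
    SELECT o.order_id, c.name, c.city, o.item
    FROM "order" AS o JOIN customer AS c ON c.customer_id = o.customer_id
    ORDER BY o.order_id
""").fetchall()
print(rows)  # [(1, 'Lee', 'Austin', 'chair'), (2, 'Lee', 'Austin', 'desk')]
```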
RDBMS
Connolly and Begg define database management system (DBMS) as a "software system that enables users to define, create, maintain and control access to the database".[23] RDBMS is an extension of that initialism that is sometimes used when the underlying database is relational.
An alternative definition for a relational database management system is a database management system (DBMS) based on the relational model. Most databases in widespread use today are based on this model.[24]
RDBMSs have been a common option for the storage of information in databases used for financial records, manufacturing and logistical information, personnel data, and other applications since the 1980s. Relational databases have often replaced legacy hierarchical databases and network databases, because RDBMSs were easier to implement and administer. Nonetheless, relational databases faced continued, unsuccessful challenges from object database management systems in the 1980s and 1990s (which were introduced in an attempt to address the so-called object–relational impedance mismatch between relational databases and object-oriented application programs), as well as from XML database management systems in the 1990s.[25] However, due to the expansion of technologies such as horizontal scaling of computer clusters, NoSQL databases have more recently become popular as an alternative to RDBMSs.[26]
Distributed relational databases
Distributed Relational Database Architecture (DRDA) was designed by a workgroup within IBM in the period 1988 to 1994. DRDA enables network connected relational databases to cooperate to fulfill SQL requests.[27][28] The messages, protocols, and structural components of DRDA are defined by the Distributed Data Management Architecture.
List of database engines
According to DB-Engines, in December 2024 the most popular relational database management systems were:[29]
- Oracle Database
- MySQL
- Microsoft SQL Server
- PostgreSQL
- Snowflake
- IBM Db2
- SQLite
- Microsoft Access
- Databricks
- MariaDB
According to research company Gartner, in 2011, the five leading proprietary software relational database vendors by revenue were Oracle (48.8%), IBM (20.2%), Microsoft (17.0%), SAP including Sybase (4.6%), and Teradata (3.7%).[30]
See also
- Comparison of relational database management systems
- Database schema
- Datalog
- Data warehouse
- List of relational database management systems
- Object database (OODBMS)
- Online analytical processing (OLAP) and ROLAP (Relational Online Analytical Processing)
- Relational transducer
- Snowflake schema
- SQL
- Star schema
References
- ^ Hastings, Jordan (2003). Portable Software Tools for Managing and Referencing Taxonomies. Digital Mapping Techniques '03 Workshop Proceedings. Vol. U.S. Geological Survey Open-File Report 03–471. 2. Relational Database Technology and Taxonomic Representation. Archived from the original on 2014-10-21. Retrieved 2024-04-06 – via United States Geological Survey.
- ^ a b c d Codd, E. F. (1970). "A Relational Model of Data for Large Shared Data Banks". Communications of the ACM. 13 (6): 377–387. doi:10.1145/362384.362685. S2CID 207549016.
- ^ Ambler, Scott (21 March 2023). "Relational Databases 101: Looking at the Whole Picture".[better source needed]
- ^ Date, Chris (5 May 2005). Database in depth: relational theory for practitioners. O'Reilly. ISBN 0-596-10012-4.
- ^ Funding a Revolution: Government Support for Computing Research. National Academies Press. 8 Jan 1999. ISBN 0309062780.
- ^ Sumathi, S.; Esakkirajan, S. (13 Feb 2008). Fundamentals of Relational Database Management Systems. Springer. ISBN 978-3540483977. "The product was called SQL/DS (Structured Query Language/Data Store) and ran under the DOS/VSE operating system environment."
- ^ Leavitt, Don (6 September 1976). "'MDBM' Backs Network, Relational Approaches". Computerworld. IDG Enterprise.
- ^ Schmidt, Joachim W.; Brodie, Michael L. (1982). Relational Database Systems: Analysis and Comparison. Springer Science & Business Media. pp. 181–220. ISBN 978-3-642-68847-8.
- ^ "Oracle Timeline" (PDF). Profit Magazine. 12 (2). Oracle: 26. May 2007. Retrieved 2013-05-16.
- ^ "New Database Software Program Moves Macintosh Into The Big Leagues". tribunedigital-chicagotribune. 28 June 1987. Retrieved 2016-03-17.
- ^ Hershey, W.R.; Easthope, C.H. (1 December 1972). "A set theoretic data structure and retrieval language". ACM SIGIR Forum. 7 (4). Association for Computing Machinery: 45–55. doi:10.1145/1095495.1095500. Retrieved 4 January 2024.
- ^ SIGFIDET '74: Proceedings of the 1974 ACM SIGFIDET (Now SIGMOD) Workshop on Data Description, Access and Control: Data Models: Data-Structure-Set versus Relational. Association for Computing Machinery. 1 January 1975. doi:10.1145/800297. ISBN 978-1-4503-7418-7. Retrieved 4 January 2024.
- ^ Notley, M.G. (1972). The Peterlee IS/1 System. IBM United Kingdom Scientific Centre. Retrieved 4 January 2024.
- ^ Todd, Stephen (1976). "The Peterlee Relational Test Vehicle - A System Overview". IBM Systems Journal. 15 (4): 285–308. doi:10.1147/sj.154.0285.
- ^ Ramakrishnan, Raghu; Donjerkovic, Donko; Ranganathan, Arvind; Beyer, Kevin S.; Krishnaprasad, Muralidhar (1998). "SRQL: Sorted Relational Query Language" (PDF). E Proceedings of SSDBM.
- ^ "A Relational Database Overview". oracle.com.
- ^ "A universal relation model for a nested database", The Nested Universal Relation Database Model, Lecture Notes in Computer Science, vol. 595, Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 109–135, 1992, doi:10.1007/3-540-55493-9_5, ISBN 978-3-540-55493-6, retrieved 2020-11-01
- ^ "Gray to be Honored With A. M. Turing Award This Spring". Microsoft PressPass. 1998-11-23. Archived from the original on 6 February 2009. Retrieved 2009-01-16.
- ^ Gray, Jim (September 1981). "The Transaction Concept: Virtues and Limitations" (PDF). Proceedings of the 7th International Conference on Very Large Databases. Cupertino, CA: Tandem Computers. pp. 144–154. Retrieved 2006-11-09.
- ^ Gray, Jim, and Reuter, Andreas, Distributed Transaction Processing: Concepts and Techniques. Morgan Kaufmann, 1993. ISBN 1-55860-190-2.
- ^ Date (1984), p. 268.
- ^ Connolly, Thomas M; Begg, Carolyn E (2015). Database systems: a practical approach to design, implementation, and management (global ed.). Boston Columbus Indianapolis: Pearson. p. 416. ISBN 978-1-292-06118-4.
- ^ Connolly, Thomas M.; Begg, Carolyn E. (2014). Database Systems – A Practical Approach to Design Implementation and Management (6th ed.). Pearson. p. 64. ISBN 978-1292061184.
- ^ Pratt, Philip J.; Last, Mary Z. (2014-09-08). Concepts of Database Management (8 ed.). Course Technology. p. 29. ISBN 9781285427102.
- ^ Feuerlich, George (21 April 2010). Dateso 10; Database Trends and Directions: Current Challenges and Opportunities (1st ed.). Prague, Sokolovsk: MATFYZPRESS. pp. 163–174. ISBN 978-80-7378-116-3.
- ^ "NoSQL databases eat into the relational database market". 4 March 2015. Retrieved 2018-03-14.
- ^ Reinsch, R. (1988). "Distributed database for SAA". IBM Systems Journal. 27 (3): 362–389. doi:10.1147/sj.273.0362.
- ^ Distributed Relational Database Architecture Reference. IBM Corp. SC26-4651-0. 1990.
- ^ "DB-Engines Ranking of Relational DBMS". DB-Engines. Retrieved 2024-12-01.
- ^ "Oracle the clear leader in $24 billion RDBMS market". Eye on Oracle. 2012-04-12. Archived from the original on 2014-01-08. Retrieved 2013-03-01.
Sources
- Date, C. J. (1984). A Guide to DB2 (student ed.). Addison-Wesley. ISBN 0201113171. OCLC 256383726. OL 2838595M.
Overview
Definition and Principles
A relational database is a type of database management system that organizes data into relations, which are tabular structures consisting of rows (tuples) and columns (attributes), adhering to the relational model introduced by Edgar F. Codd in 1970.[3] This model represents data as sets of relations where each relation captures entities and their associations through shared attributes, enabling efficient storage, retrieval, and manipulation without reliance on physical storage details or navigational paths.[3]

The foundational principles of relational databases emphasize data independence, integrity, and declarative querying. Logical data independence ensures that changes to the conceptual schema, such as adding new relations, do not affect application programs, while physical data independence shields users from alterations in storage structures or access methods.[3] Data integrity is enforced through constraints like primary keys, which uniquely identify each tuple in a relation, and referential integrity rules that maintain consistency across relations.[3] Querying relies on set-based operations—such as selection, projection, and join—which treat data as mathematical sets, allowing users to specify what data is needed without detailing how to retrieve it.[3]

In contrast to earlier hierarchical and network models, which organize data in tree-like or graph structures requiring explicit navigation along predefined paths, the relational model uses a flat, tabular format that promotes simplicity and flexibility.[3] Hierarchical models limit relationships to parent-child hierarchies, while network models (like CODASYL) allow more complex linkages but often lead to access dependencies and redundancy; the relational approach avoids these by providing a declarative, high-level interface that abstracts away implementation details.[3]

Applications and Advantages
Relational databases find extensive application across diverse sectors due to their ability to manage structured data efficiently. In business environments, they underpin customer relationship management (CRM) and enterprise resource planning (ERP) systems, such as SAP, which use them to handle structured processes like customer interactions, inventory tracking, and supply chain operations.[8] Financial services leverage relational databases for secure transaction processing, account management, and compliance reporting in banking and payment systems, where data integrity is paramount.[9] Web applications, including e-commerce platforms like those built with MySQL or PostgreSQL, rely on them to store and retrieve user profiles, product catalogs, and order histories through relational links.[10] Additionally, managed cloud services such as Amazon RDS and Google Cloud SQL offer relational database capabilities with simplified administration and support for engines including MySQL and PostgreSQL. 
In the no-code/low-code space, platforms like Airtable provide relational capabilities through linked records and a spreadsheet-like interface, enabling users to manage relational data without traditional programming.[6][7][11] In scientific data management, relational databases organize experimental results, patient records, and research metadata, as seen in healthcare studies where they enable consistent querying and analysis of structured datasets.[12]

A primary advantage of relational databases is their adherence to ACID properties—atomicity, consistency, isolation, and durability—which ensure reliable and predictable transaction handling, minimizing errors in critical operations like financial transfers.[1] The use of standardized SQL provides portability, allowing queries and schemas to transfer seamlessly between systems, while supporting complex operations such as joins to relate data across tables and aggregations for analytical insights.[9] For structured data, they offer scalability through techniques like indexing, sharding, and vertical scaling, enabling growth in enterprise settings without compromising performance.[13] Despite these strengths, relational databases have limitations when applied to certain data types; they excel with structured information but are less suitable for unstructured data, such as multimedia or semi-structured formats, due to rigid schemas that require predefined structures.[14] Similarly, in high-velocity environments involving real-time streams, their emphasis on ACID compliance can introduce overhead, potentially slowing ingestion compared to more flexible alternatives.[9]

As of 2025, relational databases power the majority of enterprise applications, with the global market projected to reach $82.95 billion, reflecting their enduring dominance in structured data environments.[15] Systems like MySQL and PostgreSQL alone account for over 57% of developer usage in surveys, highlighting their widespread adoption.[16]

History
Origins and Theoretical Foundations
The relational model originated from the work of Edgar F. Codd, a mathematician at IBM's San Jose Research Laboratory, who began developing the concept in 1969 while addressing challenges in managing large-scale data systems.[17] In June 1970, Codd published his seminal paper, "A Relational Model of Data for Large Shared Data Banks," in Communications of the ACM, introducing a data organization based on mathematical relations to enable shared access to extensive formatted data banks without exposing users to underlying storage details.[3] This work was motivated by the limitations of prevailing navigational database systems, such as hierarchical (tree-structured) and network models like IBM's IMS, which enforced rigid physical linkages, ordering, and access paths that made data retrieval inflexible and dependent on specific program knowledge of the data structure.[18] Codd argued that these systems led to program failures when data organization changed, as applications had to navigate predefined paths, resulting in high maintenance costs and inefficiency for large shared environments.[4] Codd's model sought to overcome these issues through key theoretical motivations, including the elimination of data redundancy to prevent inconsistencies and wasted storage, the assurance of data independence so that changes in physical representation did not affect user queries or applications, and the representation of data using predicate logic for precise, declarative querying.[18] By treating data as relations—n-ary sets of tuples—he proposed a "universal data sublanguage" grounded in first-order predicate calculus, allowing users to specify what data they needed rather than how to retrieve it, thus insulating applications from structural modifications.[18] This approach addressed redundancy by defining a normal form where relations minimized derivable projections, ensuring that data could be reconstructed without duplication while maintaining logical consistency.[18] The 
theoretical foundations drew heavily from mathematical disciplines, particularly set theory for modeling relations as mathematical sets and first-order logic for query formulation and integrity enforcement.[19] Codd's innovations at IBM built on these principles to create a framework that prioritized user protection from data organization details, stating that "future users of large data banks must be insulated from any changes in the structure of data which are made possible by improvements in base hardware and software technology."[18] To further refine the criteria for true relational systems, Codd later proposed his 12 rules (often counted as 13, including Rule 0: the foundation rule) in 1985, outlining essential properties for a relational database management system, such as guaranteed access via logical addressing and support for view updating.[20]
Commercial Development and Adoption
The development of commercial relational database management systems (RDBMS) began in the late 1970s, transitioning from research prototypes to market-ready products. IBM's System R, initiated in 1974 as an internal research project, served as a key prototype that demonstrated the feasibility of relational databases using SQL as the query language, influencing subsequent commercial efforts.[4] In 1979, Oracle Corporation released the first commercially available RDBMS, initially known as Oracle Version 2, which ran on Digital Equipment Corporation hardware and marked a pivotal shift toward enterprise adoption by offering structured query capabilities for business applications.[21] IBM followed with SQL/DS in 1981, targeted at mainframe environments, and later DB2 in 1983, which became a cornerstone for large-scale data processing in corporations.[22] Standardization efforts solidified the relational model's commercial viability. In 1986, the American National Standards Institute (ANSI) published the first SQL standard (SQL-86, or ANSI X3.135), establishing a common syntax for querying relational databases and promoting interoperability across vendors.[23] This was adopted internationally by the ISO in 1987 as SQL-87, with subsequent revisions—such as SQL-92 for enhanced integrity and SQL:1999 for object-relational features—evolving the standard to address growing complexities in data management, culminating in SQL:2023, which includes support for JSON and property graphs.[24] The 1980s saw an enterprise boom in RDBMS adoption, driven by the need for reliable data handling in sectors like finance and manufacturing, with products from Oracle, IBM, and others powering transaction processing systems.[25] In the 1990s, open-source alternatives accelerated widespread use: MySQL was first released in 1995, gaining popularity for web applications due to its simplicity and performance, while PostgreSQL emerged in 1996 from the academic Postgres project, offering advanced 
features like extensibility for complex queries.[26] The 2000s integrated relational databases with cloud computing, enabling scalable deployments through services like Amazon RDS (launched in 2009), which facilitated on-demand access and reduced infrastructure costs for businesses.[27] By 2025, relational databases maintain dominance in the DBMS market, accounting for approximately 64% of revenue in 2023 according to industry analyses, underscoring their enduring role in handling structured data amid the rise of hybrid environments.[28]
Relational Model
Core Concepts and Terminology
In the relational model, a relation is defined as a finite set of ordered n-tuples, where each tuple consists of values drawn from specified domains, mathematically equivalent to a subset of the Cartesian product of those n domains.[3] This structure is typically represented as a table, with no inherent ordering among the tuples or within the attributes, ensuring that the relation remains a set without duplicates.[3] A tuple corresponds to a single row in the relation, forming an n-tuple where the i-th component belongs to the i-th domain.[3] Each attribute represents a column in the relation, named and associated with a specific domain that defines the allowable values for that position across all tuples.[3] The degree of a relation is the number of attributes (n), while its cardinality is the number of distinct tuples it contains.[3] The relational model distinguishes between the schema, which defines the logical structure including relations, attributes, and their domains, and the instance, which is the actual collection of tuples at any given time.[3] Data storage relies on value-based representation, where relationships between data are established solely through shared attribute values rather than physical pointers, ordering, or hierarchical links, promoting data independence.[3] To handle missing or inapplicable information, E. F. Codd later extended the model to include null values, which represent either "value at present unknown" or "property inapplicable," distinct from empty strings or zeros, and integrated into a three-valued logic for queries.[29]
Mathematical Basis
The relational model is grounded in set theory and first-order predicate logic, providing a formal framework for data representation and manipulation. At its core, a domain is defined as a set of atomic values that can be assigned to attributes, ensuring type consistency across the database.[3] A relation R of degree n is formally a subset of the Cartesian product of n domains, expressed as R ⊆ D1 × D2 × ⋯ × Dn, where each element of R corresponds to a valid combination of values from these domains.[3] A tuple in this model is a finite, ordered sequence of values (d1, d2, …, dn), with the i-th value di drawn from domain Di, representing a single record or fact.[3] The relation itself is a set of such tuples, inherently enforcing uniqueness since sets do not permit duplicates, which eliminates redundancy at the mathematical level.[3] Formally, a relation comprises a heading and a body: the heading specifies the attribute names paired with their respective domains (e.g., {A1 : D1, A2 : D2, …, An : Dn}), defining the structure, while the body is the finite set of tuples populating that structure at any given time.[3] The foundation in predicate logic enables queries to be formulated as logical predicates applied over relations, allowing declarative expressions of data retrieval and manipulation in terms of first-order logic statements.[3] This approach, rooted in an applied predicate calculus, supports relational completeness, where any query expressible in first-order logic can be represented within the model.[3]
Data Organization
Relations, Tuples, and Attributes
In the relational model, a relation is conceptualized as a table that organizes data into rows and columns, where the columns represent attributes defining the characteristics of the stored entities, and the rows, known as tuples, capture individual instances or records of those entities.[3] This structure ensures that data is stored in a declarative manner, independent of physical implementation details, allowing users to interact with it through logical representations.[30] To illustrate, consider a simple employee relation named Employees with three attributes: EmployeeID (an integer identifier), Name (a string for the employee's full name), and Department (a string indicating the work unit). Each tuple in this relation would consist of a unique combination of values for these attributes, such as (101, "Alice Johnson", "Engineering"), representing one employee's details without implying any order among the tuples.[31] This tabular format facilitates straightforward comprehension and manipulation of data relationships. Relations are categorized into base relations, which store the actual persistent data in the database (often called base tables in SQL implementations), and derived relations, such as views, which are virtual and computed dynamically from queries on base relations or other views without storing data separately.[3][30] Base relations form the foundational storage, while derived ones provide flexible, on-demand perspectives of the data. Each attribute in a relation must hold only atomic (indivisible, simple) values from its defined domain, prohibiting nested structures like lists or sets within a single cell to maintain the model's simplicity and ensure first normal form compliance.[3] This atomicity requirement, under which domains specify the allowable value types (e.g., integers or strings), supports efficient querying and integrity.[32]
Domains and Schemas
In the relational model, a domain represents the set of permissible atomic values from which the values of a specific attribute are drawn, ensuring data consistency and type safety across relations. This concept, introduced by E.F. Codd, defines domains as finite or infinite sets of values, such as the domain of integers for numeric attributes or strings for textual ones, preventing invalid entries like non-numeric values in an age field. For instance, the domain for an employee's age attribute might be restricted to integers between 18 and 65, limiting values to that range while excluding extraneous data like negative numbers or decimals.[3][18] A schema in a relational database outlines the structural blueprint, comprising relation schemas that specify the attributes of each table along with their associated domains, and the overall database schema as the integrated collection of these relation schemas, including definitions for views, indexes, and constraints where applicable. Relation schemas thus serve as the foundational descriptors, naming the table and mapping each attribute to its domain, while the database schema provides a holistic view of inter-table organization without delving into data instances. This separation allows for abstract design independent of physical storage, facilitating maintenance and scalability in large systems.[33][34] Modern relational database management systems (RDBMS) implement domains through type systems, offering built-in data types such as INTEGER for whole numbers, VARCHAR for variable-length strings, and DATE for temporal values, which align with the abstract domains of the relational model by enforcing value ranges and formats at the storage level. Users can extend these with user-defined domains, created via SQL statements like CREATE DOMAIN, which base a new type on an existing one while adding custom constraints, such as CHECK conditions to validate specific rules beyond standard types. 
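Not every engine implements CREATE DOMAIN (SQLite, used below through Python's sqlite3 module, does not), but the age restriction described above can be approximated with a column-level CHECK constraint. A minimal sketch, with table and column names invented for illustration:

```python
import sqlite3

# SQLite lacks CREATE DOMAIN, so the "age domain" (integers 18-65)
# is emulated here with a CHECK constraint on the column.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employees (
        employee_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        age         INTEGER CHECK (age BETWEEN 18 AND 65)
    )
""")

conn.execute("INSERT INTO employees VALUES (1, 'Alice Johnson', 34)")  # in-domain: accepted

try:
    conn.execute("INSERT INTO employees VALUES (2, 'Bob Smith', -5)")  # outside the domain
except sqlite3.IntegrityError as e:
    print("rejected:", e)  # the CHECK constraint blocks the insert
```

Engines that do support CREATE DOMAIN (PostgreSQL, for instance) let the same rule be named once and reused across tables instead of being repeated per column.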
For example, a user-defined domain for currency might build on DECIMAL with a scale of two decimal places and a non-negative constraint, promoting reusability across attributes.[35][36][37] Schema evolution addresses the need to modify these structures over time in response to changing requirements, involving operations like adding or dropping attributes, altering domains, or renaming relations, often managed through versioning to track historical states and automate migrations. In practice, tools and protocols enable forward and backward compatibility, allowing applications to query evolving schemas without data loss, as demonstrated in industrial case studies where schema changes were applied incrementally to minimize downtime in production environments. This process underscores the relational model's flexibility, though it requires careful planning to preserve integrity during transitions.[38][39][40]
Integrity Mechanisms
Keys and Relationships
In the relational model, keys are essential attributes or sets of attributes that ensure uniqueness within a relation and facilitate connections between relations. A superkey is any set of one or more attributes that uniquely identifies each tuple in a relation, allowing no two tuples to share the same values for that set. A candidate key is a minimal superkey, meaning no proper subset of its attributes is itself a superkey; multiple candidate keys may exist for a given relation, such as both employee ID and a combination of name and birthdate uniquely identifying an employee. The primary key is the candidate key selected to serve as the unique identifier for tuples in the relation, with the choice often guided by factors like simplicity and stability; for instance, in an employee relation, employee ID might be chosen as the primary key over a composite of name and address.[3] A foreign key is an attribute or set of attributes in one relation that matches the primary key (or a candidate key) of another relation, establishing a link between them without duplicating data.[3] For example, in a department relation with department ID as the primary key, an employee relation might include department ID as a foreign key to indicate which department each employee belongs to. Foreign keys enable the relational model to represent associations between entities while preserving data integrity through referential constraints, ensuring that referenced values exist in the target relation.[3] Keys define the types of relationships between relations, which describe how tuples in one relation correspond to those in another. A one-to-one relationship occurs when each tuple in one relation is associated with at most one tuple in another, and vice versa; this can be implemented by placing the primary key of one relation as a foreign key in the other, often with mutual foreign keys or by merging relations if appropriate. 
For instance, a person relation might have a one-to-one link to a passport relation, where passport number serves as both primary and foreign key. A one-to-many relationship exists when each tuple in one relation (the "one" side) can be associated with zero, one, or multiple tuples in another (the "many" side), but each tuple on the "many" side links to at most one on the "one" side; this is typically realized by placing a foreign key in the "many" relation that references the primary key of the "one" relation. In a classic example, a department relation (one side) relates to an employee relation (many side), where employees' department IDs as foreign keys point to the department's primary key, allowing one department to have multiple employees but each employee to belong to only one department. A many-to-many relationship arises when tuples in one relation can associate with multiple tuples in another, and vice versa; direct implementation is avoided to prevent redundancy, instead using a junction (or associative) relation that contains foreign keys referencing the primary keys of both original relations, effectively decomposing the many-to-many into two one-to-many relationships. For example, a student relation and a course relation might connect via an enrollment junction relation with student ID and course ID as foreign keys, capturing multiple enrollments per student and multiple students per course. This structure supports efficient querying and updates while maintaining normalization principles.[3]

| Key Type | Definition | Example in Employee Relation |
|---|---|---|
| Superkey | Set of attributes uniquely identifying tuples (may include extras) | {EmployeeID, Name, Address} |
| Candidate Key | Minimal superkey (no subset is a superkey) | {EmployeeID}, {SSN} |
| Primary Key | Selected candidate key for unique identification | EmployeeID |
| Foreign Key | References primary key of another relation | DepartmentID (referencing Departments table) |
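The key roles summarized in the table can be sketched with SQLite through Python's sqlite3 module; the table and column names are illustrative, and note that SQLite enforces foreign keys only after PRAGMA foreign_keys is switched on:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite disables FK checks by default

conn.execute("""
    CREATE TABLE departments (
        department_id INTEGER PRIMARY KEY,   -- primary key of the "one" side
        name          TEXT NOT NULL UNIQUE   -- an alternate candidate key, kept UNIQUE
    )
""")
conn.execute("""
    CREATE TABLE employees (
        employee_id   INTEGER PRIMARY KEY,
        name          TEXT NOT NULL,
        department_id INTEGER REFERENCES departments(department_id)  -- foreign key
    )
""")

conn.execute("INSERT INTO departments VALUES (10, 'Engineering')")
conn.execute("INSERT INTO employees VALUES (101, 'Alice Johnson', 10)")  # valid reference

try:
    conn.execute("INSERT INTO employees VALUES (102, 'Bob Smith', 99)")  # no department 99
except sqlite3.IntegrityError as e:
    print("rejected:", e)  # referential integrity blocks the orphan row
```

This is the one-to-many pattern from the text: the foreign key lives on the "many" side (employees) and must match an existing department or be null.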
Constraints and Referential Integrity
In relational databases, constraints are rules enforced on data to maintain accuracy, consistency, and validity across relations. These mechanisms prevent invalid states by restricting operations that would violate predefined conditions, such as insertions, updates, or deletions that introduce inconsistencies.[41] Entity integrity is a fundamental constraint ensuring that the primary key of every tuple in a relation is neither null nor contains duplicate values, thereby guaranteeing that each entity can be uniquely identified without ambiguity. This rule applies specifically to primary key attributes, prohibiting nulls to uphold the relational model's requirement for identifiable records.[42][43] Referential integrity maintains consistency between related relations by requiring that the value of a foreign key in one relation either matches an existing primary key value in the referenced relation or is null, thus avoiding orphaned records or invalid references. Violations of this constraint occur during operations like deleting a referenced primary key or updating a foreign key to an unmatched value. To handle such violations, database systems support actions including RESTRICT, which blocks the operation if it would break the reference; CASCADE, which propagates the delete or update to dependent foreign keys; SET NULL, which sets the foreign key to null; or SET DEFAULT, which assigns a default value, depending on the system's implementation.[44][45][46] Check constraints enforce custom business rules on attribute values within a relation, such as ensuring an employee's age is greater than 18 or a salary exceeds a minimum threshold, by evaluating a Boolean expression during data modification. 
These constraints are declarative, specified at the table level, and apply to single or multiple columns, rejecting operations that fail the condition to preserve semantic correctness.[47][48] Unique constraints extend beyond primary keys by ensuring that values in one or more columns are distinct across all tuples in a relation, allowing null values (unlike primary keys) to support alternate unique identifiers, such as email addresses in a user table. This prevents duplicates in non-primary attributes while permitting flexibility for optional uniqueness requirements.[49][47]
Querying and Manipulation
Relational Algebra Operations
Relational algebra provides a procedural framework for querying and manipulating relations in the relational model, where each operation takes one or more relations as input and yields a new relation as output. Introduced by Edgar F. Codd in 1970, it emphasizes set-theoretic foundations to ensure data independence and structured manipulation.[3] The operations are designed to be composable, forming a closed system that maintains relational integrity throughout computations.[3] The fundamental operations, often termed primitive, encompass selection, projection, union, set difference, Cartesian product, and rename. These primitives enable the expression of basic data retrieval and combination tasks. Selection, symbolized as σp(R), filters tuples from a relation R that satisfy a predicate p, defined formally as σp(R) = {t ∈ R | p(t)}, where p involves comparisons like equality or inequality and logical connectives.[50] Projection, denoted πA1,…,Ak(R), extracts specified attributes from R while eliminating duplicates to ensure the result remains a relation, expressed as πA1,…,Ak(R) = {t[A1,…,Ak] | t ∈ R}, with A1 to Ak as the chosen attributes.[50] Union, indicated by R ∪ S, merges tuples from two type-compatible relations R and S (same arity and corresponding domains), yielding R ∪ S = {t | t ∈ R or t ∈ S}, with duplicates removed.[50] Set difference, using R − S, identifies tuples unique to R relative to S: R − S = {t | t ∈ R and t ∉ S}, applicable only to compatible relations.[50] Cartesian product, R × S, generates all possible pairings of tuples from R and S: R × S = {(r, s) | r ∈ R and s ∈ S}, assuming attribute names are distinct or renamed if overlapping.[50] Rename, ρ, reassigns names to relations or attributes, such as ρX(R) to designate R as X, facilitating composition without name conflicts.[50]

Derived operations build upon the primitives to handle common relational tasks more directly, including join, intersection, and division. Natural join, R ⋈ S, links R and S on matching values of shared attributes, equivalent to a theta join (generalized condition θ) restricted to equality, and formally R ⋈ S = πL(σθ(R × S)), where σθ enforces equality on common attributes and πL selects output attributes.[50] Theta join extends this to arbitrary conditions in θ, such as inequalities. Intersection, R ∩ S, retrieves shared tuples: R ∩ S = {t | t ∈ R and t ∈ S}, derivable as R − (R − S), requiring compatibility.[50] Division, R ÷ S, identifies attribute values in the projection of R (excluding S's attributes) that associate with every tuple in S: useful for queries like "all parts supplied by every supplier."[50] These operations exhibit closure: any composition results in a valid relation, enabling the construction of arbitrary query expressions through nesting and sequencing. This expressiveness allows relational algebra to represent all information-retrieval requests expressible in the model, serving as the theoretical core for languages like SQL.[3][50]
SQL as the Standard Language
SQL, or Structured Query Language, emerged as the declarative language for managing and querying relational databases, providing a standardized interface that translates relational algebra concepts into practical syntax for data operations. Developed in 1974 by Donald D. Chamberlin and Raymond F. Boyce as SEQUEL (Structured English QUEry Language) for IBM's System R research project, it was designed to demonstrate the viability of Edgar F. Codd's relational model in a prototype database system.[51] The language was later shortened to SQL due to trademark issues and evolved through System R's phases, unifying data definition, manipulation, and view mechanisms by 1976.[51] This foundation enabled SQL to become the de facto standard, influencing commercial systems like IBM's SQL/DS and DB2. SQL is categorized into sublanguages that handle distinct aspects of database interaction. Data Definition Language (DDL) includes commands like CREATE and ALTER to define and modify database structures such as tables and schemas, while DROP removes them.[52] Data Manipulation Language (DML) encompasses SELECT for querying data, INSERT for adding rows, UPDATE for modifying existing data, and DELETE for removing rows.[52] Data Control Language (DCL) manages access with GRANT to assign privileges and REVOKE to withdraw them, ensuring security over database objects.[53] At the heart of SQL lies the SELECT statement, which retrieves data from one or more tables using a structured syntax that supports complex filtering and aggregation. 
The basic form is SELECT column_list FROM table_list [WHERE condition] [GROUP BY columns] [HAVING condition] [ORDER BY columns], where FROM specifies the source tables, WHERE filters rows before grouping, GROUP BY aggregates data into groups, HAVING applies conditions to groups, and ORDER BY sorts the results.[54] Joins, such as INNER JOIN or LEFT JOIN, combine rows from multiple tables based on related columns, while subqueries—nested SELECT statements—allow embedding queries within clauses like WHERE or FROM for advanced filtering, such as selecting employees with salaries above the departmental average.[54]
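A runnable sketch of these clauses, using SQLite through Python's sqlite3 module with invented sample data, combines an INNER JOIN with the correlated subquery described above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE departments (department_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employees (
        employee_id INTEGER PRIMARY KEY, name TEXT,
        department_id INTEGER REFERENCES departments(department_id),
        salary INTEGER);
    INSERT INTO departments VALUES (1, 'Engineering'), (2, 'Sales');
    INSERT INTO employees VALUES
        (101, 'Alice', 1, 95000), (102, 'Bob', 1, 70000),
        (103, 'Carol', 2, 60000), (104, 'Dan', 2, 80000);
""")

# INNER JOIN plus a correlated subquery: employees earning more than
# their own department's average salary.
query = """
    SELECT e.name, d.name AS department, e.salary
    FROM employees e
    INNER JOIN departments d ON d.department_id = e.department_id
    WHERE e.salary > (SELECT AVG(salary)
                      FROM employees
                      WHERE department_id = e.department_id)
    ORDER BY e.name
"""
rows = conn.execute(query).fetchall()
print(rows)  # [('Alice', 'Engineering', 95000), ('Dan', 'Sales', 80000)]
```

The subquery is re-evaluated per row against the outer row's department_id, which is what makes it correlated rather than a one-shot scalar subquery.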
SQL's standardization began with ANSI's adoption as X3.135 in 1986, followed by ISO as 9075 in 1987, establishing core syntax and semantics across implementations.[23] The ISO/IEC 9075 standard, now in its 2023 edition, comprises nine parts, including SQL/Foundation for core language elements and optional modules like SQL/JSON for document handling; it defines conformance levels such as Core (mandatory features) and Enhanced (vendor extensions).[23] Over time, SQL has evolved from SQL-86's basic relational operations to SQL:1999's introduction of Common Table Expressions (CTEs) for readable subquery reuse and window functions for analytics like ROW_NUMBER() over ordered partitions without collapsing rows.[55] SQL:2016 added JSON support for storing and querying semi-structured data, while SQL:2023 enhances this with native JSON types, scalar functions, and simplified accessors like dot notation for nested objects, alongside improvements to recursive CTEs for handling cycles in hierarchical data.[55] These advancements support modern analytics workloads while maintaining backward compatibility.[23]
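A CTE feeding a window function, as introduced in SQL:1999, can be tried directly in SQLite (version 3.25 or later) via Python's sqlite3 module; the sales table and its contents are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, rep TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('East', 'Alice', 500), ('East', 'Bob', 300),
        ('West', 'Carol', 400), ('West', 'Dan', 700);
""")

# ROW_NUMBER() ranks rows within each region without collapsing them,
# unlike GROUP BY; the CTE makes the ranked result reusable by name.
query = """
    WITH ranked AS (
        SELECT region, rep, amount,
               ROW_NUMBER() OVER (PARTITION BY region
                                  ORDER BY amount DESC) AS rank_in_region
        FROM sales
    )
    SELECT region, rep FROM ranked WHERE rank_in_region = 1
    ORDER BY region
"""
top = conn.execute(query).fetchall()
print(top)  # [('East', 'Alice'), ('West', 'Dan')]
```

Filtering on the window result has to happen outside the window query (here, in the outer SELECT over the CTE), because window functions are evaluated after WHERE.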
Database Design
Normalization Process
The normalization process in relational databases involves systematically decomposing relations into smaller, well-structured components to eliminate data redundancies and dependency anomalies while preserving the information content of the original database. This step-by-step refinement ensures that the database schema adheres to progressively stricter normal forms, based on constraints known as functional dependencies. The goal is to design a schema that minimizes update, insertion, and deletion anomalies, thereby improving data integrity and consistency.[56] Functional dependencies form the foundational constraints in this process. A functional dependency (FD) exists in a relation R when one set of attributes X functionally determines another set Y, denoted as X → Y, meaning that for any two tuples in R that agree on X, they must also agree on Y. This concept was introduced as part of the relational model to capture semantic relationships between attributes.[18] FDs help identify potential redundancies, such as when non-key attributes depend on only part of a composite key, leading to anomalies during data modifications.[57] To infer all implied FDs from a given set, Armstrong's axioms provide a complete set of inference rules. These axioms, developed by William W. Armstrong, include three primary rules: reflexivity, augmentation, and transitivity. Reflexivity states that if Y is a subset of X, then X → Y holds trivially. Augmentation asserts that if X → Y, then for any Z, XZ → YZ. Transitivity implies that if X → Y and Y → Z, then X → Z. Additional derived rules, such as union and decomposition, can be proven from these basics, ensuring soundness and completeness for FD inference.[58] The normalization process progresses through a series of normal forms, each building on the previous to address specific types of dependencies. 
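The inference rules above are usually applied through the attribute-closure algorithm, which repeatedly fires every FD whose left side is already covered until nothing changes; a minimal Python sketch over an invented relation R(A, B, C, D) with A → B and B → C:

```python
# Attribute closure: the standard mechanical consequence of Armstrong's
# axioms, used to test which FDs a given FD set implies.
def closure(attrs, fds):
    """attrs: frozenset of attributes; fds: list of (lhs, rhs) frozenset pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Fire X -> Y whenever X is already in the closure (augmentation
            # and transitivity fall out of repeating this to a fixed point).
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)

fds = [(frozenset('A'), frozenset('B')),   # A -> B
       (frozenset('B'), frozenset('C'))]   # B -> C

print(sorted(closure(frozenset('A'), fds)))  # ['A', 'B', 'C']  (so A -> C by transitivity)
# D is not in {A}+, so A is not a superkey of R(A, B, C, D).
```

X is a superkey exactly when its closure contains every attribute of the relation, which is how normal-form violations are detected in practice.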
First Normal Form (1NF) requires that all attributes in a relation contain atomic (indivisible) values, eliminating repeating groups or multivalued attributes within tuples. This ensures the relation resembles a mathematical table with no nested structures.[57] Second Normal Form (2NF) extends 1NF by requiring that no non-prime attribute (one not part of any candidate key) is partially dependent on any candidate key. In other words, every non-key attribute must depend on the entire candidate key, not just a portion of it. This eliminates partial dependencies, which can cause update anomalies in relations with composite keys. Third Normal Form (3NF) further refines 2NF by prohibiting transitive dependencies, where a non-prime attribute depends on another non-prime attribute rather than directly on a candidate key. A relation is in 3NF if, for every FD X → Y, either X is a superkey or each attribute in Y - X is prime. These forms were formalized to free relations from insertion, update, and deletion dependencies.[56] Boyce-Codd Normal Form (BCNF) imposes a stricter condition than 3NF: for every non-trivial FD X → Y in the relation, X must be a candidate key (a minimal superkey). This addresses cases where 3NF allows determinants that are not superkeys, potentially leading to anomalies in relations with overlapping candidate keys. BCNF ensures every determinant is a candidate key, making it particularly useful for eliminating certain redundancy issues not resolved by 3NF.[57] Higher normal forms target more complex dependencies. Fourth Normal Form (4NF) deals with multivalued dependencies (MVDs), where an attribute set is independent of another but both depend on a common key. A relation is in 4NF if it is in BCNF and has no non-trivial MVDs other than those implied by FDs. This prevents redundancy from independent multivalued facts, such as multiple hobbies per person unrelated to skills. 
MVDs generalize FDs and were defined to capture such scenarios.[59] Fifth Normal Form (5NF), also known as Project-Join Normal Form (PJ/NF), addresses join dependencies, where a relation can be decomposed into projections that can be rejoined without spurious tuples. A relation is in 5NF if it is in 4NF and every join dependency is implied by the candidate keys. This form eliminates anomalies from cyclic dependencies across multiple relations.[60] In practice, the normalization process begins by identifying all relevant FDs (and higher dependencies for advanced forms) using domain knowledge and Armstrong's axioms to compute closures. The schema is then decomposed iteratively: for violations of a target normal form, select an offending FD X → Y, project the relation into R1 = X ∪ Y and R2 = X ∪ (R − Y), and replace the original with these projections. Decompositions must be lossless—meaning the natural join of the projections equals the original relation without spurious tuples—to preserve data. A binary decomposition into R1 and R2 is lossless when the shared attributes functionally determine one of the projections, that is, when R1 ∩ R2 → R1 or R1 ∩ R2 → R2 holds. The process continues until the schema satisfies the desired normal form, balancing integrity with query efficiency.[61]
Denormalization and Performance Considerations
Denormalization involves intentionally introducing redundancy into a relational database schema that has been normalized to higher normal forms, such as third normal form (3NF), to enhance query performance at the expense of storage efficiency and data consistency maintenance.[62] This technique counters the strict elimination of redundancy in normalization by selectively duplicating data, thereby reducing the computational overhead of joins and aggregations during read operations.[63] Common denormalization strategies include creating pre-joined tables, where data from multiple normalized tables is combined into a single table to eliminate runtime joins for frequently queried combinations. For example, in an e-commerce system, customer and order details might be merged into one table to speed up retrieval of order histories.[62] Another approach is implementing summary aggregates, which precompute and store results of common aggregation functions like sums or averages, avoiding repeated calculations on large datasets. This is particularly useful for reporting queries involving totals, such as monthly sales figures stored directly in a denormalized summary table.[63] Clustering, as a design strategy, groups related records physically or logically within tables to minimize data scattering, facilitating faster scans and range queries without relying solely on indexes.[64] The primary trade-offs of denormalization center on improved read performance versus increased risks of update anomalies and higher storage costs. 
By duplicating data, queries can execute faster—often reducing response times by orders of magnitude for join-heavy operations—but updates require propagating changes across redundant copies, potentially leading to inconsistencies if not managed carefully.[62] Storage overhead rises due to redundancy, which can be significant in large-scale systems, though this is offset in read-intensive environments where query speed is paramount.[63] Denormalization is most appropriate for high-read workloads, such as analytical reporting or online analytical processing (OLAP) systems, where complex queries dominate over frequent updates typical in online transaction processing (OLTP). In OLAP scenarios, denormalized schemas support multidimensional analysis by flattening hierarchies, enabling sub-second responses on terabyte-scale data.[64] Conversely, OLTP environments, focused on concurrent transactions, generally avoid extensive denormalization to preserve data integrity during writes.[64] Modern relational database management systems (RDBMS) provide materialized views as a controlled mechanism for denormalization, storing precomputed query results that can be refreshed periodically or incrementally. These views act as virtual denormalized tables, combining the benefits of redundancy for fast reads with automated maintenance to mitigate update anomalies.[65] For instance, Oracle's materialized views support equi-joins and aggregations optimized for data warehousing, reducing query times while integrating with the underlying normalized schema.[65] This approach, rooted in incremental view maintenance techniques, balances performance gains with consistency in production environments.[66]
Advanced Features
Transactions and ACID Properties
In relational databases, a transaction is defined as a logical unit of work consisting of a sequence of operations, such as reads and writes, that are executed as a single, indivisible entity to maintain data integrity. Transactions typically begin with a BEGIN statement, proceed through a series of database operations, and conclude with either a COMMIT to permanently apply the changes or a ROLLBACK to undo them entirely, ensuring that partial failures do not leave the database in an inconsistent state. This mechanism allows complex operations, like transferring funds between accounts, to be treated atomically, preventing issues such as overdrafts if one step fails.

The reliability of transactions in relational databases is ensured through the ACID properties (atomicity, consistency, isolation, durability), a set of guarantees that ensure reliable transaction processing; the acronym was coined by Theo Härder and Andreas Reuter in 1983.[67] Atomicity requires that a transaction is executed completely or not at all; if any operation fails, the entire transaction is rolled back, restoring the database to its pre-transaction state. Consistency mandates that a transaction brings the database from one valid state to another, preserving integrity constraints such as primary keys, foreign keys, and check constraints after completion. Isolation ensures that concurrent transactions do not interfere with each other, making each transaction appear to execute in isolation even when running in parallel. Durability guarantees that once a transaction is committed, its effects are permanently stored, surviving subsequent system failures through techniques like write-ahead logging.
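The fund-transfer pattern described above (BEGIN, a series of operations, then COMMIT or ROLLBACK) can be sketched with Python's built-in sqlite3 module; the accounts table and the no-overdraft rule are illustrative assumptions:

```python
import sqlite3

# isolation_level=None puts the driver in autocommit mode, so the
# transaction boundaries are controlled by explicit SQL statements.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100.0), ('bob', 50.0)")

def transfer(src, dst, amount):
    """Move funds atomically: both UPDATEs take effect, or neither does."""
    conn.execute("BEGIN")
    try:
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                     (amount, src))
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (src,)).fetchone()
        if balance < 0:                      # consistency rule: no overdrafts
            raise ValueError("insufficient funds")
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                     (amount, dst))
        conn.execute("COMMIT")               # durable once this succeeds
    except Exception:
        conn.execute("ROLLBACK")             # atomicity: undo the partial debit
        raise

transfer("alice", "bob", 30.0)               # succeeds and commits
try:
    transfer("alice", "bob", 1000.0)         # fails the overdraft check
except ValueError:
    pass                                     # the partial debit was rolled back
print(conn.execute("SELECT name, balance FROM accounts ORDER BY name").fetchall())
# [('alice', 70.0), ('bob', 80.0)]
```

Only the first, committed transfer is visible afterwards; the failed transfer leaves no trace, which is the atomicity guarantee in miniature.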
To balance isolation with performance in multi-user environments, relational databases implement varying isolation levels as defined by the ANSI SQL standard, which specify the degree to which concurrent transactions are shielded from each other's effects.[68] The read uncommitted level allows a transaction to read data modified by another uncommitted transaction, potentially leading to dirty reads but maximizing concurrency.[68] Read committed prevents dirty reads by ensuring reads only from committed data, though it permits non-repeatable reads where the same query may yield different results within a transaction.[68] Repeatable read avoids non-repeatable reads by locking read data until the transaction ends, but it may still allow phantom reads from new insertions by other transactions.[68] The strictest, serializable, fully emulates sequential execution, preventing all anomalies including phantoms through techniques like locking or timestamping, at the cost of reduced concurrency.[68]

For distributed relational databases spanning multiple nodes, the two-phase commit (2PC) protocol coordinates transactions to achieve atomicity and consistency across sites.[69] In the first phase, a coordinator polls participants to prepare the transaction; each votes yes if it can commit locally or no if it cannot, with all logging their intent durably.[69] If all vote yes, the second phase issues a global commit, propagating the decision; otherwise, an abort is sent, and all roll back.[69] This ensures that either all sites commit or none do, though it can block if the coordinator fails, requiring recovery mechanisms.[69]

Stored Procedures, Triggers, and Views
Stored procedures are pre-compiled blocks of SQL code stored in the database that can be invoked repeatedly to perform complex operations, such as data manipulation or business logic execution, often with input and output parameters for flexibility.[70] They originated as an extension to SQL in commercial RDBMS implementations, with Oracle introducing PL/SQL stored procedures in Oracle7 in 1992 to enhance reusability and reduce network traffic by executing code server-side.[71] Stored procedures support error handling through exception blocks and can include conditional logic, making them suitable for encapsulating database-side programming.[70] A basic example of creating a stored procedure in PL/SQL, Oracle's procedural extension to SQL, is as follows:

CREATE OR REPLACE PROCEDURE update_employee_salary(emp_id IN NUMBER, raise_pct IN NUMBER)
IS
BEGIN
    UPDATE employees
    SET salary = salary * (1 + raise_pct / 100)
    WHERE employee_id = emp_id;
    IF SQL%ROWCOUNT = 0 THEN
        RAISE_APPLICATION_ERROR(-20001, 'Employee not found');
    END IF;
    COMMIT;
EXCEPTION
    WHEN OTHERS THEN
        ROLLBACK;
        RAISE;
END update_employee_salary;
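The same guard-and-rollback logic can be sketched client-side with Python's sqlite3 module; SQLite has no stored procedures, so this function merely mirrors the procedure's structure (row-count check, rollback on error), and the schema is trimmed to the columns used:

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # explicit transactions
conn.execute("CREATE TABLE employees (employee_id INTEGER PRIMARY KEY, salary REAL)")
conn.execute("INSERT INTO employees VALUES (101, 50000.0)")

def update_employee_salary(emp_id, raise_pct):
    conn.execute("BEGIN")
    try:
        cur = conn.execute(
            "UPDATE employees SET salary = salary * (1 + ? / 100.0) "
            "WHERE employee_id = ?", (raise_pct, emp_id))
        if cur.rowcount == 0:            # mirrors the SQL%ROWCOUNT guard
            raise LookupError("Employee not found")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")         # mirrors the exception handler
        raise

update_employee_salary(101, 25)
print(conn.execute("SELECT salary FROM employees WHERE employee_id = 101").fetchone()[0])
# 62500.0
```

Unlike a true stored procedure, this code runs in the client and incurs a network round trip per statement; the server-side version avoids exactly that overhead.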
Triggers are stored routines that the database executes automatically in response to data events such as INSERT, UPDATE, or DELETE on a table, and are commonly used for auditing and enforcing business rules. For example, the following Transact-SQL trigger records each update to the employees table in an audit table:
CREATE TRIGGER tr_employees_audit
ON employees
AFTER UPDATE
AS
BEGIN
    INSERT INTO audit_log (table_name, operation, changed_at)
    SELECT 'employees', 'UPDATE', GETDATE()
    WHERE @@ROWCOUNT > 0;
END;
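An equivalent AFTER UPDATE trigger can be demonstrated end-to-end with SQLite from Python; the employees and audit_log schemas are illustrative, and SQLite differs from Transact-SQL in that its triggers fire per row and use datetime('now') rather than GETDATE():

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (employee_id INTEGER PRIMARY KEY, salary REAL);
CREATE TABLE audit_log (table_name TEXT, operation TEXT, changed_at TEXT);

-- Fires automatically once per updated row, recording the event.
CREATE TRIGGER tr_employees_audit AFTER UPDATE ON employees
BEGIN
    INSERT INTO audit_log VALUES ('employees', 'UPDATE', datetime('now'));
END;

INSERT INTO employees VALUES (1, 40000.0);
""")

# The application issues a plain UPDATE; the trigger does the auditing.
conn.execute("UPDATE employees SET salary = 42000.0 WHERE employee_id = 1")
print(conn.execute("SELECT table_name, operation FROM audit_log").fetchall())
# [('employees', 'UPDATE')]
```

The audit row appears without any explicit INSERT from the application, which is the point of trigger-based auditing.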
A view is a virtual table defined by a stored query; it holds no data of its own but presents rows derived from its base tables each time it is queried. For example, the following view exposes only active employees while hiding the status column:
CREATE VIEW active_employees AS
SELECT employee_id, first_name, last_name, department
FROM employees
WHERE status = 'active';
Querying the view (e.g., SELECT * FROM active_employees) returns only current employees, abstracting the full table and enforcing access controls.[78] Views thus facilitate modular database design by decoupling applications from physical schema changes.[76]
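This behavior can be demonstrated with SQLite from Python; the sample rows are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (employee_id INTEGER PRIMARY KEY,
                        first_name TEXT, last_name TEXT,
                        department TEXT, status TEXT);
INSERT INTO employees VALUES
    (1, 'Ada', 'Lovelace', 'Research', 'active'),
    (2, 'Alan', 'Turing', 'Research', 'inactive');

-- The view exposes only current employees and hides the status column.
CREATE VIEW active_employees AS
SELECT employee_id, first_name, last_name, department
FROM employees
WHERE status = 'active';
""")

print(conn.execute("SELECT first_name FROM active_employees").fetchall())
# [('Ada',)]
```

Because the view is evaluated against the base table on each query, updating an employee's status immediately changes what the view returns, with no separate refresh step.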
