Recent from talks
Nothing was collected or created yet.
Third normal form
View on WikipediaThird normal form (3NF) is a level of database normalization defined by English computer scientist Edgar F. Codd. A relation (or table, in SQL) is in third normal form if it is in second normal form and also lacks non-key dependencies, meaning that no non-prime attribute is functionally dependent on (that is, contains a fact about) any other non-prime attribute. In other words, each non-prime attribute must depend solely and non-transitively on each candidate key.[1] William Kent summarised 3NF with the dictum that "a non-key field must provide a fact about the key, the whole key, and nothing but the key".[2][citation needed]
An example of a violation of 3NF would be a Patient relation with the attributes PatientID, DoctorID and DoctorName, in which DoctorName would depend first and foremost on DoctorID and only transitively on the key, PatientID (via DoctorID's dependency on PatientID). Such a design would cause a doctor's name to be redundantly duplicated across each of their patients. A database compliant with 3NF would store doctors' names in a separate Doctor relation which Patient could reference via a foreign key.
3NF was defined, along with 2NF (which forbids dependencies on proper subsets of composite keys), in Codd's paper "Further Normalization of the Data Base Relational Model" in 1971,[3] which came after 1NF's definition in "A Relational Model of Data for Large Shared Data Banks" in 1970.[citation needed] 3NF was itself followed by the definition of Boyce–Codd normal form in 1974, which seeks to prevent anomalies possible in relations with several overlapping composite keys.
Definition of third normal form
[edit]Codd's definition states that a relation R is in 3NF if and only if it is in second normal form (2NF) and every non-prime attribute of R is non-transitively dependent on each candidate key. A non-prime attribute of R is an attribute that does not belong to any candidate key of R.[4]
Codd defines a transitive dependency of an attribute set Z on an attribute set X as a functional dependency chain X → Y → Z that must be satisfied for some attribute set Y, where it is not the case that Y → X, and all three sets must be disjoint.[5]
A 3NF definition that is equivalent to Codd's, but expressed differently, was given by Carlo Zaniolo in 1982. This definition states that a table is in 3NF if and only if for each of its functional dependencies X → Y, at least one of the following conditions holds:[6][7][need quotation to verify]
- X contains Y (that is, Y is a subset of X, meaning X → Y is a trivial functional dependency),
- X is a superkey,
- every element of Y \ X, the set difference between Y and X, is a prime attribute (i.e., each attribute in Y \ X is contained in some candidate key).
To rephrase Zaniolo's definition more simply, the relation is in 3NF if and only if for every non-trivial functional dependency X → Y, X is a superkey or Y \ X consists of prime attributes. Zaniolo's definition gives a clear sense of the difference between 3NF and the more stringent Boyce–Codd normal form (BCNF). BCNF simply eliminates the third alternative ("Every element of Y \ X, the set difference between Y and X, is a prime attribute.").
The definition offered by Zaniolo can be shown to be equivalent to the Codd definition in the following way: let X → A be a nontrivial functional dependency (i.e., one where X does not contain A) and let A be a non-prime attribute. Also let Y be a candidate key of R. Then Y → X. Further since A is a non-prime attribute, therefore A cannot determine X (A → X not possible) because in that case AY would form the super key. Therefore, A is not transitively dependent on Y (X is not prime attribute as per 2NF but both Y and X can be non-primes without following the Codd definition for 3NF) if and only if there is a functional dependency X → Y (simply reversing one of the dependency to avoid transitivity), i.e., if and only if X is a superkey of R. It is to be noted that either or each of A, X and Y can be single attributes or a combination thereof but are necessarily disjoint. One can write X → Y equivalently as X → XY and one may thus observe the Zaniolo equivalence for Codd by performing the set difference between the dependent and the determinant.
Example
[edit]Design which violates 3NF
[edit]The following relation, with the composite key {Name, Year}, fails to meet the requirements of 3NF. The non-prime attributes WinnerName and WinnerBirthdate are only transitively dependent on the composite key via their dependence on the non-prime attribute WinnerID. This creates redundancy and the potential for inconsistency in the case that a winner of multiple tournaments is accidentally given different dates of birth in different tuples.
| Name | Year | WinnerID | WinnerName | WinnerBirthdate |
|---|---|---|---|---|
| Indiana Invitational | 1998 | 1 | Al Fredrickson | 1975-07-21 |
| Cleveland Open | 1999 | 2 | Bob Albertson | 1968-09-28 |
| Des Moines Masters | 1999 | 1 | Al Fredrickson | 1975-07-21 |
| Indiana Invitational | 1999 | 3 | Chip Masterson | 1977-03-14 |
Design which complies with 3NF
[edit]To bring the relation into compliance with 3NF, WinnerID, WinnerName and WinnerBirthdate can be transferred to a separate table.
| Name | Year | WinnerID |
|---|---|---|
| Indiana Invitational | 1998 | 1 |
| Cleveland Open | 1999 | 2 |
| Des Moines Masters | 1999 | 1 |
| Indiana Invitational | 1999 | 3 |
| WinnerID | Name | Birthdate |
|---|---|---|
| 1 | Al Fredrickson | 1975-07-21 |
| 2 | Bob Albertson | 1968-09-28 |
| 3 | Chip Masterson | 1977-03-14 |
Tournament's WinnerID attribute now acts as a foreign key referencing the primary key of Winner. Unlike before, it is not possible for a winner to be associated with multiple dates of birth.
"Nothing but the key"
[edit]A paraphrase of Codd's definition of 3NF parodying the traditional oath to tell the truth in a court of law was given by William Kent: "a non-key field must provide a fact about the key, the whole key, and nothing but the key".[2] Requiring that non-key attributes be dependent on "the whole key" ensures compliance with 2NF, and further requiring their dependency on "nothing but the key" ensures compliance with 3NF. A common variation supplements the paraphrase with the addendum "so help me Codd".[8]
While the phrase is a useful mnemonic, the mention of only a single key makes fulfilling it necessary but not sufficient to satisfy 2NF and 3NF, both of which are concerned with all candidate keys of a relation and not just any one.[citation needed]
Christopher J. Date notes that, adapted to refer to all fields rather than just non-key fields, the summary can also encompass the slightly stronger Boyce–Codd normal form, in which prime attributes must not be functionally dependent at all.[9] Prime attributes are considered to provide a fact about the key in the sense of providing part or all of the key itself. (This rule applies only to functionally dependent attributes, as applying it to all attributes would implicitly prohibit composite keys, since each part of any such key would violate the "whole key" clause.)
Computation
[edit]This section may be confusing or unclear to readers. (July 2025) |
A relation can always be decomposed in third normal form, that is, the relation R is rewritten to projections R1, ..., Rn whose join is equal to the original relation. Further, this decomposition does not lose any functional dependency, in the sense that every functional dependency on R can be derived from the functional dependencies that hold on the projections R1, ..., Rn. What is more, such a decomposition can be computed in polynomial time.[10]
To decompose a relation into 3NF from 2NF, break the table into the canonical cover functional dependencies, then create a relation for every candidate key of the original relation which was not already a subset of a relation in the decomposition.[11]
Considerations for use in reporting environments
[edit]This section may be confusing or unclear to readers. (July 2025) |
While 3NF was ideal for machine processing, the segmented nature of the data model can be difficult to intuitively consume by a human user. Analytics via query, reporting, and dashboards were often facilitated by a different type of data model that provided pre-calculated analysis such as trend lines, period-to-date calculations (month-to-date, quarter-to-date, year-to-date), cumulative calculations, basic statistics (average, standard deviation, moving averages) and previous period comparisons (year ago, month ago, week ago) e.g. dimensional modeling and beyond dimensional modeling, flattening of stars via Hadoop and data science.[12][13] Hadley Wickham's "tidy data" framework is 3NF, with "the constraints framed in statistical language".[14]
See also
[edit]- Attribute-value system
- First normal form (1NF)
- Second normal form (2NF)
- Boyce–Codd normal form (BCNF or 3.5NF)
- Fourth normal form (4NF)
- Fifth normal form (5NF)
- Sixth normal form (6NF)
References
[edit]- ^ Codd, E. F. "Further Normalization of the Data Base Relational Model", p. 34.
- ^ a b Kent, William. "A Simple Guide to Five Normal Forms in Relational Database Theory", Communications of the ACM 26 (2), Feb. 1983, pp. 120–125.
- ^ Codd, E. F. "Further Normalization of the Data Base Relational Model". (Presented at Courant Computer Science Symposia Series 6, "Data Base Systems", New York City, May 24–25, 1971.) IBM Research Report RJ909 (August 31, 1971). Republished in Randall J. Rustin (ed.), Data Base Systems: Courant Computer Science Symposia Series 6. Prentice-Hall, 1972.
- ^ Codd, p. 43.
- ^ Codd, p. 45–46.
- ^ Zaniolo, Carlo. "A New Normal Form for the Design of Relational Database Schemata". ACM Transactions on Database Systems 7(3), September 1982.
- ^ Abraham Silberschatz, Henry F. Korth, S. Sudarshan, Database System Concepts (5th edition), p. 276–277.
- ^ The author of a 1989 book on database management credits one of his students with coming up with the "so help me Codd" addendum. Diehr, George. Database Management (Scott, Foresman, 1989), p. 331.
- ^ Date, C. J. An Introduction to Database Systems (7th ed.) (Addison Wesley, 2000), p. 379.
- ^ Serge Abiteboul, Richard B. Hull, Victor Vianu: Foundations of Databases. Addison-Wesley, 1995. http://webdam.inria.fr/Alice/ ISBN 0201537710. Theorem 11.2.14.
- ^ Hammo, Bassam. "Decomposition, 3NF, BCNF" (PDF). Archived (PDF) from the original on 2023-03-15.
- ^ "Comparisons between Data Warehouse modelling techniques – Roelant Vos". Roelant Vos. 12 February 2013. Retrieved 5 March 2018.
- ^ "Hadoop Data Modeling Lessons | EMC". InFocus Blog | Dell EMC Services. 23 September 2014. Retrieved 5 March 2018.
- ^ Wickham, Hadley (2014-09-12). "Tidy Data". Journal of Statistical Software. 59 (10): 1–23. doi:10.18637/jss.v059.i10. ISSN 1548-7660.
Further reading
[edit]- Date, C. J. (1999), * Date, C. J. (1999). An Introduction to Database Systems (8th ed.). Addison-Wesley Longman. ISBN 0-321-19784-4. Archived from the original on 4 April 2005.
- Kent, William (1983). "A Simple Guide to Five Normal Forms in Relational Database Theory". Communications of the ACM. 26 (2): 120–126. doi:10.1145/358024.358054.
External links
[edit]- Litt's Tips: Normalization
- Database Normalization Basics by Mike Chapple (About.com)
- An Introduction to Database Normalization by Mike Hillyer.
- A tutorial on the first 3 normal forms by Fred Coulson
- Description of the database normalization basics by Microsoft
- Third Normal Form with Simple Examples by exploreDatabase
Third normal form
View on GrokipediaNormalization Prerequisites
Overview of Database Normalization
Database normalization is a systematic process for organizing data in a relational database to minimize redundancy and dependency, thereby enhancing data integrity and consistency. This technique structures tables and their relationships to ensure that data is stored efficiently without unnecessary duplication, which can lead to inconsistencies during operations.[5] The concept originated with Edgar F. Codd's seminal 1970 paper, "A Relational Model of Data for Large Shared Data Banks," which introduced the relational model and laid the foundation for normalization as a means to maintain data independence and logical structure in shared data systems.[6] Normalization addresses key goals in database design, including the elimination of insertion anomalies (difficulty adding new data without extraneous information), update anomalies (inconsistent changes across duplicated data), and deletion anomalies (loss of unrelated data when removing records), ultimately promoting data consistency and reducing maintenance efforts.[7][8] In relational databases, core terminology includes relations (equivalent to tables), attributes (columns defining data properties), and tuples (rows representing individual records). Normalization progresses through a hierarchy of normal forms, starting from first normal form (1NF) and advancing to higher levels such as second normal form (2NF) and beyond, with each subsequent form building on the previous to provide incremental refinements in data organization and anomaly prevention.[6][5] This progression relies on concepts like functional dependencies, which describe how attribute values determine others, guiding the decomposition of relations into more refined structures.[9]First and Second Normal Forms
First normal form (1NF) requires that a relation consists of atomic values in each attribute, ensuring no repeating groups or multivalued attributes within a single cell. This means every entry in the relation must be indivisible and represent a single value from its domain, eliminating nested structures or lists that could complicate data retrieval and integrity.[6] The purpose of 1NF is to establish a foundational structure where each tuple can be uniquely identified by a primary key, facilitating consistent querying and updates without ambiguity from composite or non-atomic data.[6] Consider an unnormalized table tracking student enrollments, where the "Courses" attribute contains multiple values separated by commas:| StudentID | StudentName | Courses |
|---|---|---|
| 101 | Alice | Math, Physics |
| 102 | Bob | Chemistry, Math |
| StudentID | StudentName | Course |
|---|---|---|
| 101 | Alice | Math |
| 101 | Alice | Physics |
| 102 | Bob | Chemistry |
| 102 | Bob | Math |
| OrderID | ProductID | SupplierName | Quantity |
|---|---|---|---|
| 001 | P1 | Acme Corp | 5 |
| 001 | P2 | Beta Inc | 3 |
| 002 | P1 | Acme Corp | 2 |
| OrderID | ProductID | Quantity |
|---|---|---|
| 001 | P1 | 5 |
| 001 | P2 | 3 |
| 002 | P1 | 2 |
| ProductID | SupplierName |
|---|---|
| P1 | Acme Corp |
| P2 | Beta Inc |
Core Definition
Formal Statement of 3NF
Third normal form (3NF) was introduced by E. F. Codd in 1972 to further refine relational database structures by addressing redundancies beyond those handled in second normal form, as part of establishing rules for relational integrity.[10] A relation schema with attributes divided into prime attributes (those belonging to some candidate key) and non-prime attributes is in 3NF if it is already in second normal form (2NF) and every non-prime attribute is non-transitively dependent on every candidate key of .[10] This condition ensures that no non-prime attribute depends indirectly on a key through another non-key attribute. A transitive dependency in this context arises from a chain of functional dependencies and , where is a candidate key, is a non-prime attribute (not a superkey), and is another non-prime attribute, implying an indirect dependency that violates direct dependence on the key.[10] An equivalent formulation states that a relation is in 3NF if, for every non-trivial functional dependency holding in , either is a superkey of , or is a prime attribute.[11] This transitive dependency condition is equivalent to the functional dependency-based definition commonly used in modern database theory. By extending 2NF—which eliminates partial dependencies—3NF specifically targets transitive dependencies among non-key attributes, thereby reducing update anomalies and improving data consistency in relational models.[10]Role of Functional Dependencies
Functional dependencies (FDs) serve as the cornerstone for analyzing and achieving third normal form in relational databases by capturing the semantic constraints that dictate how attribute values are interrelated within a relation. A functional dependency X → Y holds in a relation R if, for every pair of tuples in R that agree on all attributes in X, they also agree on all attributes in Y; this means X uniquely determines Y. FDs are classified into several types based on their structure and implications: a trivial FD occurs when Y is a subset of X, as it always holds without additional constraints; a non-trivial FD has Y not entirely contained in X; full FDs indicate that no proper subset of X determines Y, contrasting with partial FDs where a subset does; and transitive FDs arise when X → Z → Y implies X → Y indirectly. The logical implications of FDs are governed by Armstrong's axioms, a complete set of inference rules for deriving all valid FDs from a given set. These include reflexivity (if Y ⊆ X, then X → Y), augmentation (if X → Y, then XZ → YZ for any Z), and transitivity (if X → Y and Y → Z, then X → Z). Derived rules extend these, such as union (if X → Y and X → Z, then X → YZ), decomposition (if X → YZ, then X → Y and X → Z), and pseudotransitivity (if X → Y and WY → Z, then WX → Z). These axioms enable systematic reasoning about dependencies, ensuring that any FD inferred logically holds in the relation. Identifying FDs typically involves semantic analysis, where domain experts derive them from business rules and entity relationships to reflect real-world constraints. Alternatively, empirical methods mine FDs directly from data samples using algorithms that scan relations to detect dependencies, such as heuristic-driven searches for minimal covers in large datasets. To facilitate analysis, FD sets are often simplified into a canonical cover, a minimal equivalent set where each FD is non-redundant, left-reduced (no extraneous attributes on the left side), and right-reduced (no extraneous on the right). The process involves repeatedly removing redundant FDs and extraneous attributes using closure computations until no further simplifications are possible. The closure of an attribute set X, denoted , comprises all attributes in the relation that are functionally determined by X, computed iteratively by applying the given FDs and Armstrong's axioms starting from X until no new attributes are added. This closure is essential for verifying keys, testing implications, and simplifying FD sets.Illustrative Examples
Design Violating 3NF
A design violates third normal form (3NF) when it is in second normal form (2NF) but contains transitive dependencies, where a non-prime attribute depends on another non-prime attribute rather than directly on a candidate key.[12] Consider a relation schema called Employee with attributes EmpID (primary key), Dept, DeptLocation, and Skill. The functional dependencies (FDs) are: EmpID → Dept, EmpID → Skill, and Dept → DeptLocation. Here, Skill and Dept depend directly on the primary key EmpID, but DeptLocation depends on Dept (a non-key attribute), creating a transitive dependency EmpID → Dept → DeptLocation.[13] This relation is in 2NF because the primary key is a single attribute, so there are no partial dependencies on only part of a composite key. However, it fails 3NF due to the transitive dependency on the non-prime attribute DeptLocation.[12] To illustrate, suppose the Employee relation contains the following data, where multiple employees share the same department and thus the same department location, leading to redundancy:| EmpID | Skill | Dept | DeptLocation |
|---|---|---|---|
| 101 | Java | IT | New York |
| 102 | Python | IT | New York |
| 103 | SQL | HR | London |
| EmpID | Skill | Dept | DeptLocation |
|---|---|---|---|
| 101 | Java | IT | New York |
| 102 | Python | IT | New York |
Design Complying with 3NF
To achieve compliance with third normal form (3NF), a database schema exhibiting transitive dependencies must be refactored by decomposing tables such that no non-prime attribute is dependent on another non-prime attribute.[10] Consider a typical violating design where an Employee table includes attributes for employee ID (EmpID, primary key), department (Dept), skill (Skill), and department location (DeptLocation), with functional dependencies (FDs) EmpID → Dept, EmpID → Skill, and Dept → DeptLocation. The transitive dependency Dept → DeptLocation violates 3NF because DeptLocation depends on Dept, which is not a key. To comply, decompose into two tables: Employee (EmpID [PK], Dept [FK], Skill) and Department (Dept [PK], DeptLocation). This separation ensures all non-prime attributes depend directly on candidate keys, eliminating the transitive dependency.[10] Verification of 3NF compliance involves confirming the schema meets second normal form (2NF) prerequisites and has no transitive dependencies. In the Employee table, FDs are EmpID → Dept and EmpID → Skill, with both non-prime attributes (Dept, Skill) depending solely on the primary key EmpID; no partial or transitive issues exist. In the Department table, the FD Dept → DeptLocation ensures DeptLocation depends only on the primary key Dept. Joins via the foreign key Dept in Employee referencing Department allow reconstruction of the original data without redundancy.[14] The following markdown tables illustrate a side-by-side comparison of the original violating design and the refactored 3NF-compliant schema, assuming sample data for three employees in two departments: Original Violating Table: Employee| EmpID | Dept | Skill | DeptLocation |
|---|---|---|---|
| 101 | Sales | Marketing | New York |
| 102 | Sales | Sales | New York |
| 103 | IT | Coding | San Francisco |
| Dept | DeptLocation |
|---|---|
| Sales | New York |
| IT | San Francisco |
Theoretical Foundations
The "Nothing but the Key" Principle
The "Nothing but the Key" principle serves as an informal mnemonic to intuitively grasp the requirements of third normal form (3NF) in relational database design. Coined by William Kent, it states that every non-key attribute in a relation must provide a fact about the key, the whole key, and nothing but the key.[13] This slogan parallels the oath taken in court to emphasize completeness and exclusivity in attribute dependencies. The breakdown of the phrase highlights key aspects of normalization. The "whole key" portion ensures that non-key attributes depend on the entire candidate key, thereby eliminating partial dependencies addressed in second normal form (2NF).[13] Meanwhile, "nothing but the key" prevents non-key attributes from depending on other non-key attributes, avoiding transitive dependencies that could lead to update anomalies.[13] In relation to functional dependencies, the principle ensures that every non-key attribute is functionally determined solely by the key, without any non-key attribute being determined by another non-key attribute.[1] This mnemonic originated as an elaboration on E.F. Codd's formal introduction of 3NF in 1972, aimed at making normalization concepts more accessible to database practitioners.[1][13] However, the principle oversimplifies dependency theory and fails to account for edge cases like multivalued dependencies, which require higher normal forms such as fourth normal form (4NF) for resolution.[13]Algorithms for 3NF Decomposition
The synthesis algorithm for third normal form (3NF) decomposition, originally proposed by Bernstein, provides a systematic procedure to transform a relational schema into a set of 3NF relations while preserving key properties of the original design.[16] Given a relation schema with attribute set and a set of functional dependencies (FDs) over , the algorithm first computes a canonical cover of , which is a minimal equivalent set of FDs with single attributes on the right-hand side and no redundant FDs or attributes.[16] This step involves iteratively removing redundant FDs by checking if each FD is implied by the others using attribute closure computations and decomposing multi-attribute right-hand sides into individual FDs.[16] The core decomposition proceeds as follows: for each FD in , create a relation schema consisting of the attributes .[16] If none of the resulting schemas contains a candidate key of , add a new schema comprising one such key.[16] Finally, if any schema has only a single attribute and is subsumed by another schema, merge it into the superseding schema to avoid trivial relations; similarly, combine schemas sharing the same set of attributes.[16] This process yields a decomposition into relations, each of which satisfies the 3NF condition as defined by Codd, where for every non-trivial FD in the projected FDs, either is a superkey or is a prime attribute.[16] The algorithm guarantees a lossless-join decomposition, meaning the natural join of the decomposed relations reconstructs the original relation without spurious tuples, as each relation includes determinants from the canonical cover, ensuring the join dependencies align with the FDs.[16] It also preserves dependencies, such that the union of the FDs projected onto each decomposed relation logically implies the original set , allowing enforcement of all constraints locally without recomputing closures across relations.[16] To verify whether a given schema is already in 3NF, a checking algorithm examines the FDs for violations of the form.[16] Compute the canonical cover of the given FDs. For each FD in where is a non-prime attribute, determine if is a superkey by computing the attribute closure (under ) and checking if ; if not, and no such violation holds for prime , the schema is in 3NF.[16] This verifies the absence of transitive dependencies by ensuring no non-superkey determinant implies a non-prime attribute transitively. The following pseudocode outlines the synthesis algorithm abstractly:Input: Relation R, FD set F
Output: Set of 3NF relations D
1. Compute canonical cover Fc of F // Using closure checks to remove redundancies
2. D = [empty set](/page/Empty_set)
3. For each FD X → A in Fc:
Add relation (X ∪ {A}) to D
4. If no relation in D contains a [candidate key](/page/Candidate_key) K of R:
Add relation K to D
5. For each pair of relations Ri, Rj in D where Ri ⊆ Rj:
Remove Ri from D // Merge subsumed relations
Return D
Input: Relation R, FD set F
Output: Set of 3NF relations D
1. Compute canonical cover Fc of F // Using closure checks to remove redundancies
2. D = [empty set](/page/Empty_set)
3. For each FD X → A in Fc:
Add relation (X ∪ {A}) to D
4. If no relation in D contains a [candidate key](/page/Candidate_key) K of R:
Add relation K to D
5. For each pair of relations Ri, Rj in D where Ri ⊆ Rj:
Remove Ri from D // Merge subsumed relations
Return D
