Data model
A data model is an abstract model that organizes elements of data and standardizes how they relate to one another and to the properties of real-world entities.[2][3] For instance, a data model may specify that the data element representing a car be composed of a number of other elements which, in turn, represent the color and size of the car and define its owner.
The corresponding professional activity is called generally data modeling or, more specifically, database design. Data models are typically specified by a data expert, data specialist, data scientist, data librarian, or a data scholar. A data modeling language and notation are often represented in graphical form as diagrams.[4]
A data model can sometimes be referred to as a data structure, especially in the context of programming languages. Data models are often complemented by function models, especially in the context of enterprise models.
A data model explicitly determines the structure of data; conversely, structured data is data organized according to an explicit data model or data structure. Structured data is in contrast to unstructured data and semi-structured data.
Overview
The term data model can refer to two distinct but closely related concepts. Sometimes it refers to an abstract formalization of the objects and relationships found in a particular application domain: for example the customers, products, and orders found in a manufacturing organization. At other times it refers to the set of concepts used in defining such formalizations: for example concepts such as entities, attributes, relations, or tables. So the "data model" of a banking application may be defined using the entity–relationship "data model". This article uses the term in both senses.
Managing large quantities of structured and unstructured data is a primary function of information systems. Data models describe the structure, manipulation, and integrity aspects of the data stored in data management systems such as relational databases. They may also describe data with a looser structure, such as word processing documents, email messages, pictures, digital audio, and video: XDM, for example, provides a data model for XML documents.
The role of data models
The main aim of data models is to support the development of information systems by providing the definition and format of data. According to West and Fowler (1999) "if this is done consistently across systems then compatibility of data can be achieved. If the same data structures are used to store and access data then different applications can share data. The results of this are indicated above. However, systems and interfaces often cost more than they should, to build, operate, and maintain. They may also constrain the business rather than support it. A major cause is that the quality of the data models implemented in systems and interfaces is poor".[5]
- "Business rules, specific to how things are done in a particular place, are often fixed in the structure of a data model. This means that small changes in the way business is conducted lead to large changes in computer systems and interfaces".[5]
- "Entity types are often not identified, or incorrectly identified. This can lead to replication of data, data structure, and functionality, together with the attendant costs of that duplication in development and maintenance".[5]
- "Data models for different systems are arbitrarily different. The result of this is that complex interfaces are required between systems that share data. These interfaces can account for between 25–70% of the cost of current systems".[5]
- "Data cannot be shared electronically with customers and suppliers, because the structure and meaning of data has not been standardized. For example, engineering design data and drawings for process plant are still sometimes exchanged on paper".[5]
The reason for these problems is a lack of standards that will ensure that data models will both meet business needs and be consistent.[5]
A data model explicitly determines the structure of data. Typical applications of data models include database models, design of information systems, and enabling exchange of data. Usually, data models are specified in a data modeling language.[3]
Three perspectives
A data model instance may be one of three kinds according to ANSI in 1975:[6]
- Conceptual data model: describes the semantics of a domain, being the scope of the model. For example, it may be a model of the interest area of an organization or industry. This consists of entity classes, representing kinds of things of significance in the domain, and relationship assertions about associations between pairs of entity classes. A conceptual schema specifies the kinds of facts or propositions that can be expressed using the model. In that sense, it defines the allowed expressions in an artificial 'language' with a scope that is limited by the scope of the model.
- Logical data model: describes the semantics, as represented by a particular data manipulation technology. This consists of descriptions of tables and columns, object oriented classes, and XML tags, among other things.
- Physical data model: describes the physical means by which data are stored. This is concerned with partitions, CPUs, tablespaces, and the like.
The significance of this approach, according to ANSI, is that it allows the three perspectives to be relatively independent of each other. Storage technology can change without affecting either the logical or the conceptual model. The table/column structure can change without (necessarily) affecting the conceptual model. In each case, of course, the structures must remain consistent with the other model. The table/column structure may be different from a direct translation of the entity classes and attributes, but it must ultimately carry out the objectives of the conceptual entity class structure. Early phases of many software development projects emphasize the design of a conceptual data model. Such a design can be detailed into a logical data model. In later stages, this model may be translated into a physical data model. However, it is also possible to implement a conceptual model directly.
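To make the three perspectives concrete, the following minimal Python sketch (with illustrative names, not drawn from any cited source) represents the same "Customer" notion at the conceptual level (an entity with attributes), the logical level (a table/column definition), and the physical level (storage settings that could change without touching the other two):

```python
from dataclasses import dataclass

# Conceptual level: kinds of things and their attributes,
# independent of any storage technology.
@dataclass
class Customer:
    name: str
    email: str

# Logical level: the same concept rendered as a table/column
# structure for a relational DBMS (shown here as a DDL string).
LOGICAL_SCHEMA = """
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    email       TEXT UNIQUE
);
"""

# Physical level: hypothetical storage choices that can change
# without altering the conceptual or logical descriptions above.
PHYSICAL_SETTINGS = {
    "tablespace": "users_ts",
    "index": "idx_customer_email",
    "page_size_kb": 8,
}

if __name__ == "__main__":
    print(Customer("Ada", "ada@example.com"))
    print(LOGICAL_SCHEMA.strip())
    print(PHYSICAL_SETTINGS)
```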
History
One of the earliest pioneering works in modeling information systems was done by Young and Kent (1958),[7][8] who argued for "a precise and abstract way of specifying the informational and time characteristics of a data processing problem". They wanted to create "a notation that should enable the analyst to organize the problem around any piece of hardware". Their work was the first effort to create an abstract specification and invariant basis for designing different alternative implementations using different hardware components. The next step in IS modeling was taken by CODASYL, an IT industry consortium formed in 1959, who essentially aimed at the same thing as Young and Kent: the development of "a proper structure for machine-independent problem definition language, at the system level of data processing". This led to the development of a specific IS information algebra.[8]
In the 1960s data modeling gained more significance with the initiation of the management information system (MIS) concept. According to Leondes (2002), "during that time, the information system provided the data and information for management purposes. The first generation database system, called Integrated Data Store (IDS), was designed by Charles Bachman at General Electric. Two famous database models, the network data model and the hierarchical data model, were proposed during this period of time".[9] Towards the end of the 1960s, Edgar F. Codd worked out his theories of data arrangement, and proposed the relational model for database management based on first-order predicate logic.[10]
In the 1970s entity–relationship modeling emerged as a new type of conceptual data modeling, originally formalized in 1976 by Peter Chen. Entity–relationship models were being used in the first stage of information system design during the requirements analysis to describe information needs or the type of information that is to be stored in a database. This technique can describe any ontology, i.e., an overview and classification of concepts and their relationships, for a certain area of interest.
In the 1970s G.M. Nijssen developed the "Natural Language Information Analysis Method" (NIAM), and in the 1980s, in cooperation with Terry Halpin, developed it into Object–Role Modeling (ORM). However, it was Terry Halpin's 1989 PhD thesis that created the formal foundation on which Object–Role Modeling is based.
Bill Kent, in his 1978 book Data and Reality,[11] compared a data model to a map of a territory, emphasizing that in the real world, "highways are not painted red, rivers don't have county lines running down the middle, and you can't see contour lines on a mountain". In contrast to other researchers who tried to create models that were mathematically clean and elegant, Kent emphasized the essential messiness of the real world, and the task of the data modeler to create order out of chaos without excessively distorting the truth.
In the 1980s, according to Jan L. Harrington (2000), "the development of the object-oriented paradigm brought about a fundamental change in the way we look at data and the procedures that operate on data. Traditionally, data and procedures have been stored separately: the data and their relationship in a database, the procedures in an application program. Object orientation, however, combined an entity's procedure with its data."[12]
During the early 1990s, three Dutch mathematicians, Guido Bakema, Harm van der Lek, and JanPieter Zwart, continued the development of the work of G.M. Nijssen. They focused more on the communication part of the semantics. In 1997 they formalized the method Fully Communication Oriented Information Modeling (FCO-IM).
Types
Database model
A database model is a specification describing how a database is structured and used.
Several such models have been suggested. Common models include:
- Flat model
- This may not strictly qualify as a data model. The flat (or table) model consists of a single, two-dimensional array of data elements, where all members of a given column are assumed to be similar values, and all members of a row are assumed to be related to one another.
- Hierarchical model
- The hierarchical model is similar to the network model, except that links in the hierarchical model form a tree structure, while the network model allows an arbitrary graph (a small code sketch contrasting the flat, hierarchical, and network structures follows this list).
- Network model
- The network model, also known as the graph model, organizes data using two fundamental constructs, called records and sets. Records (or nodes) contain fields (i.e. attributes), and sets (or edges) define one-to-many, many-to-many and many-to-one relationships between records: one owner, many members. The network data model is an abstraction of the design concept used in the implementation of databases. Network models emphasise interconnectedness, making them ideal for applications where relationships are crucial, like social networks or recommendation systems. This structure allows for efficient querying of relationships without expensive joins.
- Relational model
- A database model based on first-order predicate logic. Its core idea is to describe a database as a collection of predicates over a finite set of predicate variables, describing constraints on the possible values and combinations of values. The power of the relational data model lies in its mathematical foundations and a simple user-level paradigm.
- Object–relational model
- Similar to a relational database model, but objects, classes, and inheritance are directly supported in database schemas and in the query language.
- Object–role modeling
- A method of data modeling that has been defined as "attribute free", and "fact-based". The result is a verifiably correct system, from which other common artifacts, such as ERD, UML, and semantic models may be derived. Associations between data objects are described during the database design procedure, such that normalization is an inevitable result of the process.
- Star schema
- The simplest style of data warehouse schema. The star schema consists of a few "fact tables" (possibly only one, justifying the name) referencing any number of "dimension tables". The star schema is considered an important special case of the snowflake schema.
- Concept-oriented model
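As referenced in the hierarchical model entry above, the following is a minimal Python sketch (illustrative data, no particular DBMS) contrasting how the same order records might be laid out under the flat, hierarchical, and network models:

```python
# Flat (table) model: a single two-dimensional array of values.
flat = [
    # (order_id, customer, product)
    (1, "Ada", "Compiler"),
    (2, "Ada", "Debugger"),
    (3, "Grace", "Compiler"),
]

# Hierarchical model: every record has exactly one parent,
# so a child shared by two parents must be duplicated.
hierarchical = {
    "Ada":   {"orders": [{"id": 1, "product": "Compiler"},
                         {"id": 2, "product": "Debugger"}]},
    "Grace": {"orders": [{"id": 3, "product": "Compiler"}]},
}

# Network model: records are linked through owner/member sets,
# so one member (a product record) may belong to several owners.
products = {"Compiler": {"name": "Compiler"}, "Debugger": {"name": "Debugger"}}
network = {
    "Ada":   {"orders": [products["Compiler"], products["Debugger"]]},
    "Grace": {"orders": [products["Compiler"]]},  # shared, not duplicated
}

print(len(flat), list(hierarchical),
      network["Grace"]["orders"][0] is products["Compiler"])
```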
Data structure diagram
A data structure diagram (DSD) is a diagram and data model used to describe conceptual data models by providing graphical notations which document entities and their relationships, and the constraints that bind them. The basic graphic elements of DSDs are boxes, representing entities, and arrows, representing relationships. Data structure diagrams are most useful for documenting complex data entities.
Data structure diagrams are an extension of the entity–relationship model (ER model). In DSDs, attributes are specified inside the entity boxes rather than outside of them, while relationships are drawn as boxes composed of attributes which specify the constraints that bind entities together. DSDs differ from the ER model in that the ER model focuses on the relationships between different entities, whereas DSDs focus on the relationships of the elements within an entity and enable users to fully see the links and relationships between each entity.
There are several styles for representing data structure diagrams, with the notable difference in the manner of defining cardinality. The choices are between arrow heads, inverted arrow heads (crow's feet), or numerical representation of the cardinality.

Entity–relationship model
An entity–relationship model (ERM), sometimes referred to as an entity–relationship diagram (ERD), can be used to represent an abstract conceptual data model (or semantic data model or physical data model) used in software engineering to represent structured data. There are several notations used for ERMs. Like DSDs, attributes are specified inside the entity boxes rather than outside of them, while relationships are drawn as lines, with the relationship constraints as descriptions on the line. The E-R model, while robust, can become visually cumbersome when representing entities with several attributes.
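As a rough illustration of how an entity–relationship model maps to an implementation, the following Python sketch (hypothetical Department and Employee entities, not taken from the cited sources) represents two entity classes and a one-to-many "works in" relationship carried by a foreign-key-style attribute:

```python
from dataclasses import dataclass
from typing import Optional

# Entities become classes; attributes become fields.
@dataclass
class Department:
    dept_id: int
    name: str

@dataclass
class Employee:
    emp_id: int
    name: str
    dept_id: Optional[int]  # "works in" relationship, cardinality N:1

departments = {10: Department(10, "Research")}
employees = [Employee(1, "Ada", 10), Employee(2, "Grace", 10)]

# The relationship constraint: every non-null dept_id must refer
# to an existing Department (referential integrity).
assert all(e.dept_id in departments for e in employees if e.dept_id is not None)
```

The cardinality constraint lives entirely in the single dept_id field: many employees may reference one department, but each employee references at most one.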
Geographic data model
A data model in Geographic information systems is a mathematical construct for representing geographic objects or surfaces as data. For example,
- the vector data model represents geography as points, lines, and polygons;
- the raster data model represents geography as cell matrices that store numeric values;
- and the Triangulated irregular network (TIN) data model represents geography as sets of contiguous, nonoverlapping triangles.[14]
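A minimal sketch, assuming plain Python containers rather than any GIS library, of how the vector and raster models above might represent the same terrain:

```python
# Vector model: discrete features as coordinate geometry.
point   = (105.3, 40.0)                             # a single location
line    = [(0.0, 0.0), (1.0, 0.5), (2.0, 0.0)]      # an ordered path
polygon = [(0, 0), (4, 0), (4, 3), (0, 3), (0, 0)]  # a closed boundary

# Raster model: a grid of cells, each storing a numeric value
# (e.g. elevation in metres); position is implicit in the indices.
raster = [
    [120, 125, 130],
    [118, 122, 128],
    [115, 119, 124],
]

# A TIN would instead store contiguous, non-overlapping triangles,
# e.g. as triples of vertex coordinates (not shown here).
print(point, line[1], polygon[2], raster[1][2])
```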
Gallery (National Geologic Map Database examples): groups relate to the process of making a map; NGMDB data model applications; NGMDB databases linked together; representing 3D map information.[15]
Generic data model
Generic data models are generalizations of conventional data models. They define standardized general relation types, together with the kinds of things that may be related by such a relation type. Generic data models are developed as an approach to solving some shortcomings of conventional data models. For example, different modelers usually produce different conventional data models of the same domain. This can lead to difficulty in bringing the models of different people together and is an obstacle for data exchange and data integration. Invariably, however, this difference is attributable to different levels of abstraction in the models and differences in the kinds of facts that can be instantiated (the semantic expression capabilities of the models). The modelers need to communicate and agree on certain elements that are to be rendered more concretely, in order to make the differences less significant.
Semantic data model
A semantic data model in software engineering is a technique to define the meaning of data within the context of its interrelationships with other data. A semantic data model is an abstraction that defines how the stored symbols relate to the real world.[13] A semantic data model is sometimes called a conceptual data model.
The logical data structure of a database management system (DBMS), whether hierarchical, network, or relational, cannot totally satisfy the requirements for a conceptual definition of data because it is limited in scope and biased toward the implementation strategy employed by the DBMS. Therefore, the need to define data from a conceptual view has led to the development of semantic data modeling techniques, that is, techniques to define the meaning of data within the context of its interrelationships with other data. As illustrated in the figure, the real world, in terms of resources, ideas, events, etc., is symbolically defined within physical data stores. A semantic data model is an abstraction that defines how the stored symbols relate to the real world. Thus, the model must be a true representation of the real world.[13]
Topics
Data architecture
Data architecture is the design of data for use in defining the target state and the subsequent planning needed to achieve that target state. It is usually one of several architecture domains that form the pillars of an enterprise architecture or solution architecture.
A data architecture describes the data structures used by a business and/or its applications. There are descriptions of data in storage and data in motion; descriptions of data stores, data groups, and data items; and mappings of those data artifacts to data qualities, applications, locations, etc.
Essential to realizing the target state, data architecture describes how data is processed, stored, and utilized in a given system. It provides criteria for data processing operations that make it possible to design data flows and also control the flow of data in the system.
Data modeling
Data modeling in software engineering is the process of creating a data model by applying formal data model descriptions using data modeling techniques. Data modeling is a technique for defining business requirements for a database. It is sometimes called database modeling because a data model is eventually implemented in a database.[16]
The figure illustrates the way data models are developed and used today. A conceptual data model is developed based on the data requirements for the application that is being developed, perhaps in the context of an activity model. The data model will normally consist of entity types, attributes, relationships, integrity rules, and the definitions of those objects. This is then used as the start point for interface or database design.[5]
Data properties
Some important properties of data for which requirements need to be met are:
- definition-related properties[5]
  - relevance: the usefulness of the data in relation to its intended purpose or application.
  - clarity: the availability of a clear and shared definition for the data.
  - consistency: the compatibility of the same type of data from different sources.
- content-related properties
  - timeliness: the availability of data at the time required and how up-to-date that data is.
  - accuracy: how close to the truth the data is.
- properties related to both definition and content
  - completeness: how much of the required data is available.
  - accessibility: where, how, and to whom the data is available or not available (e.g. security).
  - cost: the cost incurred in obtaining the data, and making it available for use.
Data organization
Another kind of data model describes how to organize data using a database management system or other data management technology. It describes, for example, relational tables and columns or object-oriented classes and attributes. Such a data model is sometimes referred to as the physical data model, but in the original ANSI three schema architecture, it is called "logical". In that architecture, the physical model describes the storage media (cylinders, tracks, and tablespaces). Ideally, this model is derived from the more conceptual data model described above. It may differ, however, to account for constraints like processing capacity and usage patterns.
While data analysis is a common term for data modeling, the activity actually has more in common with the ideas and methods of synthesis (inferring general concepts from particular instances) than it does with analysis (identifying component concepts from more general ones). (Presumably we call ourselves systems analysts because no one can say systems synthesists.) Data modeling strives to bring the data structures of interest together into a cohesive, inseparable whole by eliminating unnecessary data redundancies and by relating data structures with relationships.
A different approach is to use adaptive systems such as artificial neural networks that can autonomously create implicit models of data.
Data structure
A data structure is a way of storing data in a computer so that it can be used efficiently. It is an organization of mathematical and logical concepts of data. Often a carefully chosen data structure will allow the most efficient algorithm to be used. The choice of the data structure often begins from the choice of an abstract data type.
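As a small illustration of how the choice of data structure enables a more efficient algorithm, the Python sketch below (illustrative names only) performs the same membership test with a linear scan over a list, a binary search over a sorted list, and a hash-based set:

```python
import bisect

names = ["ada", "alan", "edgar", "grace", "peter"]  # already sorted

# The same abstract operation -- "is this name present?" -- costs
# O(n) on an unsorted list, O(log n) with binary search on a
# sorted list, and O(1) on average with a hash-based set.
def contains_linear(seq, x):
    return x in seq                     # scans element by element

def contains_binary(sorted_seq, x):
    i = bisect.bisect_left(sorted_seq, x)
    return i < len(sorted_seq) and sorted_seq[i] == x

name_set = set(names)                   # hash table

print(contains_linear(names, "grace"),
      contains_binary(names, "grace"),
      "grace" in name_set)
```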
A data model describes the structure of the data within a given domain and, by implication, the underlying structure of that domain itself. This means that a data model in fact specifies a dedicated grammar for a dedicated artificial language for that domain. A data model represents classes of entities (kinds of things) about which a company wishes to hold information, the attributes of that information, and relationships among those entities and (often implicit) relationships among those attributes. The model describes the organization of the data to some extent irrespective of how data might be represented in a computer system.
The entities represented by a data model can be tangible entities, but models that include such concrete entity classes tend to change over time. Robust data models often identify abstractions of such entities. For example, a data model might include an entity class called "Person", representing all the people who interact with an organization. Such an abstract entity class is typically more appropriate than ones called "Vendor" or "Employee", which identify specific roles played by those people.
Data model theory
The term data model can have two meanings:[17]
- A data model theory, i.e. a formal description of how data may be structured and accessed.
- A data model instance, i.e. applying a data model theory to create a practical data model instance for some particular application.
A data model theory has three main components:[17]
- The structural part: a collection of data structures which are used to create databases representing the entities or objects modeled by the database.
- The integrity part: a collection of rules governing the constraints placed on these data structures to ensure structural integrity.
- The manipulation part: a collection of operators which can be applied to the data structures, to update and query the data contained in the database.
For example, in the relational model, the structural part is based on a modified concept of the mathematical relation; the integrity part is expressed in first-order logic and the manipulation part is expressed using the relational algebra, tuple calculus and domain calculus.
A data model instance is created by applying a data model theory. This is typically done to solve some business enterprise requirement. Business requirements are normally captured by a semantic logical data model. This is transformed into a physical data model instance from which is generated a physical database. For example, a data modeler may use a data modeling tool to create an entity–relationship model of the corporate data repository of some business enterprise. This model is transformed into a relational model, which in turn generates a relational database.
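The following toy Python sketch, which assumes a deliberately simplified relation-as-set-of-tuples representation rather than any real DBMS, shows the three components side by side: a structural part, an integrity rule, and a manipulation operator.

```python
# Structural part: a "relation" is a set of tuples over named attributes.
ATTRIBUTES = ("id", "name", "balance")
accounts = {
    (1, "Ada", 120.0),
    (2, "Grace", 75.5),
}

# Integrity part: a rule constraining the allowed database states,
# here uniqueness of the key attribute "id".
def key_is_unique(relation, key_index=0):
    keys = [t[key_index] for t in relation]
    return len(keys) == len(set(keys))

# Manipulation part: operators that query or update the structures.
def select(relation, predicate):
    return {t for t in relation if predicate(t)}

assert key_is_unique(accounts)
rich = select(accounts, lambda t: t[2] > 100)
print(rich)   # {(1, 'Ada', 120.0)}
```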
Patterns
Patterns[18] are common data modeling structures that occur in many data models.
Related models
Data-flow diagram
A data-flow diagram (DFD) is a graphical representation of the "flow" of data through an information system. It differs from the flowchart as it shows the data flow instead of the control flow of the program. A data-flow diagram can also be used for the visualization of data processing (structured design). Data-flow diagrams were invented by Larry Constantine, the original developer of structured design,[20] based on Martin and Estrin's "data-flow graph" model of computation.
It is common practice to draw a context-level data-flow diagram first, which shows the interaction between the system and outside entities. The DFD is designed to show how a system is divided into smaller portions and to highlight the flow of data between those parts. This context-level data-flow diagram is then "exploded" to show more detail of the system being modeled.
Information model
An information model is not a type of data model, but more or less an alternative model. Within the field of software engineering, both a data model and an information model can be abstract, formal representations of entity types that include their properties, relationships and the operations that can be performed on them. The entity types in the model may be kinds of real-world objects, such as devices in a network, or they may themselves be abstract, such as the entities used in a billing system. Typically, they are used to model a constrained domain that can be described by a closed set of entity types, properties, relationships and operations.
According to Lee (1999)[21] an information model is a representation of concepts, relationships, constraints, rules, and operations to specify data semantics for a chosen domain of discourse. It can provide a sharable, stable, and organized structure of information requirements for the domain context.[21] More generally, the term information model is used for models of individual things, such as facilities, buildings, process plants, etc. In those cases the concept is specialised to Facility Information Model, Building Information Model, Plant Information Model, etc. Such an information model is an integration of a model of the facility with the data and documents about the facility.
An information model provides formalism to the description of a problem domain without constraining how that description is mapped to an actual implementation in software. There may be many mappings of the information model. Such mappings are called data models, irrespective of whether they are object models (e.g. using UML), entity–relationship models or XML schemas.

Object model
An object model in computer science is a collection of objects or classes through which a program can examine and manipulate some specific parts of its world. In other words, the object-oriented interface to some service or system. Such an interface is said to be the object model of the represented service or system. For example, the Document Object Model (DOM) [1] is a collection of objects that represent a page in a web browser, used by script programs to examine and dynamically change the page. There is a Microsoft Excel object model[22] for controlling Microsoft Excel from another program, and the ASCOM Telescope Driver[23] is an object model for controlling an astronomical telescope.
In computing the term object model has a distinct second meaning of the general properties of objects in a specific computer programming language, technology, notation or methodology that uses them. For example, the Java object model, the COM object model, or the object model of OMT. Such object models are usually defined using concepts such as class, message, inheritance, polymorphism, and encapsulation. There is an extensive literature on formalized object models as a subset of the formal semantics of programming languages.
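The browser DOM mentioned above is only available inside a browser, so the sketch below uses Python's standard-library xml.dom.minidom, a minimal implementation of the same W3C DOM interfaces, to show what examining and manipulating a document through its object model looks like in practice:

```python
from xml.dom.minidom import parseString

# Parse a tiny document into a tree of DOM objects.
doc = parseString("<page><title>Data model</title></page>")

# Examine the document through its object interface.
title = doc.getElementsByTagName("title")[0]
print(title.firstChild.data)          # -> Data model

# Manipulate it: append a new element node with text content.
para = doc.createElement("p")
para.appendChild(doc.createTextNode("An abstract model of data."))
doc.documentElement.appendChild(para)

print(doc.documentElement.toxml())
```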
Object–role modeling
Object–Role Modeling (ORM) is a method for conceptual modeling, and can be used as a tool for information and rules analysis.[25]
Object–Role Modeling is a fact-oriented method for performing systems analysis at the conceptual level. The quality of a database application depends critically on its design. To help ensure correctness, clarity, adaptability and productivity, information systems are best specified first at the conceptual level, using concepts and language that people can readily understand.
The conceptual design may include data, process and behavioral perspectives, and the actual DBMS used to implement the design might be based on one of many logical data models (relational, hierarchic, network, object-oriented, etc.).[26]
Unified Modeling Language models
The Unified Modeling Language (UML) is a standardized general-purpose modeling language in the field of software engineering. It is a graphical language for visualizing, specifying, constructing, and documenting the artifacts of a software-intensive system. The Unified Modeling Language offers a standard way to write a system's blueprints, including:[27]
- Conceptual things such as business processes and system functions
- Concrete things such as programming language statements, database schemas, and reusable software components.
UML offers a mix of functional models, data models, and database models.
References
- ^ Paul R. Smith & Richard Sarfaty Publications, LLC 2009
- ^ "What is a Data Model?". princeton.edu. Retrieved 29 May 2024.
- ^ "UML Domain Modeling - Stack Overflow". Stack Overflow. Stack Exchange Inc. Retrieved 4 February 2017.
- ^ Michael R. McCaleb (1999). "A Conceptual Data Model of Datum Systems" Archived 2008-09-21 at the Wayback Machine. National Institute of Standards and Technology. August 1999.
- ^ a b c d e f g h i j k Matthew West and Julian Fowler (1999). Developing High Quality Data Models. The European Process Industries STEP Technical Liaison Executive (EPISTLE).
- ^ American National Standards Institute. 1975. ANSI/X3/SPARC Study Group on Data Base Management Systems; Interim Report. FDT (Bulletin of ACM SIGMOD) 7:2.
- ^ Young, J. W., and Kent, H. K. (1958). "Abstract Formulation of Data Processing Problems". In: Journal of Industrial Engineering. Nov-Dec 1958. 9(6), pp. 471–479
- ^ a b Janis A. Bubenko jr (2007) "From Information Algebra to Enterprise Modelling and Ontologies - a Historical Perspective on Modelling for Information Systems". In: Conceptual Modelling in Information Systems Engineering. John Krogstie et al. eds. pp 1–18
- ^ Cornelius T. Leondes (2002). Database and Data Communication Network Systems: Techniques and Applications. Page 7
- ^ "Derivability, Redundancy, and Consistency of Relations Stored in Large Data Banks", E.F. Codd, IBM Research Report, 1969
- ^ William Kent (1978). Data and Reality.
- ^ Jan L. Harrington (2000). Object-oriented Database Design Clearly Explained. p.4
- ^ a b c d FIPS Publication 184 Archived 2013-12-03 at the Wayback Machine released of IDEF1X by the Computer Systems Laboratory of the National Institute of Standards and Technology (NIST). 21 December 1993 (withdrawn in 2008).
- ^ Wade, T. and Sommer, S. eds. A to Z GIS
- ^ a b c d David R. Soller1 and Thomas M. Berg (2003). The National Geologic Map Database Project: Overview and Progress U.S. Geological Survey Open-File Report 03–471.
- ^ Whitten, Jeffrey L.; Lonnie D. Bentley, Kevin C. Dittman. (2004). Systems Analysis and Design Methods. 6th edition. ISBN 0-256-19906-X.
- ^ a b Beynon-Davies P. (2004). Database Systems 3rd Edition. Palgrave, Basingstoke, UK. ISBN 1-4039-1601-2
- ^ "The Data Model Resource Book: Universal Patterns for Data Modeling" Len Silverstone & Paul Agnew (2008).
- ^ John Azzolini (2000). Introduction to Systems Engineering Practices. July 2000.
- ^ W. Stevens, G. Myers, L. Constantine, "Structured Design", IBM Systems Journal, 13 (2), 115–139, 1974.
- ^ a b Y. Tina Lee (1999). "Information modeling from design to implementation" National Institute of Standards and Technology.
- ^ Excel Object Model Overview
- ^ "ASCOM General Requirements". 2011-05-13. Retrieved 2014-09-25.
- ^ Stephen M. Richard (1999). Geologic Concept Modeling. U.S. Geological Survey Open-File Report 99–386.
- ^ Joachim Rossberg and Rickard Redler (2005). Pro Scalable .NET 2.0 Application Designs. Page 27.
- ^ Object Role Modeling: An Overview (msdn.microsoft.com). Retrieved 19 September 2008.
- ^ Grady Booch, Ivar Jacobson & Jim Rumbaugh (2005) OMG Unified Modeling Language Specification.
Further reading
- David C. Hay (1996). Data Model Patterns: Conventions of Thought. New York: Dorset House Publishers, Inc.
- Len Silverston (2001). The Data Model Resource Book Volume 1/2. John Wiley & Sons.
- Len Silverston & Paul Agnew (2008). The Data Model Resource Book: Universal Patterns for data Modeling Volume 3. John Wiley & Sons.
- Matthew West (2011). Developing High Quality Data Models. Morgan Kaufmann.
Data model
Introduction
Definition and Purpose
A data model is an abstract framework that defines the structure, organization, and relationships of data within a system, serving as a blueprint for how information is represented and manipulated. According to E.F. Codd, a foundational figure in database theory, a data model consists of three core components: a collection of data structure types that form the building blocks of the database, a set of operators or inferencing rules for retrieving and deriving data, and a collection of integrity rules to ensure consistent states and valid changes.[6] This conceptualization bridges the gap between real-world entities and their digital counterparts, providing a conceptual toolset for describing entities, attributes, and interrelationships in a standardized manner.[7]

The primary purposes of a data model include facilitating clear communication among diverse stakeholders—such as business analysts, developers, and end-users—by offering a shared vocabulary and visual representation of data requirements during system analysis.[8] It ensures data integrity by enforcing constraints and rules that maintain accuracy, consistency, and reliability across the dataset, while supporting scalability through adaptable structures that accommodate growth and evolution with minimal disruption to existing applications.[6] Additionally, data models enable efficient querying and analysis by defining operations that optimize data access and manipulation, laying the groundwork for high-level languages and database management system architectures.[7]

In practice, data models abstract complex real-world phenomena into manageable formats, finding broad applications in databases for persistent storage, software engineering for system design, and business intelligence for deriving insights from structured information.[7] For instance, they help translate organizational needs into technical specifications, such as modeling customer interactions in a retail system or inventory relationships in supply chain software. Originating from mathematical set theory and adapted for computational environments, data models provide levels of abstraction akin to the three-schema architecture, which separates user views from physical storage.[9][8]
Three-Schema Architecture
The three-schema architecture, proposed by the ANSI/X3/SPARC Study Group on Database Management Systems, organizes database systems into three distinct levels of abstraction to manage data representation and access efficiently. This framework separates user interactions from the underlying data storage, promoting modularity and maintainability in database design.

At the external level, also known as the view level, the architecture defines user-specific schemas that present customized subsets of the data tailored to individual applications or user groups. These external schemas hide irrelevant details and provide a simplified, application-oriented perspective, such as predefined queries or reports, without exposing the full database structure. The conceptual level, or logical level, describes the overall logical structure of the entire database in a storage-independent manner, including entities, relationships, constraints, and data types that represent the community's view of the data. It serves as a unified model for the database content, independent of physical implementation. Finally, the internal level, or physical level, specifies the physical storage details, such as file organizations, indexing strategies, access paths, and data compression methods, optimizing performance on specific hardware.

The architecture facilitates two key mappings to ensure consistency across levels: the external/conceptual mapping, which translates user views into the logical schema, and the conceptual/internal mapping, which defines how the logical structure is implemented physically. These mappings allow transformations, such as view derivations or storage optimizations, to maintain data integrity without redundant storage or direct user exposure to changes in other levels. By decoupling these layers, the framework achieves logical data independence—changes to the conceptual schema do not affect external views—and physical data independence—modifications to internal storage do not impact the conceptual or external levels. This separation reduces system complexity, enhances security by limiting user access to necessary views, and supports scalability in multi-user environments.

Originally outlined in the 1975 interim report, the three-schema architecture remains a foundational standard influencing modern database management systems (DBMS), where principles of layered abstraction underpin features like views in relational databases and schema evolution in distributed systems.
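A minimal sketch of the three levels using Python's built-in sqlite3 module (illustrative table and view names): the view stands in for an external schema, the table for the conceptual/logical schema, and SQLite's internal file layout for the physical schema. It also suggests logical data independence: the base table gains a column without breaking the view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # internal level handled by SQLite

# Conceptual/logical level: the community-wide description of the data.
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, salary REAL)")
conn.execute("INSERT INTO employee VALUES (1, 'Ada', 5000), (2, 'Grace', 6000)")

# External level: a user-specific view exposing only part of the data.
conn.execute("CREATE VIEW staff_directory AS SELECT id, name FROM employee")

# The conceptual schema can evolve (a new column) without breaking
# applications written against the external view -- logical data independence.
conn.execute("ALTER TABLE employee ADD COLUMN department TEXT")

print(conn.execute("SELECT * FROM staff_directory").fetchall())
# -> [(1, 'Ada'), (2, 'Grace')]
```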
Historical Development
Early Mathematical Foundations
The foundations of data modeling trace back to 19th-century mathematical developments, particularly set theory, which provided the abstract framework for organizing and relating elements without reference to physical implementation. Georg Cantor, in his pioneering work starting in 1872, formalized sets as collections of distinct objects, introducing concepts such as cardinality to compare sizes of infinite collections and equivalence relations to partition sets into subsets with shared properties.[10] These abstractions laid the groundwork for viewing data as structured collections, where relations could be defined as subsets of Cartesian products of sets, enabling the representation of dependencies and mappings between entities. Cantor's 1883 publication Grundlagen einer allgemeinen Mannigfaltigkeitslehre further developed transfinite ordinals and power sets, emphasizing hierarchical and relational structures that would later inform data organization.[11]

Parallel advancements in logic provided precursors to relational algebra, beginning with George Boole's 1847 treatise The Mathematical Analysis of Logic, which applied algebraic operations to logical classes. Boole represented classes by variables and defined operations like intersection (multiplication, xy) and union (addition, x + y) under laws of commutativity and distributivity, allowing equational expressions for propositions such as "All X is Y" as xy = x.[12] This Boolean algebra enabled the manipulation of relations between classes as abstract descriptors, forming a basis for querying and transforming data sets through logical operations. Building on this, Giuseppe Peano in the late 19th century contributed to predicate logic by standardizing notation for quantification and logical connectives in his 1889 Arithmetices principia, facilitating precise expressions of properties and relations over mathematical objects.[13][14]

Late 19th- and early 20th-century logicians extended these ideas by formalizing relations and entities more rigorously. Gottlob Frege's 1879 Begriffsschrift introduced predicate calculus, treating relations as functions that map arguments to truth values—for instance, a binary relation like "loves" as a function from pairs of entities to the truth value "The True."[15] This approach distinguished concepts (unsaturated functions) from objects (saturated entities), providing a blueprint for entity-relationship modeling where data elements are linked via functional dependencies. Bertrand Russell advanced this in The Principles of Mathematics (1903), analyzing relations as fundamental to mathematical structures and developing type theory to handle relational orders without paradoxes, emphasizing that mathematics concerns relational patterns rather than isolated objects.[16]

Mathematical abstractions of graphs and trees, emerging in the 19th century, offered additional tools for representing hierarchical and networked data. Leonhard Euler's 1736 solution to the Königsberg bridge problem implicitly used graph-like structures to model connectivity, but systematic development came with Arthur Cayley's 1857 enumeration of trees as rooted, acyclic graphs with labeled instances for vertices.[17] Gustav Kirchhoff's 1847 work on electrical networks formalized trees as spanning subgraphs minimizing connections, highlighting their role in describing minimal relational paths. These concepts treated data as nodes and edges without computational context, focusing on topological properties like paths and cycles.
Abstract descriptors such as tuples, relations, and functions crystallized in 19th-century mathematics as tools for precise data specification. Tuples, as ordered sequences of elements, emerged from Cantor's work on mappings.[10] Relations were codified as subsets of product sets, as in De Morgan's 1860 calculus of relations, which treated binary relations as compositions of functions between classes.[18] Functions, formalized by Dirichlet in 1837 as arbitrary mappings from one set to another, provided a unidirectional relational model, independent of analytic expressions. These elements—tuples for bundling attributes, relations for associations, and functions for transformations—served as purely theoretical constructs for describing data structures.

In the 1940s and 1950s, these mathematical ideas began informing initial data representation in computing, as abstractions like sets for memory collections and graphs for data flows influenced designs such as Alan Turing's 1945 Automatic Computing Engine, which used structured addressing akin to tree hierarchies for organizing binary data.[19] This transition marked the shift from pure theory to practical abstraction, where logical relations and set operations guided early conceptualizations of data storage and retrieval.
Evolution in Computing and Databases
In the 1950s and early 1960s, data management in computing relied primarily on file-based systems, where data was stored in sequential or indexed files on magnetic tapes or disks, often customized for specific applications without standardized structures for sharing across programs.[20] These systems, prevalent in early mainframes like the IBM 1401, lacked efficient querying and required programmers to navigate data manually via application code, leading to redundancy and maintenance challenges.[21]

A pivotal advancement came in 1966 with IBM's Information Management System (IMS), developed for NASA's Apollo program to handle hierarchical data structures resembling organizational charts or bill-of-materials.[22] IMS organized data into tree-like hierarchies with parent-child relationships, enabling faster access for transactional processing but limiting flexibility for complex many-to-many associations.[4] This hierarchical model influenced early database management systems (DBMS) by introducing segmented storage and navigational access methods.[23] By the late 1960s, the limitations of hierarchical models prompted the development of network models. In 1971, the Conference on Data Systems Languages (CODASYL) Database Task Group (DBTG) released specifications for a network data model, allowing records to participate in multiple parent-child sets for more general graph-like structures.[24] Implemented in systems like Integrated Data Store (IDS), this model supported pointer-based navigation but required complex schema definitions and low-level programming, complicating maintenance.[25]

The relational model marked a revolutionary shift in the 1970s. In 1970, Edgar F. Codd published "A Relational Model of Data for Large Shared Data Banks," proposing data organization into tables (relations) with rows and columns, using keys for integrity and relational algebra—building on mathematical set theory—for declarative querying independent of physical storage.[5] This abstraction from navigational access to set-based operations addressed data independence, reducing application dependencies on storage details.[26] To operationalize relational concepts, query languages emerged. In 1974, Donald D. Chamberlin and Raymond F. Boyce developed SEQUEL (later SQL) as part of IBM's System R prototype, providing a structured English-like syntax for data manipulation and retrieval in relational databases.[27] SQL's declarative nature allowed users to specify what data they wanted without specifying how to retrieve it, facilitating broader adoption.[28] Conceptual modeling also advanced with Peter Pin-Shan Chen's 1976 entity-relationship (ER) model, which formalized diagrams for entities, attributes, and relationships to bridge user requirements and database design.[29] Widely used for schema planning, the ER model complemented relational implementations by emphasizing semantics.[30]

The 1980s saw commercialization and standardization. SQL was formalized as ANSI X3.135 in 1986, establishing a portable query standard across vendors and enabling interoperability.[31] IBM released DB2 in 1983 as a production relational DBMS for mainframes, supporting SQL and transactions for enterprise workloads.[32] Oracle had already shipped Version 2 in 1979, the first commercial SQL relational DBMS, emphasizing portability across hardware.[33] The 1990s extended relational paradigms to object-oriented needs.
In 1993, the Object Data Management Group (ODMG) published ODMG-93, standardizing object-oriented DBMS with Object Definition Language (ODL) for schemas, Object Query Language (OQL) for queries, and bindings to languages like C++.[34] This addressed complex data like multimedia by integrating objects with relational persistence.[35] Overall, this era transitioned from rigid, navigational file and hierarchical/network systems to flexible, declarative relational models, underpinning modern DBMS through data independence and standardization.[21]
Types of Data Models
Hierarchical and Network Models
The hierarchical data model organizes data in a tree-like structure, where each record, known as a segment in systems like IBM's Information Management System (IMS), has a single parent but can have multiple children, establishing one-to-many relationships.[36] In IMS, the root segment serves as the top-level parent with one occurrence per database record, while child segments—such as those representing illnesses or treatments under a patient record—can occur multiply based on non-unique keys like dates, enabling ordered storage in ascending sequence for efficient sequential access.[36] This structure excels in representing naturally ordered data, such as file systems or organizational charts, where predefined paths facilitate straightforward navigation from parent to child.[37] However, the hierarchical model is limited in supporting many-to-many relationships, as it enforces strict one-to-many links without native mechanisms for multiple parents, often requiring redundant segments as workarounds that increase storage inefficiency.[36] Access relies on procedural navigation, traversing fixed hierarchical paths sequentially, which suits simple queries but becomes cumbersome for complex retrievals involving non-linear paths.[37]

The network data model, standardized by the Conference on Data Systems Languages (CODASYL) in the early 1970s, extends this by representing data as records connected through sets, allowing more flexible graph-like topologies.[38] A set defines a named relationship between one owner record type and one or more member record types, where the owner acts as a parent to multiple members, and members can belong to multiple sets, supporting many-to-one or many-to-many links via pointer chains or rings.[39] For instance, a material record might serve as a member in sets owned by different components like cams or gears, enabling complex interlinks; implementation typically uses forward and backward pointers to traverse these relations efficiently within a set.[38] Access in CODASYL systems, such as through Data Manipulation Language (DML) commands like FIND NEXT or FIND OWNER, remains procedural, navigating via these links.[24]

While the network model overcomes the hierarchical model's restriction to single-parentage by permitting records to have multiple owners, both approaches share reliance on procedural navigation, requiring explicit path traversal that leads to query inefficiencies, such as sequential pointer following for ad-hoc retrievals across multiple sets.[24] These models dominated database systems on mainframes during the 1960s and 1970s, with IMS developed by IBM in 1966 for Apollo program inventory tracking and CODASYL specifications emerging from 1969 reports to standardize network structures.[40] Widely adopted in industries like manufacturing and aerospace for their performance in structured, high-volume transactions, they persist as legacy systems in some enterprises but have influenced modern hierarchical representations in formats like XML and JSON, which adopt tree-based nesting for semi-structured data.[41][42]
Relational Model
The relational model, introduced by Edgar F. Codd in 1970, represents data as a collection of relations, each consisting of tuples organized into attributes, providing a declarative framework for database design that emphasizes logical structure over physical implementation.[5] A relation is mathematically equivalent to a set of tuples, where each tuple is an ordered list of values corresponding to the relation's attributes, ensuring no duplicate tuples exist to maintain set semantics.[5] Attributes define the domains of possible values, typically atomic to adhere to first normal form, while primary keys uniquely identify each tuple within a relation, and foreign keys enforce referential integrity by linking tuples across relations through shared values.[5]

Relational algebra serves as the formal query foundation of the model, comprising a set of operations on relations that produce new relations, enabling precise data manipulation without specifying access paths.[5] Key operations include selection (σ), which filters tuples satisfying a condition, expressed as σ_P(R) where R is a relation and P is a predicate on its attributes; for example, σ_age>30(Employee) retrieves all employee tuples where age exceeds 30.[5] Projection (π) extracts specified attributes, eliminating duplicates, as in π_name,salary(Employee) to obtain unique names and salaries.[5] Join (⋈) combines relations based on a condition, such as R ⋈ S on R.id = S.id to match related tuples from R and S on a shared identifier.[5] Other fundamental operations are union (∪), merging compatible relations while removing duplicates, and difference (−), yielding tuples in one relation but not another, both preserving relational structure.[5] These operations are closed, compositional, and form a complete query language when including rename (ρ) for attribute relabeling.[5]

Normalization theory addresses redundancy and anomaly prevention by decomposing relations into smaller, dependency-preserving forms based on functional dependencies (FDs), where an FD X → Y indicates that attribute set X uniquely determines Y.[43] First normal form (1NF) requires atomic attribute values and no repeating groups, ensuring each tuple holds indivisible entries.[43] Second normal form (2NF) builds on 1NF by eliminating partial dependencies, where non-prime attributes depend fully on the entire primary key, not subsets.[43] Third normal form (3NF) further removes transitive dependencies, mandating that non-prime attributes depend only on candidate keys.[43] Boyce-Codd normal form (BCNF) strengthens 3NF by requiring every determinant to be a candidate key, resolving certain irreducibility issues while aiming to preserve all FDs without lossy joins.[43]

The model's advantages include data independence, separating logical schema from physical storage to allow modifications without application changes, and support for ACID properties—atomicity, consistency, isolation, durability—in transaction processing to ensure reliable concurrent access.[5][44] SQL (Structured Query Language), developed as a practical interface, translates relational algebra into user-friendly declarative statements for querying and manipulation. However, the model faces limitations in natively representing complex, nested objects like multimedia or hierarchical structures, often requiring denormalization or extensions that compromise purity.[45]
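A hedged Python sketch of the selection, projection, and join operators described above, over relations represented as lists of dictionaries (illustrative data, not a real query engine):

```python
# A relation as a list of attribute -> value mappings (tuples).
employee = [
    {"id": 1, "name": "Ada",   "age": 36, "dept": 10},
    {"id": 2, "name": "Grace", "age": 28, "dept": 20},
]
department = [
    {"dept": 10, "dname": "Research"},
    {"dept": 20, "dname": "Design"},
]

def select(rel, pred):                       # sigma_P(R)
    return [t for t in rel if pred(t)]

def project(rel, attrs):                     # pi_attrs(R)
    seen, out = set(), []
    for t in rel:
        row = tuple(t[a] for a in attrs)
        if row not in seen:                  # duplicates eliminated
            seen.add(row)
            out.append(dict(zip(attrs, row)))
    return out

def join(r, s, attr):                        # R join S on a shared attribute
    return [{**t, **u} for t in r for u in s if t[attr] == u[attr]]

print(select(employee, lambda t: t["age"] > 30))   # sigma_{age>30}(Employee)
print(project(employee, ["name"]))                 # pi_name(Employee)
print(join(employee, department, "dept"))          # Employee join Department
```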
Object-Oriented and NoSQL Models
The object-oriented data model extends traditional data modeling by incorporating object-oriented programming principles, such as classes, inheritance, and polymorphism, to represent both data and behavior within a unified structure.[46] In this model, data is stored as objects that encapsulate attributes and methods, allowing for complex relationships like inheritance hierarchies where subclasses inherit properties from parent classes, and polymorphism enables objects of different classes to be treated uniformly through common interfaces.[47] The Object Data Management Group (ODMG) standard, particularly ODMG 3.0, formalized these concepts by defining a core object model, object definition language (ODL), and bindings for languages like C++ and Java, ensuring portability across object database systems.[48] This integration facilitates seamless persistence of objects from object-oriented languages, such as Java, where developers can store and retrieve class instances directly without manual mapping to relational tables, reducing impedance mismatch in applications involving complex entities like multimedia or CAD designs.[49] For instance, Java objects adhering to ODMG can be persisted using standard APIs that abstract underlying storage, supporting operations like traversal of inheritance trees and dynamic method invocation.[50]

NoSQL models emerged in the 2000s to address relational models' limitations in scalability and schema rigidity for unstructured or semi-structured data in distributed environments, prioritizing horizontal scaling over strict ACID compliance.[51] These models encompass several variants, including document stores, key-value stores, column-family stores, and graph databases, each optimized for specific data access patterns in big data scenarios.
Document-oriented NoSQL databases store data as self-contained, schema-flexible documents, often in JSON-like formats, enabling nested structures and varying fields per document to handle diverse, evolving data without predefined schemas.[51] MongoDB exemplifies this approach, using BSON (Binary JSON) documents that support indexing on embedded fields and aggregation pipelines for querying hierarchical data, making it suitable for content management and real-time analytics.[51]

Key-value stores provide simple, high-performance access to data via unique keys mapping to opaque values, ideal for caching and session management where fast lookups predominate over complex joins.[52] Redis, a prominent key-value system, supports data structures like strings, hashes, and lists as values, with in-memory storage for sub-millisecond latencies and persistence options for durability.[52]

Column-family (or wide-column) stores organize data into rows with dynamic columns grouped into families, allowing sparse, variable schemas across large-scale distributed tables to manage high-velocity writes and reads.[51] Apache Cassandra, for example, uses a sorted map of column families per row key, enabling tunable consistency and linear scalability across clusters for time-series data and IoT applications.[52]

Graph models within NoSQL represent data as nodes (entities), edges (relationships), and properties (attributes on nodes or edges), excelling in scenarios requiring traversal of interconnected data like recommendations or fraud detection.[53] Neo4j implements the property graph model, where nodes and directed edges carry key-value properties, and supports the Cypher query language for pattern matching, such as finding shortest paths in social networks via declarative syntax like MATCH (a:Person)-[:FRIENDS_WITH*1..3]-(b:Person) RETURN a, b.[54]
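The following sketch uses plain Python dictionaries, not the actual MongoDB, Redis, or Cassandra APIs, to suggest how the same order might be shaped as a document, a key-value entry, and a sparse wide-column row:

```python
import json

# Document store: a self-contained, schema-flexible record.
document = {
    "_id": "order-42",
    "customer": {"name": "Ada", "email": "ada@example.com"},
    "items": [{"sku": "C-1", "qty": 2}, {"sku": "D-7", "qty": 1}],
}

# Key-value store: an opaque value looked up by a unique key.
kv_store = {"session:ada": json.dumps({"cart": ["C-1", "D-7"]})}

# Wide-column store: rows keyed by a row key, with sparse,
# per-row column families.
wide_column = {
    "order-42": {"info": {"customer": "Ada"}, "items": {"C-1": 2, "D-7": 1}},
    "order-43": {"info": {"customer": "Grace"}},   # different columns
}

print(document["items"][0]["sku"],
      json.loads(kv_store["session:ada"])["cart"],
      wide_column["order-43"]["info"])
```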
A key trade-off in NoSQL models, particularly in distributed systems, is balancing scalability against consistency, as articulated by the CAP theorem, which posits that a system can only guarantee two of three properties: Consistency (all nodes see the same data), Availability (every request receives a response), and Partition tolerance (the system continues operating despite network partitions).[55] Many NoSQL databases, like Cassandra, favor availability and partition tolerance (AP systems) with eventual consistency, using mechanisms such as quorum reads to reconcile updates, while graph stores like Neo4j often prioritize consistency for accurate traversals at the cost of availability during partitions.[56]
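The trade-off can be made concrete with a toy quorum scheme over N replicas: choosing read and write quorum sizes with R + W > N forces every read to overlap the most recent write, while smaller quorums improve availability at the risk of stale reads. The sketch below is schematic and does not reflect any particular database's replication logic.

    # Toy quorum illustration: N replicas each hold a (value, version) pair.
    N = 3
    replicas = [{"value": None, "version": 0} for _ in range(N)]

    def write(value, W=2):
        """Succeed once W replicas have acknowledged the new version."""
        version = max(r["version"] for r in replicas) + 1
        for r in replicas[:W]:                 # pretend only W replicas are reachable
            r.update(value=value, version=version)
        return version

    def read(R=2):
        """Query R replicas and return the highest-versioned value seen."""
        contacted = replicas[-R:]              # possibly a different subset than the write
        return max(contacted, key=lambda r: r["version"])["value"]

    write("v1", W=2)
    print(read(R=2))   # R + W = 4 > N = 3: read set overlaps write set, returns "v1"
    print(read(R=1))   # R + W = 3 = N: the single replica contacted may be stale (None)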
Semantic and Specialized Models
The entity-relationship (ER) model is a conceptual data model that represents data in terms of entities, attributes, and relationships to capture the semantics of an information system.[29] Entities are objects or things in the real world with independent existence, such as "Employee" or "Department," each described by attributes like name or ID.[29] Relationships define associations between entities, such as "works in," with cardinality constraints specifying participation ratios: one-to-one (1:1), one-to-many (1:N), or many-to-many (N:M).[29] This model facilitates the design of relational databases by mapping entities to tables, attributes to columns, and relationships to foreign keys or junction tables.[29]

Semantic models extend data representation by emphasizing meaning and logical inference, enabling knowledge sharing across systems. The Resource Description Framework (RDF) structures data as triples consisting of a subject (resource), predicate (property), and object (value or resource), forming directed graphs for linked data.[57] RDF supports interoperability on the web by allowing statements like "Paris (subject) isCapitalOf (predicate) France (object)."[57] Ontologies built on RDF, such as those using the Web Ontology Language (OWL), define classes, properties, and axioms for reasoning, including subclass relationships and equivalence classes to infer new knowledge.[58] OWL enables automated inference, such as deducing that if "Cat" is a subclass of "Mammal" and "Mammal" has the property "breathes air," then instances of "Cat" inherit that property.[58]

Geographic data models specialize in representing spatial information for geographic information systems (GIS). The vector model uses discrete geometric primitives (points for locations, lines for paths, and polygons for areas) to depict features like cities or rivers, with coordinates defining their positions.[59] In contrast, the raster model organizes data into a grid of cells (pixels), each holding a value for continuous phenomena like elevation or temperature, suitable for analysis over large areas.[59] Spatial relationships, such as topology, capture connectivity and adjacency (e.g., shared boundaries between polygons) in systems like ArcGIS, enabling operations like overlay analysis.[59]

Generic models provide abstraction for diverse domains, often serving as bridges to implementation. Unified Modeling Language (UML) class diagrams model static structures with classes (entities), attributes, and associations, offering a visual notation for object-oriented design across software systems.[60] For semi-structured data, XML Schema defines document structures, elements, types, and constraints using XML syntax, ensuring validation of hierarchical formats.[61] Similarly, JSON Schema specifies the structure of JSON documents through keywords like "type," "properties," and "required," supporting validation for web APIs and configuration files.[62] These semantic and specialized models also incorporate inference rules and domain-specific constraints to enforce semantics beyond basic structure.
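Such structural constraints can be checked mechanically. The following minimal sketch validates JSON documents with the third-party Python jsonschema package (assumed to be installed); the schema and documents are invented for illustration.

    # Sketch of JSON Schema validation using the "jsonschema" package.
    from jsonschema import ValidationError, validate

    schema = {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "port": {"type": "integer", "minimum": 1, "maximum": 65535},
        },
        "required": ["name", "port"],
    }

    validate(instance={"name": "api", "port": 8080}, schema=schema)   # passes silently

    try:
        validate(instance={"name": "api"}, schema=schema)             # missing "port"
    except ValidationError as err:
        print("rejected:", err.message)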
In semantic models, OWL's description logic allows rule-based deduction, such as transitive properties for "partOf" relations.[58] Geographic models apply constraints like topological consistency (e.g., no overlapping polygons without intersection) and operations such as spatial joins, which combine datasets based on proximity or containment to derive new insights, like aggregating population within flood zones.[63] In conceptual design, they link high-level semantics to the three-schema architecture by refining user views into logical schemas.[29]
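The effect of declaring a property such as "partOf" transitive can be illustrated with a hand-rolled closure computation; the sketch below is a toy stand-in for what an OWL reasoner would derive, and the facts are invented.

    # Toy transitive inference over invented "partOf" facts (not an OWL reasoner).
    part_of = {("wheel", "car"), ("car", "fleet")}

    def transitive_closure(pairs):
        """Keep adding (a, c) whenever (a, b) and (b, c) hold, until a fixpoint."""
        closure = set(pairs)
        while True:
            derived = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
            if derived <= closure:
                return closure
            closure |= derived

    print(transitive_closure(part_of))
    # includes ('wheel', 'fleet'), inferred because partOf is treated as transitive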
Core Concepts
Data Modeling Process
The data modeling process is a structured workflow that transforms business requirements into a blueprint for data storage and management, ensuring alignment with organizational needs and system efficiency. It typically unfolds in sequential yet iterative phases, beginning with understanding the domain and culminating in a deployable database schema. This methodology supports forward engineering, where models are built from abstract concepts to concrete implementations, and reverse (backward) engineering, where existing databases are analyzed to generate or refine models. Tools such as ER/Studio or erwin Data Modeler facilitate these techniques by automating diagram generation, schema validation, and iterative refinement through visual interfaces and scripting capabilities.[2][64]

The initial phase, requirements analysis, involves gathering and documenting business rules, user needs, and data flows through interviews, workshops, and documentation review. Stakeholders, including business analysts and end users, play a critical role in this stage, capturing accurate domain knowledge and resolving early ambiguities, such as unclear entity definitions or conflicting rules, to prevent downstream rework. This phase establishes the foundation for subsequent modeling by identifying key entities, processes, and constraints without delving into technical details.[2][65]

Following requirements analysis, conceptual modeling creates a high-level abstraction of the data structure, often using entity-relationship diagrams to depict entities, attributes, and relationships in business terms. This phase focuses on clarity and completeness, avoiding implementation specifics so that the model communicates effectively with non-technical audiences. It serves as a bridge to more detailed designs, emphasizing iterative feedback to refine the model based on stakeholder validation.[2]

In the logical design phase, the conceptual model is refined into a detailed schema that specifies data types, keys, and relationships while applying techniques like normalization to eliminate redundancies and ensure data integrity, as illustrated in the decomposition sketch below. Normalization, a core aspect of relational model development, organizes data into tables so as to minimize insert, update, and delete anomalies. This step produces a technology-agnostic model ready for physical implementation, with tools enabling automated consistency checks.[2]

The physical design phase translates the logical model into a database-specific implementation, incorporating elements like indexing for query optimization, partitioning for large-scale data distribution, and storage parameters tailored to the chosen database management system. Performance considerations, such as denormalization in read-heavy scenarios, help ensure scalability as data volumes grow, balancing query speed against maintenance complexity. Iterative refinement here involves prototyping and testing to validate the design against real-world loads.[2][65]

Best practices throughout the process emphasize continuous stakeholder involvement to maintain alignment with evolving business needs and to handle ambiguities through prototyping or sample data analysis. Ensuring scalability involves anticipating data growth by designing flexible structures, such as modular entities that support future extensions without major overhauls.
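Returning to the normalization performed during logical design, the following sketch decomposes repeated customer details into their own table; SQLite is used only because it ships with Python, and all table and column names are invented.

    # Sketch of a normalization step: store each customer once, reference by key.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        -- Unnormalized shape (not created): orders_flat(order_id, customer_name,
        -- customer_email, item) repeats customer details on every order row.

        CREATE TABLE customer (
            customer_id INTEGER PRIMARY KEY,
            name  TEXT NOT NULL,
            email TEXT NOT NULL UNIQUE
        );
        CREATE TABLE "order" (
            order_id    INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
            item        TEXT NOT NULL
        );
    """)
    con.execute("INSERT INTO customer (name, email) VALUES (?, ?)", ("Alice", "a@example.org"))
    con.execute('INSERT INTO "order" (customer_id, item) VALUES (?, ?)', (1, "widget"))
    print(con.execute(
        'SELECT o.order_id, c.name FROM "order" o JOIN customer c USING (customer_id)'
    ).fetchall())   # [(1, 'Alice')]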
Model quality can be assessed using metrics like cohesion, which measures how well each entity captures a single, well-defined business concept, and coupling, which evaluates the degree of inter-entity dependencies; lower coupling promotes maintainability.[65] Common pitfalls include overlooking constraints like referential integrity rules, which can lead to data inconsistencies, or ignoring projected data volume growth, resulting in performance bottlenecks. To mitigate these, practitioners recommend regular validation cycles and documentation of assumptions, fostering robust models that support long-term system reliability.[65]
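Coupling can be approximated very roughly by counting how many other entities each entity depends on; the following sketch uses an invented dependency list and threshold, and real assessments rely on richer criteria.

    # Rough coupling measure: distinct entities referenced by each entity.
    from collections import defaultdict

    # (dependent entity, referenced entity) pairs from a hypothetical model
    dependencies = [
        ("Order", "Customer"), ("Order", "Product"), ("Order", "Warehouse"),
        ("Invoice", "Order"),
    ]

    coupling = defaultdict(set)
    for source, target in dependencies:
        coupling[source].add(target)

    for entity, targets in sorted(coupling.items()):
        flag = "  <- high coupling" if len(targets) > 2 else ""
        print(f"{entity}: {len(targets)}{flag}")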
Key Properties and Patterns
Data models incorporate several core properties to ensure reliability and robustness in representing and managing information. Entity integrity requires that each row in a table be uniquely identifiable by its primary key, preventing duplicate or null values in key fields so that every entity remains distinct. Referential integrity enforces that foreign key values in one table match primary key values in another or are null, preserving valid relationships across tables, as illustrated in the enforcement sketch below. Consistency is achieved through ACID properties in transactional systems, where atomicity ensures operations complete fully or not at all, consistency requires that each transaction leaves the database in a valid state, isolation prevents interference between concurrent transactions, and durability guarantees committed changes persist despite failures. Security in data models involves access controls, such as role-based mechanisms that restrict user permissions to read, write, or modify specific data elements based on predefined policies. Extensibility allows data models to accommodate new attributes or structures without disrupting existing functionality, often through modular designs that support future enhancements.

Data organization within models relies on foundational structures to optimize storage and retrieval. Arrays provide sequential access for ordered collections, trees enable hierarchical relationships for nested data like organizational charts, and hashes facilitate fast lookups via key-value pairs in associative storage. These structures underpin properties like atomicity, which treats data operations as indivisible units, and durability, which ensures data survives system failures through mechanisms like logging or replication.

Common design patterns in data modeling promote reusability and efficiency. The singleton pattern ensures a single instance for unique entities, such as a global configuration table, avoiding redundancy. Factory patterns create complex objects, such as generating entity instances from type specifications in object-oriented models. Adapter patterns integrate legacy systems by wrapping incompatible interfaces, enabling data exchange without an overhaul. Anti-patterns, such as god objects (overly centralized entities handling multiple responsibilities), can lead to maintenance issues and reduced scalability by violating separation of concerns.

Evaluation of data models focuses on criteria like completeness, which assesses whether all necessary elements are represented without omissions; minimality, ensuring no redundant or extraneous components; and understandability, measuring how intuitively the model conveys structure and relationships to stakeholders. Methodologies like Data Vault 2.0 apply these patterns through hubs for core business keys, links for relationships, and satellites for descriptive attributes, facilitating scalable and auditable designs. Normal forms serve as a further tool to enforce properties like minimality by reducing redundancy in relational models.
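Entity and referential integrity are typically enforced by the database engine itself. The following minimal sketch demonstrates this with SQLite (chosen because it ships with Python); table and column names are invented.

    # Sketch of integrity enforcement: the engine rejects a dangling reference.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("PRAGMA foreign_keys = ON")   # SQLite leaves FK checks off by default
    con.executescript("""
        CREATE TABLE department (
            dept_id INTEGER PRIMARY KEY,      -- entity integrity: unique, non-null key
            name    TEXT NOT NULL
        );
        CREATE TABLE employee (
            emp_id  INTEGER PRIMARY KEY,
            name    TEXT NOT NULL,
            dept_id INTEGER REFERENCES department(dept_id)   -- referential integrity
        );
    """)
    con.execute("INSERT INTO department (dept_id, name) VALUES (1, 'Research')")
    con.execute("INSERT INTO employee (name, dept_id) VALUES ('Ada', 1)")       # valid reference

    try:
        con.execute("INSERT INTO employee (name, dept_id) VALUES ('Bob', 99)")  # no such department
    except sqlite3.IntegrityError as err:
        print("rejected:", err)               # FOREIGN KEY constraint failed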
Theoretical Foundations
The theoretical foundations of data models rest on mathematical structures from set theory, logic, and algebra, providing a rigorous basis for defining, querying, and constraining data representations. In the relational paradigm, the formal theory distinguishes between relational algebra and relational calculus. Relational algebra consists of a procedural set of operations, such as selection (σ), projection (π), union (∪), set difference (−), Cartesian product (×), and rename (ρ), applied to relations as sets of tuples. Relational calculus, in contrast, is declarative: tuple relational calculus (TRC) uses formulas of the form { t | P(t) }, where t is a tuple variable and P is a first-order logic formula, while domain relational calculus (DRC) quantifies over domain variables, as in { ⟨x₁, …, xₙ⟩ | P(x₁, …, xₙ) }. Codd's theorem proves the computational equivalence of relational algebra and safe relational calculus, asserting that they possess identical expressive power for querying relational databases; specifically, for any query expressible in one, there exists an equivalent formulation in the other, ensuring that declarative specifications can always be translated into procedural executions without loss of capability.

Dependency theory further solidifies these foundations by formalizing integrity constraints through functional dependencies (FDs), which capture semantic relationships in data. An FD X → Y on a relation schema R means that the values of the attributes in Y are uniquely determined by those in X; formally, for any two tuples t₁ and t₂ of R, if t₁[X] = t₂[X], then t₁[Y] = t₂[Y]. The Armstrong axioms form a sound and complete axiomatization for inferring all FDs from a given set:
- Reflexivity: If Y ⊆ X, then X → Y.
- Augmentation: If X → Y, then XZ → YZ for any set of attributes Z.
- Transitivity: If X → Y and Y → Z, then X → Z.
These axioms, derivable from set inclusion properties, enable the computation of dependency closures and are essential for schema normalization and constraint enforcement, as they guarantee that all implied FDs can be systematically derived.[66]
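The closure computation that these axioms justify is commonly implemented as the attribute-closure algorithm, sketched below for an invented relation schema and set of functional dependencies.

    # Attribute closure X+ under a set of FDs, each given as (lhs, rhs) attribute sets.
    def attribute_closure(attrs, fds):
        closure = set(attrs)
        changed = True
        while changed:
            changed = False
            for lhs, rhs in fds:
                # If the closure already determines lhs, it also determines rhs.
                if lhs <= closure and not rhs <= closure:
                    closure |= rhs
                    changed = True
        return closure

    fds = [({"A"}, {"B"}), ({"B"}, {"C"}), ({"C", "D"}, {"E"})]
    print(attribute_closure({"A"}, fds))        # -> {'A', 'B', 'C'} (order may vary)
    print(attribute_closure({"A", "D"}, fds))   # -> {'A', 'B', 'C', 'D', 'E'}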