Database

In computing, a database is an organized collection of data or a type of data store based on the use of a database management system (DBMS), the software that interacts with end users, applications, and the database itself to capture and analyze the data. The DBMS additionally encompasses the core facilities provided to administer the database. The sum total of the database, the DBMS and the associated applications can be referred to as a database system. Often the term "database" is also used loosely to refer to any of the DBMS, the database system or an application associated with the database.
Before digital storage and retrieval of data became widespread, index cards were used for data storage in a wide range of applications and environments: in the home to record and store recipes, shopping lists, contact information and other organizational data; in business to record presentation notes, project research and notes, and contact information; in schools as flash cards or other visual aids; and in academic research to hold data such as bibliographical citations or notes in a card file. Professional book indexers used index cards in the creation of book indexes until they were replaced by indexing software in the 1980s and 1990s.
Small databases can be stored on a file system, while large databases are hosted on computer clusters or cloud storage. The design of databases spans formal techniques and practical considerations, including data modeling, efficient data representation and storage, query languages, security and privacy of sensitive data, and distributed computing issues, including supporting concurrent access and fault tolerance.
Computer scientists may classify database management systems according to the database models that they support. Relational databases became dominant in the 1980s. These model data as rows and columns in a series of tables, and the vast majority use SQL for writing and querying data. In the 2000s, non-relational databases became popular, collectively referred to as NoSQL, because they use different query languages.
Terminology and overview
Formally, a "database" refers to a set of related data accessed through the use of a "database management system" (DBMS), which is an integrated set of computer software that allows users to interact with one or more databases and provides access to all of the data contained in the database (although restrictions may exist that limit access to particular data). The DBMS provides various functions that allow entry, storage and retrieval of large quantities of information and provides ways to manage how that information is organized.
Because of the close relationship between them, the term "database" is often used casually to refer to both a database and the DBMS used to manipulate it.
Outside the world of professional information technology, the term database is often used to refer to any collection of related data (such as a spreadsheet or a card index), even though size and usage requirements typically necessitate use of a database management system.[1]
Existing DBMSs provide various functions that allow management of a database and its data which can be classified into four main functional groups:
- Data definition – Creation, modification and removal of definitions that detail how the data is to be organized.
- Update – Insertion, modification, and deletion of the data itself.[2]
- Retrieval – Selecting data according to specified criteria (e.g., a query, a position in a hierarchy, or a position in relation to other data) and providing that data either directly to the user, or making it available for further processing by the database itself or by other applications. The retrieved data may be made available in a more or less direct form without modification, as it is stored in the database, or in a new form obtained by altering it or combining it with existing data from the database.[3]
- Administration – Registering and monitoring users, enforcing data security, monitoring performance, maintaining data integrity, dealing with concurrency control, and recovering information that has been corrupted by some event such as an unexpected system failure.[4]
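The four functional groups above can be sketched with Python's standard `sqlite3` module. This is a minimal illustration, not a description of any particular DBMS: the `book` table, its columns, and the function name `demo_functional_groups` are hypothetical, and SQLite's `PRAGMA integrity_check` stands in for the much broader set of administration facilities.

```python
import sqlite3


def demo_functional_groups():
    """Exercise one operation from each functional group on an in-memory database."""
    conn = sqlite3.connect(":memory:")

    # Data definition: describe how the data is organized.
    conn.execute(
        "CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT NOT NULL, year INTEGER)"
    )

    # Update: insert, modify, and delete the data itself.
    conn.executemany(
        "INSERT INTO book (title, year) VALUES (?, ?)",
        [("A", 1970), ("B", 1985), ("C", 1999)],
    )
    conn.execute("UPDATE book SET year = 1986 WHERE title = ?", ("B",))
    conn.execute("DELETE FROM book WHERE title = ?", ("C",))
    conn.commit()

    # Retrieval: select data according to specified criteria.
    rows = conn.execute(
        "SELECT title, year FROM book WHERE year > ? ORDER BY year", (1960,)
    ).fetchall()

    # Administration: ask the engine to verify its own storage integrity.
    integrity = conn.execute("PRAGMA integrity_check").fetchone()[0]

    conn.close()
    return rows, integrity
```

Calling `demo_functional_groups()` returns the two remaining rows together with SQLite's integrity report.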
Both a database and its DBMS conform to the principles of a particular database model.[5] "Database system" refers collectively to the database model, database management system, and database.[6]
Physically, database servers are dedicated computers that hold the actual databases and run only the DBMS and related software. Database servers are usually multiprocessor computers, with generous memory and RAID disk arrays used for stable storage. Hardware database accelerators, connected to one or more servers via a high-speed channel, are also used in large-volume transaction processing environments. DBMSs are found at the heart of most database applications. DBMSs may be built around a custom multitasking kernel with built-in networking support, but modern DBMSs typically rely on a standard operating system to provide these functions.[citation needed]
Since DBMSs comprise a significant market, computer and storage vendors often take into account DBMS requirements in their own development plans.[7]
Databases and DBMSs can be categorized according to the database model(s) that they support (such as relational or XML), the type(s) of computer they run on (from a server cluster to a mobile phone), the query language(s) used to access the database (such as SQL or XQuery), and their internal engineering, which affects performance, scalability, resilience, and security.
History
The sizes, capabilities, and performance of databases and their respective DBMSs have grown by orders of magnitude. These performance increases were enabled by technological progress in the areas of processors, computer memory, computer storage, and computer networks. The concept of a database was made possible by the emergence of direct access storage media such as magnetic disks, which became widely available in the mid-1960s; earlier systems relied on sequential storage of data on magnetic tape. The subsequent development of database technology can be divided into three eras based on data model or structure: navigational,[8] SQL/relational, and post-relational.
The two main early navigational data models were the hierarchical model and the CODASYL model (network model). These were characterized by the use of pointers (often physical disk addresses) to follow relationships from one record to another.
The relational model, first proposed in 1970 by Edgar F. Codd, departed from this tradition by insisting that applications should search for data by content, rather than by following links. The relational model employs sets of ledger-style tables, each used for a different type of entity. Only in the mid-1980s did computing hardware become powerful enough to allow the wide deployment of relational systems (DBMSs plus applications). By the early 1990s, however, relational systems dominated in all large-scale data processing applications, and as of 2018 they remain dominant: IBM Db2, Oracle, MySQL, and Microsoft SQL Server are the most searched DBMSs.[9] The dominant database language, standardized SQL for the relational model, has influenced database languages for other data models.[citation needed]
Object databases were developed in the 1980s to overcome the inconvenience of object–relational impedance mismatch, which led to the coining of the term "post-relational" and also the development of hybrid object–relational databases.
The next generation of post-relational databases in the late 2000s became known as NoSQL databases, introducing fast key–value stores and document-oriented databases. A competing "next generation" known as NewSQL databases attempted new implementations that retained the relational/SQL model while aiming to match the high performance of NoSQL compared to commercially available relational DBMSs.
1960s, navigational DBMS
The introduction of the term database coincided with the availability of direct-access storage (disks and drums) from the mid-1960s onwards. The term represented a contrast with the tape-based systems of the past, allowing shared interactive use rather than daily batch processing. The Oxford English Dictionary cites a 1962 report by the System Development Corporation of California as the first to use the term "data-base" in a specific technical sense.[10]
As computers grew in speed and capability, a number of general-purpose database systems emerged; by the mid-1960s a number of such systems had come into commercial use. Interest in a standard began to grow, and Charles Bachman, author of one such product, the Integrated Data Store (IDS), founded the Database Task Group within CODASYL, the group responsible for the creation and standardization of COBOL. In 1971, the Database Task Group delivered their standard, which generally became known as the CODASYL approach, and soon a number of commercial products based on this approach entered the market.
The CODASYL approach offered applications the ability to navigate around a linked data set which was formed into a large network. Applications could find records by one of three methods:
- Use of a primary key (known as a CALC key, typically implemented by hashing)
- Navigating relationships (called sets) from one record to another
- Scanning all the records in a sequential order
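The three access methods above can be sketched in plain Python. This is a toy model, not CODASYL's actual data manipulation language: the `Record` and `NavigationalStore` classes and all method names are hypothetical, a Python dictionary stands in for the hashed CALC key, and object references stand in for the pointers that linked the members of a set.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional


@dataclass
class Record:
    key: str
    data: str
    # Records owned by this record through a CODASYL-style "set".
    members: List["Record"] = field(default_factory=list)


class NavigationalStore:
    def __init__(self) -> None:
        self._by_key: Dict[str, Record] = {}  # hashed lookup, like a CALC key
        self._all: List[Record] = []          # storage order, for sequential scans

    def store(self, record: Record, owner: Optional[Record] = None) -> Record:
        self._by_key[record.key] = record
        self._all.append(record)
        if owner is not None:
            owner.members.append(record)      # link the record into the owner's set
        return record

    def find_by_calc_key(self, key: str) -> Optional[Record]:
        """Method 1: direct lookup by primary key."""
        return self._by_key.get(key)

    def walk_set(self, owner: Record) -> Iterator[Record]:
        """Method 2: navigate from an owner record to its member records."""
        yield from owner.members

    def scan(self) -> Iterator[Record]:
        """Method 3: visit every record in storage order."""
        yield from self._all


def build_example():
    store = NavigationalStore()
    dept = store.store(Record("D1", "Sales"))
    store.store(Record("E1", "Alice"), owner=dept)
    store.store(Record("E2", "Bob"), owner=dept)
    return store, dept
```

In this sketch the application, not the system, decides which of the three paths to take, which is the defining trait of the navigational style.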
Later systems added B-trees to provide alternate access paths. Many CODASYL databases also added a declarative query language for end users (as distinct from the navigational API). However, CODASYL databases were complex and required significant training and effort to produce useful applications.
IBM also had its own DBMS in 1966, known as Information Management System (IMS). IMS was a development of software written for the Apollo program on the System/360. IMS was generally similar in concept to CODASYL, but used a strict hierarchy for its model of data navigation instead of CODASYL's network model. Both concepts later became known as navigational databases due to the way data was accessed: the term was popularized by Bachman's 1973 Turing Award presentation The Programmer as Navigator. IMS is classified by IBM as a hierarchical database. IDMS and Cincom Systems' TOTAL databases are classified as network databases. IMS remains in use as of 2014.[11]
1970s, relational DBMS
Edgar F. Codd worked at IBM in San Jose, California, in an office primarily involved in the development of hard disk systems.[12] He was unhappy with the navigational model of the CODASYL approach, notably the lack of a "search" facility. In 1970, he wrote a number of papers that outlined a new approach to database construction that eventually culminated in the groundbreaking A Relational Model of Data for Large Shared Data Banks.[13]
The paper described a new system for storing and working with large databases. Instead of records being stored in some sort of linked list of free-form records as in CODASYL, Codd's idea was to organize the data as a number of "tables", each table being used for a different type of entity. Each table would contain a fixed number of columns containing the attributes of the entity. One or more columns of each table were designated as a primary key by which the rows of the table could be uniquely identified; cross-references between tables always used these primary keys, rather than disk addresses, and queries would join tables based on these key relationships, using a set of operations based on the mathematical system of relational calculus (from which the model takes its name). Splitting the data into a set of normalized tables (or relations) aimed to ensure that each "fact" was only stored once, thus simplifying update operations. Virtual tables called views could present the data in different ways for different users, but views could not be directly updated.
Codd used mathematical terms to define the model: relations, tuples, and domains rather than tables, rows, and columns. The terminology that is now familiar came from early implementations. Codd would later criticize the tendency for practical implementations to depart from the mathematical foundations on which the model was based.

The use of primary keys (user-oriented identifiers) to represent cross-table relationships, rather than disk addresses, had two primary motivations. From an engineering perspective, it enabled tables to be relocated and resized without expensive database reorganization. But Codd was more interested in the difference in semantics: the use of explicit identifiers made it easier to define update operations with clean mathematical definitions, and it also enabled query operations to be defined in terms of the established discipline of first-order predicate calculus; because these operations have clean mathematical properties, it becomes possible to rewrite queries in provably correct ways, which is the basis of query optimization. There is no loss of expressiveness compared with the hierarchic or network models, though the connections between tables are no longer so explicit.
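As one illustration of the provably correct rewrites mentioned above, a standard relational-algebra equivalence (selection pushdown, given here as a textbook example rather than something taken from Codd's paper) lets an optimizer filter rows before joining. Here R and S are arbitrary relations, and p is a predicate assumed to refer only to attributes of R:

```latex
\sigma_{p}(R \bowtie S) \;=\; \sigma_{p}(R) \bowtie S
\qquad \text{if } p \text{ refers only to attributes of } R
```

Both sides produce the same result, but the right-hand side typically feeds fewer rows into the join.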
In the hierarchic and network models, records were allowed to have a complex internal structure. For example, the salary history of an employee might be represented as a "repeating group" within the employee record. In the relational model, the process of normalization led to such internal structures being replaced by data held in multiple tables, connected only by logical keys.
For instance, a common use of a database system is to track information about users, their name, login information, various addresses and phone numbers. In the navigational approach, all of this data would be placed in a single variable-length record. In the relational approach, the data would be normalized into a user table, an address table and a phone number table (for instance). Records would be created in these optional tables only if the address or phone numbers were actually provided.
As well as identifying rows/records using logical identifiers rather than disk addresses, Codd changed the way in which applications assembled data from multiple records. Rather than requiring applications to gather data one record at a time by navigating the links, they would use a declarative query language that expressed what data was required, rather than the access path by which it should be found. Finding an efficient access path to the data became the responsibility of the database management system, rather than the application programmer. This process, called query optimization, depended on the fact that queries were expressed in terms of mathematical logic.
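The user, address, and phone number example can be sketched with Python's standard `sqlite3` module. The table and column names, the sample data, and the helper function names are hypothetical choices made for illustration. The queries state only what data is wanted; the access path is left to the SQLite engine.

```python
import sqlite3


def build_user_db():
    """Create normalized user, address, and phone tables linked by logical keys."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE user (
            user_id INTEGER PRIMARY KEY,
            name    TEXT NOT NULL,
            login   TEXT NOT NULL UNIQUE
        );
        CREATE TABLE address (
            address_id INTEGER PRIMARY KEY,
            user_id    INTEGER NOT NULL REFERENCES user(user_id),
            street     TEXT NOT NULL
        );
        CREATE TABLE phone (
            phone_id INTEGER PRIMARY KEY,
            user_id  INTEGER NOT NULL REFERENCES user(user_id),
            number   TEXT NOT NULL
        );
        INSERT INTO user (user_id, name, login) VALUES (1, 'Alice', 'alice'), (2, 'Bob', 'bob');
        -- Rows exist in the optional tables only where data was provided.
        INSERT INTO address (user_id, street) VALUES (1, '1 Main St');
        INSERT INTO phone (user_id, number) VALUES (1, '555-0100'), (1, '555-0101');
        """
    )
    return conn


def phones_by_user(conn):
    """Join on the key relationship; no navigation code is written by the caller."""
    return conn.execute(
        "SELECT u.name, p.number FROM user AS u "
        "JOIN phone AS p ON p.user_id = u.user_id "
        "ORDER BY u.name, p.number"
    ).fetchall()


def users_without_phone(conn):
    """Users for whom no phone row was ever created."""
    return conn.execute(
        "SELECT u.name FROM user AS u "
        "LEFT JOIN phone AS p ON p.user_id = u.user_id "
        "WHERE p.phone_id IS NULL ORDER BY u.name"
    ).fetchall()
```

Bob has no rows in `address` or `phone`, which mirrors the point that records in the optional tables are created only when the data is actually supplied.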
Codd's paper inspired teams at various universities to research the subject, including one at University of California, Berkeley[12] led by Eugene Wong and Michael Stonebraker, who started INGRES using funding that had already been allocated for a geographical database project and student programmers to produce code. Beginning in 1973, INGRES delivered its first test products which were generally ready for widespread use in 1979. INGRES was similar to System R in a number of ways, including the use of a "language" for data access, known as QUEL. Over time, INGRES moved to the emerging SQL standard.
IBM itself did one test implementation of the relational model, PRTV, and a production one, Business System 12, both now discontinued. Honeywell wrote MRDS for Multics, and now there are two new implementations: Alphora Dataphor and Rel. Most other DBMS implementations usually called relational are actually SQL DBMSs.
In 1970, the University of Michigan began development of the MICRO Information Management System[14] based on D.L. Childs' Set-Theoretic Data model.[15][16][17] The university in 1974 hosted a debate between Codd and Bachman which Bruce Lindsay of IBM later described as "throwing lightning bolts at each other!".[12] MICRO was used to manage very large data sets by the US Department of Labor, the U.S. Environmental Protection Agency, and researchers from the University of Alberta, the University of Michigan, and Wayne State University. It ran on IBM mainframe computers using the Michigan Terminal System.[18] The system remained in production until 1998.
Integrated approach
In the 1970s and 1980s, attempts were made to build database systems with integrated hardware and software. The underlying philosophy was that such integration would provide higher performance at a lower cost. Examples were IBM System/38, the early offering of Teradata, and the Britton Lee, Inc. database machine.
Another approach to hardware support for database management was ICL's CAFS accelerator, a hardware disk controller with programmable search capabilities. In the long term, these efforts were generally unsuccessful because specialized database machines could not keep pace with the rapid development and progress of general-purpose computers. Thus most database systems nowadays are software systems running on general-purpose hardware, using general-purpose computer data storage. However, this idea is still pursued in certain applications by some companies like Netezza and Oracle (Exadata).
Late 1970s, SQL DBMS
IBM formed a team led by Codd that started working on a prototype system, System R, despite opposition from others at the company.[12] The first version was ready in 1974/5, and work then started on multi-table systems in which the data could be split so that all of the data for a record (some of which is optional) did not have to be stored in a single large "chunk". Subsequent multi-user versions were tested by customers in 1978 and 1979, by which time a standardized query language – SQL[citation needed] – had been added. Codd's ideas were establishing themselves as both workable and superior to CODASYL, pushing IBM to develop a true production version of System R, known as SQL/DS, and, later, Database 2 (IBM Db2).
Larry Ellison's Oracle Database (or more simply, Oracle) started from a different chain, based on IBM's papers on System R. Though Oracle V1 implementations were completed in 1978, it was not until Oracle Version 2 that Ellison beat IBM to market, in 1979.[19]
Stonebraker went on to apply the lessons from INGRES to develop a new database, Postgres, which is now known as PostgreSQL. PostgreSQL is often used for global mission-critical applications (the .org and .info domain name registries use it as their primary data store, as do many large companies and financial institutions).
In Sweden, Codd's paper was also read and Mimer SQL was developed in the mid-1970s at Uppsala University. In 1984, this project was consolidated into an independent enterprise.
Another data model, the entity–relationship model, emerged in 1976 and gained popularity for database design as it emphasized a more familiar description than the earlier relational model. Later on, entity–relationship constructs were retrofitted as a data modeling construct for the relational model, and the difference between the two has become irrelevant.[citation needed]
1980s, on the desktop
Besides IBM and various software companies such as Sybase and Informix Corporation, most large computer hardware vendors by the 1980s had their own database systems such as DEC's VAX Rdb/VMS.[20] The decade ushered in the age of desktop computing. The new computers empowered their users with spreadsheets like Lotus 1-2-3 and database software like dBASE. The dBASE product was lightweight and easy for any computer user to understand out of the box. C. Wayne Ratliff, the creator of dBASE, stated: "dBASE was different from programs like BASIC, C, FORTRAN, and COBOL in that a lot of the dirty work had already been done. The data manipulation is done by dBASE instead of by the user, so the user can concentrate on what he is doing, rather than having to mess with the dirty details of opening, reading, and closing files, and managing space allocation."[21] dBASE was one of the top selling software titles in the 1980s and early 1990s.
1990s, object-oriented
By the start of the decade, databases had grown into a billion-dollar industry in about ten years.[20] The 1990s, along with a rise in object-oriented programming, saw a change in how data in various databases were handled. Programmers and designers began to treat the data in their databases as objects. That is to say that if a person's data were in a database, that person's attributes, such as their address, phone number, and age, were now considered to belong to that person instead of being extraneous data. This allows relations between data to be expressed in terms of objects and their attributes rather than individual fields.[22] The term "object–relational impedance mismatch" described the inconvenience of translating between programmed objects and database tables. Object databases and object–relational databases attempt to solve this problem by providing an object-oriented language (sometimes as extensions to SQL) that programmers can use as an alternative to purely relational SQL. On the programming side, libraries known as object–relational mappings (ORMs) attempt to solve the same problem.
2000s, NoSQL and NewSQL
Database sales grew rapidly during the dotcom bubble and, after its end, the rise of ecommerce. The popularity of open source databases such as MySQL has grown since 2000, to the extent that Ken Jacobs of Oracle said in 2005 that perhaps "these guys are doing to us what we did to IBM".[20]
XML databases are a type of structured document-oriented database that allows querying based on XML document attributes. XML databases are mostly used in applications where the data is conveniently viewed as a collection of documents, with a structure that can vary from the very flexible to the highly rigid: examples include scientific articles, patents, tax filings, and personnel records.
NoSQL databases are often very fast,[23][24] do not require fixed table schemas, avoid join operations by storing denormalized data, and are designed to scale horizontally.
In recent years, there has been a strong demand for massively distributed databases with high partition tolerance, but according to the CAP theorem, it is impossible for a distributed system to simultaneously provide consistency, availability, and partition tolerance guarantees. A distributed system can satisfy any two of these guarantees at the same time, but not all three. For that reason, many NoSQL databases use what is called eventual consistency to provide both availability and partition tolerance guarantees with a reduced level of data consistency.
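Eventual consistency can be illustrated with a toy simulation. The sketch below assumes a last-writer-wins policy with caller-supplied, unique logical timestamps, which is only one of several reconciliation strategies and is not attributed to any specific NoSQL product; the `Replica` class and `simulate_partition` function are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Replica:
    name: str
    # Maps key -> (logical timestamp, value).
    store: Dict[str, Tuple[int, str]] = field(default_factory=dict)

    def write(self, key: str, value: str, timestamp: int) -> None:
        """Accept the write only if it is newer than what this replica holds.

        Assumes timestamps are unique; ties would need an extra tie-breaker.
        """
        current = self.store.get(key)
        if current is None or timestamp > current[0]:
            self.store[key] = (timestamp, value)

    def read(self, key: str) -> Optional[str]:
        entry = self.store.get(key)
        return None if entry is None else entry[1]

    def merge_from(self, other: "Replica") -> None:
        """Anti-entropy step: pull the other replica's entries, keeping the newest."""
        for key, (timestamp, value) in other.store.items():
            self.write(key, value, timestamp)


def simulate_partition():
    a, b = Replica("a"), Replica("b")
    # During a partition both replicas stay available and accept writes.
    a.write("cart", "apple", timestamp=1)
    b.write("cart", "pear", timestamp=2)
    during = (a.read("cart"), b.read("cart"))
    # Once the partition heals, the replicas exchange state and converge.
    a.merge_from(b)
    b.merge_from(a)
    after = (a.read("cart"), b.read("cart"))
    return during, after
```

While partitioned, the two replicas answer reads with different values (reduced consistency); after merging, both return the value with the highest timestamp.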
NewSQL is a class of modern relational databases that aims to provide the same scalable performance of NoSQL systems for online transaction processing (read-write) workloads while still using SQL and maintaining the ACID guarantees of a traditional database system.
Use cases
Databases are used to support internal operations of organizations and to underpin online interactions with customers and suppliers (see Enterprise software).
Databases are used to hold administrative information and more specialized data, such as engineering data or economic models. Examples include computerized library systems, flight reservation systems, computerized parts inventory systems, and many content management systems that store websites as collections of webpages in a database.
Classification
One way to classify databases involves the type of their contents, for example: bibliographic, document-text, statistical, or multimedia objects. Another way is by their application area, for example: accounting, music compositions, movies, banking, manufacturing, or insurance. A third way is by some technical aspect, such as the database structure or interface type. This section lists a few of the adjectives used to characterize different kinds of databases.
- An in-memory database is a database that primarily resides in main memory, but is typically backed up by non-volatile computer data storage. Main memory databases are faster than disk databases, and so are often used where response time is critical, such as in telecommunications network equipment.
- An active database includes an event-driven architecture which can respond to conditions both inside and outside the database. Possible uses include security monitoring, alerting, statistics gathering and authorization. Many databases provide active database features in the form of database triggers.
- A cloud database relies on cloud technology. Both the database and most of its DBMS reside remotely, "in the cloud", while its applications are both developed by programmers and later maintained and used by end-users through a web browser and Open APIs.
- Data warehouses[citation needed] archive data from operational databases and often from external sources such as market research firms. The warehouse becomes the central source of data for use by managers and other end-users who may not have access to operational data. For example, sales data might be aggregated to weekly totals and converted from internal product codes to use UPCs so that they can be compared with ACNielsen data. Some basic and essential components of data warehousing include extracting, analyzing, and mining data, transforming, loading, and managing data so as to make them available for further use.
- A deductive database combines logic programming with a relational database.
- A distributed database is one in which both the data and the DBMS span multiple computers.
- A document-oriented database is designed for storing, retrieving, and managing document-oriented, or semi-structured, information. Document-oriented databases are one of the main categories of NoSQL databases.
- An embedded database system is a DBMS which is tightly integrated with application software that requires access to stored data in such a way that the DBMS is hidden from the application's end-users and requires little or no ongoing maintenance.[25]
- End-user databases consist of data developed by individual end-users. Examples of these are collections of documents, spreadsheets, presentations, multimedia, and other files. Several products[which?] exist to support such databases.
- A federated database system comprises several distinct databases, each with its own DBMS. It is handled as a single database by a federated database management system (FDBMS), which transparently integrates multiple autonomous DBMSs, possibly of different types (in which case it would also be a heterogeneous database system), and provides them with an integrated conceptual view.
- Sometimes the term multi-database is used as a synonym for federated database, though it may refer to a less integrated (e.g., without an FDBMS and a managed integrated schema) group of databases that cooperate in a single application. In this case, typically middleware is used for distribution, which typically includes an atomic commit protocol (ACP), e.g., the two-phase commit protocol, to allow distributed (global) transactions across the participating databases.
- A graph database is a kind of NoSQL database that uses graph structures with nodes, edges, and properties to represent and store information. General graph databases that can store any graph are distinct from specialized graph databases such as triplestores and network databases.
- An array DBMS is a kind of NoSQL DBMS that allows modeling, storage, and retrieval of (usually large) multi-dimensional arrays such as satellite images and climate simulation output.
- In a hypertext or hypermedia database, any word or a piece of text representing an object, e.g., another piece of text, an article, a picture, or a film, can be hyperlinked to that object. Hypertext databases are particularly useful for organizing large amounts of disparate information. For example, they are useful for organizing online encyclopedias, where users can conveniently jump around the text. The World Wide Web is thus a large distributed hypertext database.
- A knowledge base (abbreviated KB, kb or Δ[26][27]) is a special kind of database for knowledge management, providing the means for the computerized collection, organization, and retrieval of knowledge. It may also be a collection of data representing problems with their solutions and related experiences.
- A mobile database can be carried on or synchronized from a mobile computing device.
- Operational databases store detailed data about the operations of an organization. They typically process relatively high volumes of updates using transactions. Examples include customer databases that record contact, credit, and demographic information about a business's customers; personnel databases that hold information such as salary, benefits, and skills data about employees; enterprise resource planning systems that record details about product components and parts inventory; and financial databases that keep track of the organization's money, accounting, and financial dealings.
- A parallel database seeks to improve performance through parallelization for tasks such as loading data, building indexes and evaluating queries.
- The major parallel DBMS architectures which are induced by the underlying hardware architecture are:
- Shared memory architecture, where multiple processors share the main memory space, as well as other data storage.
- Shared disk architecture, where each processing unit (typically consisting of multiple processors) has its own main memory, but all units share the other storage.
- Shared-nothing architecture, where each processing unit has its own main memory and other storage.
- Probabilistic databases employ fuzzy logic to draw inferences from imprecise data.
- Real-time databases process transactions fast enough for the result to come back and be acted on right away.
- A spatial database can store the data with multidimensional features. The queries on such data include location-based queries, like "Where is the closest hotel in my area?".
- A temporal database has built-in time aspects, for example a temporal data model and a temporal version of SQL. More specifically the temporal aspects usually include valid-time and transaction-time.
- A terminology-oriented database builds upon an object-oriented database, often customized for a specific field.
- An unstructured data database is intended to store in a manageable and protected way diverse objects that do not fit naturally and conveniently in common databases. It may include email messages, documents, journals, multimedia objects, etc. The name may be misleading since some objects can be highly structured. However, the entire possible object collection does not fit into a predefined structured framework. Most established DBMSs now support unstructured data in various ways, and new dedicated DBMSs are emerging.
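Two of the categories listed above, in-memory databases and active databases with triggers, can be sketched together using Python's standard `sqlite3` module. The `account` and `audit_log` tables, the trigger name, and the function name are hypothetical; the `:memory:` database here is not backed by non-volatile storage, so it illustrates only the in-memory aspect.

```python
import sqlite3


def run_trigger_demo():
    """Show a trigger reacting to an update inside an in-memory database."""
    conn = sqlite3.connect(":memory:")  # the database lives entirely in main memory
    conn.executescript(
        """
        CREATE TABLE account (
            id      INTEGER PRIMARY KEY,
            balance INTEGER NOT NULL
        );
        CREATE TABLE audit_log (
            account_id  INTEGER NOT NULL,
            old_balance INTEGER NOT NULL,
            new_balance INTEGER NOT NULL
        );
        -- Event-driven rule: every balance change is recorded automatically.
        CREATE TRIGGER log_balance_change
        AFTER UPDATE OF balance ON account
        BEGIN
            INSERT INTO audit_log (account_id, old_balance, new_balance)
            VALUES (OLD.id, OLD.balance, NEW.balance);
        END;
        INSERT INTO account (id, balance) VALUES (1, 100);
        """
    )
    # The application only updates the account; the database writes the log row.
    conn.execute("UPDATE account SET balance = balance - 30 WHERE id = 1")
    conn.commit()
    log = conn.execute(
        "SELECT account_id, old_balance, new_balance FROM audit_log"
    ).fetchall()
    conn.close()
    return log
```

The audit row is produced by the database itself in response to the update, which is the kind of alerting and monitoring use the active-database entry describes.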
Database management system
Connolly and Begg define database management system (DBMS) as a "software system that enables users to define, create, maintain and control access to the database."[28] Examples of DBMSs include MySQL, MariaDB, PostgreSQL, Microsoft SQL Server, Oracle Database, and Microsoft Access.
The DBMS acronym is sometimes extended to indicate the underlying database model, with RDBMS for the relational, OODBMS for the object (oriented) and ORDBMS for the object–relational model. Other extensions can indicate some other characteristics, such as DDBMS for distributed database management systems.
The functionality provided by a DBMS can vary enormously. The core functionality is the storage, retrieval and update of data. Codd proposed the following functions and services a fully-fledged general purpose DBMS should provide:[29]
- Data storage, retrieval and update
- User accessible catalog or data dictionary describing the metadata
- Support for transactions and concurrency
- Facilities for recovering the database should it become damaged
- Support for authorization of access and update of data
- Access support from remote locations
- Enforcing constraints to ensure data in the database abides by certain rules
A DBMS is also generally expected to provide a set of utilities for administering the database effectively, including import, export, monitoring, defragmentation and analysis utilities.[30] The core part of the DBMS, which mediates between the database and the application interface, is sometimes referred to as the database engine.
Often DBMSs will have configuration parameters that can be statically and dynamically tuned, for example the maximum amount of main memory on a server the database can use. The trend is to minimize the amount of manual configuration, and for cases such as embedded databases the need to target zero-administration is paramount.
The major enterprise DBMSs have tended to increase in size and functionality and have involved up to thousands of person-years of development effort over their lifetimes.[a]
Early multi-user DBMSs typically only allowed the application to reside on the same computer, with access via terminals or terminal emulation software. The client–server architecture was a development in which the application resided on a client desktop and the database on a server, allowing the processing to be distributed. This evolved into a multitier architecture incorporating application servers and web servers, with the end-user interface provided via a web browser and the database directly connected only to the adjacent tier.[32]
A general-purpose DBMS will provide public application programming interfaces (API) and optionally a processor for database languages such as SQL to allow applications to be written to interact with and manipulate the database. A special-purpose DBMS may use a private API and be specifically customized and linked to a single application. For example, an email system performs many of the functions of a general-purpose DBMS, such as message insertion, message deletion, attachment handling, blocklist lookup, and associating messages with an email address; however, these functions are limited to what is required to handle email.
Application
External interaction with the database will be via an application program that interfaces with the DBMS.[33] This can range from a database tool that allows users to execute SQL queries textually or graphically, to a website that happens to use a database to store and search information.
Application program interface
A programmer will code interactions to the database (sometimes referred to as a datasource) via an application program interface (API) or via a database language. The particular API or language chosen will need to be supported by the DBMS, possibly indirectly via a preprocessor or a bridging API. Some APIs aim to be database-independent, ODBC being a commonly known example. Other common APIs include JDBC and ADO.NET.
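As a rough illustration of coding against such an API, the sketch below uses Python's DB-API 2.0 convention (PEP 249), which is not mentioned in the article but plays a comparable role for Python drivers. The `sqlite3` standard-library module is used as the driver; the `book` table, its rows, and the `find_titles` helper are hypothetical names chosen for the example.

```python
import sqlite3

# Hypothetical in-memory database used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book (title TEXT, author TEXT)")
conn.executemany(
    "INSERT INTO book (title, author) VALUES (?, ?)",
    [("Zebra Tables", "Ada"), ("Atomic Rows", "Ada"), ("Other Book", "Bo")],
)

def find_titles(connection, author):
    """Return the titles written by one author, sorted alphabetically."""
    cursor = connection.cursor()
    # The "?" placeholder lets the driver bind the value, so the
    # application never splices user input into the SQL text itself.
    cursor.execute(
        "SELECT title FROM book WHERE author = ? ORDER BY title", (author,)
    )
    return [row[0] for row in cursor.fetchall()]

titles_by_ada = find_titles(conn, "Ada")
```

The same connect, cursor, execute, fetch pattern is offered by other DB-API drivers, although placeholder syntax and connection arguments differ between them.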
Database languages
Database languages are special-purpose languages, which allow one or more of the following tasks, sometimes distinguished as sublanguages:
- Data control language (DCL) – controls access to data;
- Data definition language (DDL) – defines data structures, for example by creating, altering, or dropping tables and the relationships among them;
- Data manipulation language (DML) – performs tasks such as inserting, updating, or deleting data occurrences;
- Data query language (DQL) – allows searching for information and computing derived information.
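The sublanguages listed above can be sketched with SQL statements issued through Python's `sqlite3` module. This is only an illustration: the `employee` table and its values are hypothetical, and SQLite does not implement DCL statements such as `GRANT`, so that sublanguage is omitted here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define the structure of a table.
conn.execute(
    "CREATE TABLE employee ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " salary REAL NOT NULL)"
)

# DML: insert and update data occurrences.
conn.execute("INSERT INTO employee (name, salary) VALUES (?, ?)", ("Ada", 100.0))
conn.execute("INSERT INTO employee (name, salary) VALUES (?, ?)", ("Bo", 80.0))
conn.execute("UPDATE employee SET salary = salary + 10 WHERE name = ?", ("Bo",))

# DQL: search for information and compute derived information.
rows = conn.execute(
    "SELECT name, salary FROM employee ORDER BY name"
).fetchall()
total_salary = conn.execute("SELECT SUM(salary) FROM employee").fetchone()[0]
```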
Database languages are specific to a particular data model. Notable examples include:
- SQL combines the roles of data definition, data manipulation, and query in a single language. It was one of the first commercial languages for the relational model, although it departs in some respects from the relational model as described by Codd (for example, the rows and columns of a table can be ordered). SQL became a standard of the American National Standards Institute (ANSI) in 1986, and of the International Organization for Standardization (ISO) in 1987. The standards have been regularly enhanced since and are supported (with varying degrees of conformance) by all mainstream commercial relational DBMSs.[34][35]
- OQL is an object model language standard (from the Object Data Management Group). It has influenced the design of some of the newer query languages like JDOQL and EJB QL.
- XQuery is a standard XML query language implemented by XML database systems such as MarkLogic and eXist, by relational databases with XML capability such as Oracle and Db2, and also by in-memory XML processors such as Saxon.
- SQL/XML combines XQuery with SQL.[36]
A database language may also incorporate features like:
- DBMS-specific configuration and storage engine management
- Computations to modify query results, like counting, summing, averaging, sorting, grouping, and cross-referencing
- Constraint enforcement (e.g. in an automotive database, only allowing one engine type per car)
- Application programming interface version of the query language, for programmer convenience
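Two of the features above, constraint enforcement and computations over query results, can be sketched as follows. The automotive schema is a hypothetical reading of the "one engine type per car" example: each car is a single row keyed by a made-up `vin`, so a second engine type for the same car is rejected, and a `CHECK` constraint restricts the allowed engine types.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE car ("
    " vin TEXT PRIMARY KEY,"  # one row, and so one engine type, per car
    " make TEXT NOT NULL,"
    " engine_type TEXT NOT NULL"
    "   CHECK (engine_type IN ('petrol', 'diesel', 'electric')),"
    " price REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO car VALUES (?, ?, ?, ?)",
    [
        ("V1", "Acme", "petrol", 20000.0),
        ("V2", "Acme", "electric", 30000.0),
        ("V3", "Zeta", "diesel", 25000.0),
    ],
)

def try_insert(row):
    """Attempt an insert; return False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO car VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

# A second engine type for car V1 violates the primary key.
second_engine_accepted = try_insert(("V1", "Acme", "diesel", 20000.0))
# An unknown engine type violates the CHECK constraint.
unknown_engine_accepted = try_insert(("V4", "Zeta", "steam", 1.0))

# Counting, summing, averaging, grouping and sorting in one query.
summary = conn.execute(
    "SELECT make, COUNT(*), SUM(price), AVG(price)"
    " FROM car GROUP BY make ORDER BY make"
).fetchall()
```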
Storage
Database storage is the container of the physical materialization of a database. It comprises the internal (physical) level in the database architecture. It also contains all the information needed (e.g., metadata, "data about the data", and internal data structures) to reconstruct the conceptual level and external level from the internal level when needed. Databases as digital objects contain three layers of information which must be stored: the data, the structure, and the semantics. Proper storage of all three layers is needed for future preservation and longevity of the database.[37]
Putting data into permanent storage is generally the responsibility of the database engine, also known as the "storage engine". Though storage is typically accessed by a DBMS through the underlying operating system (often using the operating system's file systems as intermediates for storage layout), storage properties and configuration settings are extremely important for the efficient operation of the DBMS, and thus are closely maintained by database administrators. A DBMS, while in operation, always has its database residing in several types of storage (e.g., memory and external storage). The database data and the additional needed information, possibly in very large amounts, are coded into bits. Data typically reside in the storage in structures that look completely different from the way the data look at the conceptual and external levels, but in ways that attempt to optimize, as far as possible, the reconstruction of these levels when needed by users and programs, as well as the computation of additional types of needed information from the data (e.g., when querying the database).
Some DBMSs support specifying which character encoding was used to store data, so multiple encodings can be used in the same database.
Various low-level database storage structures are used by the storage engine to serialize the data model so it can be written to the medium of choice. Techniques such as indexing may be used to improve performance. Conventional storage is row-oriented, but there are also column-oriented and correlation databases.
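The effect of indexing can be observed, at least informally, by asking a DBMS how it plans to run a query before and after an index exists. The sketch below uses SQLite's `EXPLAIN QUERY PLAN` through Python's `sqlite3` module; the `city` table, the `idx_city_name` index, and the `plan` helper are hypothetical, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE city (id INTEGER PRIMARY KEY, name TEXT, population INTEGER)"
)
conn.executemany(
    "INSERT INTO city (name, population) VALUES (?, ?)",
    [(f"city{i}", i * 10) for i in range(1000)],
)

def plan(sql, params=()):
    """Return the human-readable plan text SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row holds the textual description.
    return " | ".join(row[-1] for row in rows)

query = "SELECT * FROM city WHERE name = ?"

# Without an index the engine has to scan the whole table.
plan_before = plan(query, ("city42",))

# After the index is created, the lookup can use it instead.
conn.execute("CREATE INDEX idx_city_name ON city (name)")
plan_after = plan(query, ("city42",))
```

Comparing `plan_before` with `plan_after` shows the planner switching from a table scan to a search that names `idx_city_name`.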
Materialized views
Often storage redundancy is employed to increase performance. A common example is storing materialized views, which consist of frequently needed external views or query results. Storing such views saves the expense of computing them each time they are needed. The downsides of materialized views are the overhead incurred when updating them to keep them synchronized with their original updated database data, and the cost of storage redundancy.
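The trade-off described above can be sketched by emulating a materialized view by hand. SQLite has no built-in materialized views, so this example (an assumption of the sketch, not a DBMS feature) stores a query result in an ordinary table, `sales_by_region`, and refreshes it explicitly; all table and function names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (region TEXT, amount REAL)")
# The "materialized view": a stored copy of an aggregate query result.
conn.execute("CREATE TABLE sales_by_region (region TEXT PRIMARY KEY, total REAL)")

def refresh_sales_by_region():
    """Recompute the stored result from the base table in one transaction."""
    with conn:
        conn.execute("DELETE FROM sales_by_region")
        conn.execute(
            "INSERT INTO sales_by_region"
            " SELECT region, SUM(amount) FROM sale GROUP BY region"
        )

def totals():
    """Read the stored result without recomputing the aggregate."""
    return dict(conn.execute("SELECT region, total FROM sales_by_region"))

conn.executemany(
    "INSERT INTO sale VALUES (?, ?)",
    [("north", 10.0), ("north", 5.0), ("south", 7.0)],
)
refresh_sales_by_region()
fresh = totals()

# A new sale makes the stored copy stale until the next refresh.
conn.execute("INSERT INTO sale VALUES ('south', 3.0)")
stale = totals()

refresh_sales_by_region()
updated = totals()
```

Reads from `sales_by_region` are cheap, but the stored copy lags behind the base table until it is refreshed, which is the synchronization overhead the paragraph mentions.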
Replication
Occasionally a database employs storage redundancy by replicating database objects (with one or more copies) to increase data availability (both to improve the performance of simultaneous accesses by multiple end users to the same database object, and to provide resiliency in the case of a partial failure of a distributed database). Updates of a replicated object need to be synchronized across the object copies. In many cases, the entire database is replicated.
Virtualization
With data virtualization, the data used remains in its original locations and real-time access is established to allow analytics across multiple sources. This can aid in resolving some technical difficulties such as compatibility problems when combining data from various platforms, lowering the risk of error caused by faulty data, and guaranteeing that the newest data is used. Furthermore, avoiding the creation of a new database containing personal information can make it easier to comply with privacy regulations. However, with data virtualization, the connection to all necessary data sources must be operational as there is no local copy of the data, which is one of the main drawbacks of the approach.[38]
Security
Database security deals with all aspects of protecting the database content, its owners, and its users. It ranges from protection from intentional unauthorized database uses to unintentional database accesses by unauthorized entities (e.g., a person or a computer program).
Database access control deals with controlling who (a person or a certain computer program) is allowed to access what information in the database. The information may comprise specific database objects (e.g., record types, specific records, data structures), certain computations over certain objects (e.g., query types, or specific queries), or the use of specific access paths to the former (e.g., using specific indexes or other data structures to access information). Database access controls are set by specially authorized personnel (authorized by the database owner) who use dedicated, protected security DBMS interfaces.
This may be managed directly on an individual basis, or by the assignment of individuals and privileges to groups, or (in the most elaborate models) through the assignment of individuals and groups to roles which are then granted entitlements. Data security prevents unauthorized users from viewing or updating the database. Using passwords, users are allowed access to the entire database or subsets of it called "subschemas". For example, an employee database can contain all the data about an individual employee, but one group of users may be authorized to view only payroll data, while others are allowed access to only work history and medical data. If the DBMS provides a way to interactively enter and update the database, as well as interrogate it, this capability allows for managing personal databases.
Data security in general deals with protecting specific chunks of data, both physically (i.e., from corruption, destruction, or removal; e.g., see physical security) and against the interpretation of them, or parts of them, as meaningful information (e.g., by looking at the strings of bits that they comprise and deducing specific valid credit-card numbers; e.g., see data encryption).
Change and access logging records who accessed which attributes, what was changed, and when it was changed. Logging services allow for a forensic database audit later by keeping a record of access occurrences and changes. Sometimes application-level code is used to record changes rather than leaving this in the database. Monitoring can be set up to attempt to detect security breaches. Organizations that take database security seriously are better protected against security breaches and hacking activities such as firewall intrusion, virus spread, and ransomware. This helps protect the company's essential information, which must not be shared with outsiders.[39]
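One way to keep change logging inside the database, rather than in application code, is a trigger that writes to an audit table. The sketch below is a minimal example in SQLite via Python; the `account` and `audit_log` tables and the trigger name are hypothetical. It records what changed and when, but not who made the change, because SQLite has no notion of database users.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE account (id INTEGER PRIMARY KEY, owner TEXT, balance REAL);

    -- Audit table: one row per recorded change.
    CREATE TABLE audit_log (
        account_id  INTEGER,
        old_balance REAL,
        new_balance REAL,
        changed_at  TEXT DEFAULT CURRENT_TIMESTAMP
    );

    -- Fires after every update of the balance column.
    CREATE TRIGGER log_balance_change
    AFTER UPDATE OF balance ON account
    BEGIN
        INSERT INTO audit_log (account_id, old_balance, new_balance)
        VALUES (OLD.id, OLD.balance, NEW.balance);
    END;
    """
)

conn.execute("INSERT INTO account VALUES (1, 'Ada', 100.0)")
conn.execute("UPDATE account SET balance = 75.0 WHERE id = 1")

log = conn.execute(
    "SELECT account_id, old_balance, new_balance FROM audit_log"
).fetchall()
timestamps = [row[0] for row in conn.execute("SELECT changed_at FROM audit_log")]
```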
Transactions and concurrency
Database transactions can be used to introduce some level of fault tolerance and data integrity after recovery from a crash. A database transaction is a unit of work, typically encapsulating a number of operations over a database (e.g., reading a database object, writing, acquiring or releasing a lock, etc.), an abstraction supported in database systems and also in other systems. Each transaction has well-defined boundaries in terms of which program/code executions are included in that transaction (determined by the transaction's programmer via special transaction commands).
The acronym ACID describes some ideal properties of a database transaction: atomicity, consistency, isolation, and durability.
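Atomicity, the first of these properties, can be sketched with a money transfer that either applies both of its updates or neither. The example uses Python's `sqlite3` module, where using the connection as a context manager commits on success and rolls back on an exception; the `account` table, its `CHECK` constraint, and the `transfer` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    " name TEXT PRIMARY KEY,"
    " balance REAL NOT NULL CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO account VALUES (?, ?)", [("alice", 100.0), ("bob", 20.0)]
)
conn.commit()

def transfer(src, dst, amount):
    """Move money between accounts as one all-or-nothing unit of work."""
    try:
        with conn:  # commit if the block succeeds, roll back if it raises
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # This debit fails the CHECK constraint if funds are insufficient.
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False

def balances():
    return dict(conn.execute("SELECT name, balance FROM account"))

first_ok = transfer("alice", "bob", 30.0)    # enough funds: both updates apply
after_first = balances()
second_ok = transfer("alice", "bob", 500.0)  # debit fails: credit is undone too
after_second = balances()
```

In the failed transfer the credit to `bob` has already been executed when the debit is rejected, yet the rollback leaves both balances as they were.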
Migration
A database built with one DBMS is not portable to another DBMS (i.e., the other DBMS cannot run it). However, in some situations, it is desirable to migrate a database from one DBMS to another. The reasons are primarily economic (different DBMSs may have different total costs of ownership or TCOs), functional, and operational (different DBMSs may have different capabilities). The migration involves the database's transformation from one DBMS type to another. The transformation should, if possible, keep the database-related application (i.e., all related application programs) intact. Thus, the database's conceptual and external architectural levels should be maintained in the transformation. It may also be desired that some aspects of the internal architectural level are maintained. A complex or large database migration may be a complicated and costly (one-time) project by itself, which should be factored into the decision to migrate. This is in spite of the fact that tools may exist to help migration between specific DBMSs. Typically, a DBMS vendor provides tools to help import databases from other popular DBMSs.
Building, maintaining, and tuning
After designing a database for an application, the next stage is building the database. Typically, an appropriate general-purpose DBMS can be selected for this purpose. A DBMS provides the user interfaces that database administrators need to define the application's data structures within the DBMS's respective data model. Other user interfaces are used to select the needed DBMS parameters (such as security-related and storage allocation parameters).
When the database is ready (all its data structures and other needed components are defined), it is typically populated with the application's initial data (database initialization, which is typically a distinct project; in many cases using specialized DBMS interfaces that support bulk insertion) before making it operational. In some cases, the database becomes operational while empty of application data, and data are accumulated during its operation.
After the database is created, initialized and populated, it needs to be maintained. Various database parameters may need changing and the database may need to be tuned for better performance; the application's data structures may be changed or added, new related application programs may be written to add to the application's functionality, etc.
Backup and restore
Sometimes it is desired to bring a database back to a previous state (for many reasons, e.g., when the database is found to be corrupted due to a software error, or when it has been updated with erroneous data). To achieve this, a backup operation is done occasionally or continuously, where each desired database state (i.e., the values of its data and their embedding in the database's data structures) is kept within dedicated backup files (many techniques exist to do this effectively). When a database administrator decides to bring the database back to this state (e.g., by specifying a desired point in time when the database was in this state), these files are used to restore that state.
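A backup and restore cycle of this kind can be sketched with the `Connection.backup` method of Python's `sqlite3` module (available since Python 3.7). To keep the example self-contained, both the live database and the backup are in-memory databases; a real backup would normally target a file. The `note` table and its contents are hypothetical.

```python
import sqlite3

# The "live" database in a known-good state.
live = sqlite3.connect(":memory:")
live.execute("CREATE TABLE note (id INTEGER PRIMARY KEY, body TEXT)")
live.execute("INSERT INTO note (body) VALUES ('good data')")
live.commit()

# Backup: copy the whole database into a second connection.
backup = sqlite3.connect(":memory:")
live.backup(backup)

# Later, the live database is updated with erroneous data.
live.execute("UPDATE note SET body = 'erroneous data'")
live.commit()
damaged = live.execute("SELECT body FROM note").fetchall()

# Restore: copy the saved state back over the live database.
backup.backup(live)
restored = live.execute("SELECT body FROM note").fetchall()
```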
Static analysis
Static analysis techniques for software verification can also be applied to query languages. In particular, the abstract interpretation framework has been extended to the field of query languages for relational databases as a way to support sound approximation techniques.[40] The semantics of query languages can be tuned according to suitable abstractions of the concrete domain of data. The abstraction of relational database systems has many interesting applications, in particular for security purposes, such as fine-grained access control, watermarking, etc.
Miscellaneous features
Other DBMS features might include:
- Database logs – Keep a history of the executed functions.
- Graphics component for producing graphs and charts, especially in a data warehouse system.
- Query optimizer – Performs query optimization on every query to choose an efficient query plan (a partial order (tree) of operations) to be executed to compute the query result. May be specific to a particular storage engine.
- Tools or hooks for database design, application programming, application program maintenance, database performance analysis and monitoring, database configuration monitoring, DBMS hardware configuration (a DBMS and related database may span computers, networks, and storage units) and related database mapping (especially for a distributed DBMS), storage allocation and database layout monitoring, storage migration, etc.
Increasingly, there are calls for a single system that incorporates all of these core functionalities into the same build, test, and deployment framework for database management and source control. Borrowing from other developments in the software industry, some market such offerings as "DevOps for database".[41]
Design and modeling
The first task of a database designer is to produce a conceptual data model that reflects the structure of the information to be held in the database. A common approach to this is to develop an entity–relationship model, often with the aid of drawing tools. Another popular approach is the Unified Modeling Language. A successful data model will accurately reflect the possible state of the external world being modeled: for example, if people can have more than one phone number, it will allow this information to be captured. Designing a good conceptual data model requires a good understanding of the application domain; it typically involves asking deep questions about the things of interest to an organization, like "can a customer also be a supplier?", or "if a product is sold with two different forms of packaging, are those the same product or different products?", or "if a plane flies from New York to Dubai via Frankfurt, is that one flight or two (or maybe even three)?". The answers to these questions establish definitions of the terminology used for entities (customers, products, flights, flight segments) and their relationships and attributes.
Producing the conceptual data model sometimes involves input from business processes, or the analysis of workflow in the organization. This can help to establish what information is needed in the database, and what can be left out. For example, it can help when deciding whether the database needs to hold historic data as well as current data.
Having produced a conceptual data model that users are happy with, the next stage is to translate this into a schema that implements the relevant data structures within the database. This process is often called logical database design, and the output is a logical data model expressed in the form of a schema. Whereas the conceptual data model is (in theory at least) independent of the choice of database technology, the logical data model will be expressed in terms of a particular database model supported by the chosen DBMS. (The terms data model and database model are often used interchangeably, but in this article we use data model for the design of a specific database, and database model for the modeling notation used to express that design).
The most popular database model for general-purpose databases is the relational model, or more precisely, the relational model as represented by the SQL language. The process of creating a logical database design using this model uses a methodical approach known as normalization. The goal of normalization is to ensure that each elementary "fact" is only recorded in one place, so that insertions, updates, and deletions automatically maintain consistency.
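The goal of recording each fact in only one place can be illustrated with a small normalized schema. In this sketch (table names and data are hypothetical) a customer's city is stored once in `customer` and referenced from `purchase`, so a single update is reflected in every purchase row that joins to it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Each customer's city is recorded exactly once.
    CREATE TABLE customer (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        city TEXT NOT NULL
    );

    -- Purchases refer to the customer instead of repeating the city.
    CREATE TABLE purchase (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id),
        item        TEXT NOT NULL
    );

    INSERT INTO customer VALUES (1, 'Ada', 'Leeds');
    INSERT INTO purchase (customer_id, item) VALUES (1, 'lamp');
    INSERT INTO purchase (customer_id, item) VALUES (1, 'desk');
    """
)

# One update in one place; no purchase row has to be touched.
conn.execute("UPDATE customer SET city = 'York' WHERE id = 1")

rows = conn.execute(
    "SELECT p.item, c.city"
    " FROM purchase AS p JOIN customer AS c ON c.id = p.customer_id"
    " ORDER BY p.item"
).fetchall()
```

Had the city been copied into every purchase row, the same change would have required updating each copy, with the risk of leaving some of them inconsistent.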
The final stage of database design is to make the decisions that affect performance, scalability, recovery, security, and the like, which depend on the particular DBMS. This is often called physical database design, and the output is the physical data model. A key goal during this stage is data independence, meaning that the decisions made for performance optimization purposes should be invisible to end-users and applications. There are two types of data independence: physical data independence and logical data independence. Physical design is driven mainly by performance requirements, and requires a good knowledge of the expected workload and access patterns, and a deep understanding of the features offered by the chosen DBMS.
Another aspect of physical database design is security. It involves both defining access control to database objects as well as defining security levels and methods for the data itself.
Models
A database model is a type of data model that determines the logical structure of a database and fundamentally determines in which manner data can be stored, organized, and manipulated. The most popular example of a database model is the relational model (or the SQL approximation of relational), which uses a table-based format.
Common logical data models for databases include:
- Navigational databases
- Relational model
- Entity–relationship model
- Object model
- Document model
- Entity–attribute–value model
- Star schema
An object–relational database combines the two related structures.
Physical data models include:
Other models include:
Specialized models are optimized for particular types of data:
External, conceptual, and internal views
A database management system provides three views of the database data:
- The external level defines how each group of end-users sees the organization of data in the database. A single database can have any number of views at the external level.
- The conceptual level (or logical level) unifies the various external views into a compatible global view.[43] It provides the synthesis of all the external views. It is out of the scope of the various database end-users, and is rather of interest to database application developers and database administrators.
- The internal level (or physical level) is the internal organization of data inside a DBMS. It is concerned with cost, performance, scalability and other operational matters. It deals with storage layout of the data, using storage structures such as indexes to enhance performance. Occasionally it stores data of individual views (materialized views), computed from generic data, if performance justification exists for such redundancy. It balances all the external views' performance requirements, possibly conflicting, in an attempt to optimize overall performance across all activities.
While there is typically only one conceptual and internal view of the data, there can be any number of different external views. This allows users to see database information in a more business-related way rather than from a technical, processing viewpoint. For example, a financial department of a company needs the payment details of all employees as part of the company's expenses, but does not need details about employees that are in the interest of the human resources department. Thus different departments need different views of the company's database.
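The departmental example can be sketched with SQL views over a single table. The `employee` table, the two view names, and the `columns` helper below are hypothetical; SQLite is used only for convenience and does not restrict which user may query which view, so the sketch shows the differing external views rather than their enforcement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Conceptual level: one table holding all employee data.
    CREATE TABLE employee (
        id           INTEGER PRIMARY KEY,
        name         TEXT,
        salary       REAL,
        work_history TEXT
    );
    INSERT INTO employee VALUES (1, 'Ada', 5000.0, 'analyst, team lead');

    -- External level: each department sees only what it needs.
    CREATE VIEW finance_view AS SELECT id, name, salary FROM employee;
    CREATE VIEW hr_view AS SELECT id, name, work_history FROM employee;
    """
)

def columns(view_name):
    """Return the column names exposed by one of the views defined above."""
    cursor = conn.execute(f"SELECT * FROM {view_name}")
    return [description[0] for description in cursor.description]

finance_columns = columns("finance_view")
hr_columns = columns("hr_view")
finance_rows = conn.execute("SELECT * FROM finance_view").fetchall()
```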
The three-level database architecture relates to the concept of data independence which was one of the major initial driving forces of the relational model.[43] The idea is that changes made at a certain level do not affect the view at a higher level. For example, changes in the internal level do not affect application programs written using conceptual level interfaces, which reduces the impact of making physical changes to improve performance.
The conceptual view provides a level of indirection between internal and external. On the one hand it provides a common view of the database, independent of different external view structures, and on the other hand it abstracts away details of how the data are stored or managed (internal level). In principle every level, and even every external view, can be presented by a different data model. In practice a given DBMS usually uses the same data model for both the external and the conceptual levels (e.g., relational model). The internal level, which is hidden inside the DBMS and depends on its implementation, requires a different level of detail and uses its own types of data structures.
Research
Database technology has been an active research topic since the 1960s, both in academia and in the research and development groups of companies (for example IBM Research). Research activity includes theory and development of prototypes. Notable research topics have included models, the atomic transaction concept, related concurrency control techniques, query languages and query optimization methods, RAID, and more.
The database research area has several dedicated academic journals (for example, ACM Transactions on Database Systems-TODS, Data and Knowledge Engineering-DKE) and annual conferences (e.g., ACM SIGMOD, ACM PODS, VLDB, IEEE ICDE).
See also
- Comparison of database tools
- Comparison of object database management systems
- Comparison of object–relational database management systems
- Comparison of relational database management systems
- Data bank – Organized collection of data in computing
- Data hierarchy – Systematic organization of data
- Data store – Repository for data collection storage and management
- Database testing – Testing of database software systems
- Database theory – Study of database design and use
- Database-as-IPC – Misusing databases for temporary messages
- Database-centric architecture – Software architecture
- Datalog – Declarative logic programming language
- DBOS – Commercial services on top of an open source library that provides Durable Computing
- Flat-file database – Database stored as flat data
- INP (database) – Early database management system
- Journal of Database Management
- Casio Databank – Brand of watch
- Conformational dynamics data bank – Database of protein conformations
- Data repository – Long-term storage of research data
- Databank Systems – Financial computing shared service
- Dortmund Data Bank
- Electron Microscopy Data Bank
- Hazardous Substances Data Bank – Database of toxic compounds
- List of databases
- Memory bank – Logical unit of storage in computer architecture
- National Trauma Data Bank – Compilation of U.S. traumatic injury data from participating institutions
- Protein Data Bank – International open access database of large biological molecules
- Star Wars Databank
Notes
References
- ^ Ullman & Widom 1997, p. 1.
- ^ "Update Definition & Meaning". Merriam-Webster. Archived from the original on Feb 25, 2024.
- ^ "Retrieval Definition & Meaning". Merriam-Webster. Archived from the original on Jun 27, 2023.
- ^ "Administration Definition & Meaning". Merriam-Webster. Archived from the original on Dec 6, 2023.
- ^ Tsitchizris & Lochovsky 1982.
- ^ Beynon-Davies 2003.
- ^ Nelson & Nelson 2001.
- ^ Bachman 1973.
- ^ "TOPDB Top Database index". pypl.github.io.
- ^ "database, n". OED Online. Oxford University Press. June 2013. Retrieved July 12, 2013. (Subscription required.)
- ^ IBM Corporation (October 2013). "IBM Information Management System (IMS) 13 Transaction and Database Servers delivers high performance and low total cost of ownership". Retrieved Feb 20, 2014.
- ^ a b c d "RDBMS Plenary 1: Early Years" (PDF) (Interview). Interviewed by Burton Grad. Computer History Museum. 2007-06-12. Retrieved 2025-05-30.
- ^ Codd 1970.
- ^ Hershey & Easthope 1972.
- ^ North 2010.
- ^ Childs 1968a.
- ^ Childs 1968b.
- ^ M.A. Kahn; D.L. Rumelhart; B.L. Bronson (October 1977). MICRO Information Management System (Version 5.0) Reference Manual. Institute of Labor and Industrial Relations (ILIR), University of Michigan and Wayne State University.
- ^ "Oracle 30th Anniversary Timeline" (PDF). Archived (PDF) from the original on 2011-03-20. Retrieved 23 August 2017.
- ^ a b c "RDBMS Plenary Session: The Later Years" (PDF) (Interview). Interviewed by Burton Grad. Computer History Museum. 2007-06-12. Retrieved 2025-05-30.
- ^ Interview with Wayne Ratliff. The FoxPro History. Retrieved on 2013-07-12.
- ^ Development of an object-oriented DBMS; Portland, Oregon, United States; Pages: 472–482; 1986; ISBN 0-89791-204-7
- ^ Jordan, Meghan. "NoSQL Latency". ScyllaDB. Retrieved 2025-06-09.
- ^ "SQL vs. NoSQL: Full comparison of features, differences, and more". www.testgorilla.com. Retrieved 2025-06-09.
- ^ Graves, Steve. "COTS Databases For Embedded Systems" Archived 2007-11-14 at the Wayback Machine, Embedded Computing Design magazine, January 2007. Retrieved on August 13, 2008.
- ^ Argumentation in Artificial Intelligence by Iyad Rahwan, Guillermo R. Simari
- ^ "OWL DL Semantics". Retrieved 10 December 2010.
- ^ Connolly & Begg 2014, p. 64.
- ^ Connolly & Begg 2014, pp. 97–102.
- ^ Connolly & Begg 2014, p. 102.
- ^ Chong et al. 2007.
- ^ Connolly & Begg 2014, pp. 106–113.
- ^ Connolly & Begg 2014, p. 65.
- ^ Chapple 2005.
- ^ "Structured Query Language (SQL)". International Business Machines. October 27, 2006. Retrieved 2007-06-10.
- ^ Wagner 2010.
- ^ Ramalho, J.C.; Faria, L.; Helder, S.; Coutada, M. (31 December 2013). "Database Preservation Toolkit: A flexible tool to normalize and give access to databases". Biblioteca Nacional de Portugal. University of Minho.
- ^ Paiho, Satu; Tuominen, Pekka; Rökman, Jyri; Ylikerälä, Markus; Pajula, Juha; Siikavirta, Hanne (2022). "Opportunities of collected city data for smart cities". IET Smart Cities. 4 (4): 275–291. doi:10.1049/smc2.12044. ISSN 2631-7680. S2CID 253467923.
- ^ David Y. Chan; Victoria Chiu; Miklos A. Vasarhelyi (2018). Continuous auditing : theory and application (1st ed.). Bingley, UK: Emerald Publishing. ISBN 978-1-78743-413-4. OCLC 1029759767.
- ^ Halder & Cortesi 2011.
- ^ Ben Linders (January 28, 2016). "How Database Administration Fits into DevOps". Retrieved April 15, 2017.
- ^ itl.nist.gov (1993) Integration Definition for Information Modeling (IDEFIX) Archived 2013-12-03 at the Wayback Machine. 21 December 1993.
- ^ a b Date 2003, pp. 31–32.
Sources
- Bachman, Charles W. (1973). "The Programmer as Navigator". Communications of the ACM. 16 (11): 653–658. doi:10.1145/355611.362534.
- Beynon-Davies, Paul (2003). Database Systems (3rd ed.). Palgrave Macmillan. ISBN 978-1403916013.
- Chapple, Mike (2005). "SQL Fundamentals". Databases. About.com. Archived from the original on 22 February 2009. Retrieved 28 January 2009.
- Childs, David L. (1968a). Description of a set-theoretic data structure (PDF) (Technical report). CONCOMP (Research in Conversational Use of Computers) Project. University of Michigan. Technical Report 3.
- Childs, David L. (1968b). Feasibility of a set-theoretic data structure: a general structure based on a reconstituted definition (PDF) (Technical report). CONCOMP (Research in Conversational Use of Computers) Project. University of Michigan. Technical Report 6.
- Chong, Raul F.; Wang, Xiaomei; Dang, Michael; Snow, Dwaine R. (2007). "Introduction to DB2". Understanding DB2: Learning Visually with Examples (2nd ed.). IBM Press Pearson plc. ISBN 978-0131580183. Retrieved 17 March 2013.
- Codd, Edgar F. (1970). "A Relational Model of Data for Large Shared Data Banks" (PDF). Communications of the ACM. 13 (6): 377–387. doi:10.1145/362384.362685. S2CID 207549016.
- Connolly, Thomas M.; Begg, Carolyn E. (2014). Database Systems – A Practical Approach to Design Implementation and Management (6th ed.). Pearson. ISBN 978-1292061184.
- Date, C. J. (2003). An Introduction to Database Systems (8th ed.). Pearson. ISBN 978-0321197849.
- Halder, Raju; Cortesi, Agostino (2011). "Abstract Interpretation of Database Query Languages" (PDF). Computer Languages, Systems & Structures. 38 (2): 123–157. doi:10.1016/j.cl.2011.10.004. ISSN 1477-8424. Archived from the original (PDF) on 2024-11-23. Retrieved 2015-06-18.
- Hershey, William; Easthope, Carol (1972). A set theoretic data structure and retrieval language. Spring Joint Computer Conference, May 1972. ACM SIGIR Forum. Vol. 7, no. 4. pp. 45–55. doi:10.1145/1095495.1095500.
- Nelson, Anne Fulcher; Nelson, William Harris Morehead (2001). Building Electronic Commerce: With Web Database Constructions. Prentice Hall. ISBN 978-0201741308.
- North, Ken (10 March 2010). "Sets, Data Models and Data Independence". Dr. Dobb's. Archived from the original on 24 October 2012.
- Tsichritzis, Dionysios C.; Lochovsky, Fred H. (1982). Data Models. Prentice–Hall. ISBN 978-0131964280.
- Ullman, Jeffrey; Widom, Jennifer (1997). A First Course in Database Systems. Prentice–Hall. ISBN 978-0138613372.
- Wagner, Michael (2010), SQL/XML:2006 – Evaluierung der Standardkonformität ausgewählter Datenbanksysteme, Diplomica Verlag, ISBN 978-3836696098
Further reading
- Ling Liu and Tamer M. Özsu (Eds.) (2009). Encyclopedia of Database Systems, 4100 p. 60 illus. ISBN 978-0-387-49616-0.
- Gray, J. and Reuter, A. Transaction Processing: Concepts and Techniques, 1st edition, Morgan Kaufmann Publishers, 1992.
- Kroenke, David M. and David J. Auer. Database Concepts. 3rd ed. New York: Prentice, 2007.
- Raghu Ramakrishnan and Johannes Gehrke, Database Management Systems.
- Abraham Silberschatz, Henry F. Korth, S. Sudarshan, Database System Concepts.
- Lightstone, S.; Teorey, T.; Nadeau, T. (2007). Physical Database Design: the database professional's guide to exploiting indexes, views, storage, and more. Morgan Kaufmann Press. ISBN 978-0-12-369389-1.
- Teorey, T.; Lightstone, S. and Nadeau, T. Database Modeling & Design: Logical Design, 4th edition, Morgan Kaufmann Press, 2005. ISBN 0-12-685352-5.
- CMU Database courses playlist
- MIT OCW 6.830 | Fall 2010 | Database Systems
- Berkeley CS W186
External links
- DB File extension – information about files with the DB extension
Database
Overview
Definition
A database is an organized collection of structured data, typically stored electronically in a computer system.[1] This structure enables efficient storage, retrieval, and manipulation of information, often modeled in rows and columns within tables for relational databases, though other formats exist for non-relational types.[1] A database is usually controlled by a database management system (DBMS), which acts as an interface between the data and users or applications, facilitating administrative tasks such as performance monitoring, backup, and recovery.[1]
The primary purpose of a database is to manage, store, retrieve, and update large volumes of information reliably and efficiently.[2] Databases support multi-user access, allowing simultaneous queries and modifications while maintaining data integrity (accuracy and completeness), security (through controls like role-based access), and consistency (ensuring data remains synchronized and reliable across operations).[3] These features make databases essential for handling complex data needs beyond what simpler tools can provide.
Unlike traditional file systems, which store data in independent files without built-in mechanisms for enforcing relationships, concurrency control, or integrity constraints, databases offer centralized management to reduce redundancy and errors.[5] Similarly, while spreadsheets are suitable for small-scale, single-user tasks involving calculations and basic data manipulation, databases are designed for larger, structured datasets with advanced querying and multi-user support.[6] This distinction enables databases to serve as robust foundations for modern applications requiring scalability and reliability.
Key characteristics
Databases are distinguished by several essential characteristics that enable them to manage large volumes of structured data effectively, far beyond simple file storage systems. A primary trait is their self-describing nature, where the system stores not only the data but also metadata describing the structure, relationships, and constraints, allowing the database to operate independently of specific applications.[7][8] Data independence is another key property, separating the logical organization of data from physical storage and application programs; changes to storage structures or access methods do not require modifications to programs that use the data.[7][9]
Databases support persistence through durability, ensuring that committed changes survive system failures, and multi-user access, enabling concurrent operations while maintaining consistency via concurrency control mechanisms.[10][8] They incorporate controlled redundancy, ideally storing each data item in one place to minimize duplication while allowing necessary replication for performance, and provide efficient querying through structured organization and optimized retrieval methods.[9]
Core design goals include data integrity, enforced through constraints and ACID properties (Atomicity, Consistency, Isolation, Durability), which guarantee that transactions complete reliably and the database transitions only between valid states.[10] Security restricts unauthorized access via user privileges and authentication, while consistency ensures data remains accurate and coherent across operations.[11][9] Databases encompass various types, such as relational and NoSQL, which adapt these characteristics to different use cases.
Importance and applications
Databases are the backbone of modern digital systems, storing, organizing, and managing data in ways that make it accessible, secure, and actionable across virtually every computing application.[12] They power websites, mobile apps, enterprise platforms, and real-time analytics, enabling concurrent access, data consistency, and integration with applications at scale.[12] Without databases, modern software systems could not efficiently handle the vast amounts of data generated daily, nor support the performance and reliability demanded by contemporary applications.[13]
Databases enable data-driven decision-making by facilitating analysis that identifies trends, patterns, and predictions, which helps organizations operate with greater confidence and efficiency.[14] In business, they underpin critical functions such as customer relationship management, inventory tracking, financial reporting, fraud detection, and transaction processing in banking and e-commerce.[12][14] For example, relational databases manage structured customer and order data across retail locations, while other types support caching for faster website performance or storing user-generated content on social platforms.[12]
Databases also manage large and complex datasets in scientific research and academia, including simulation models of real-world entities and experimental results, supporting knowledge advancement and collaborative analysis.[15][16] In government and public sectors, they support administrative operations, public service delivery, and evidence-based policy through secure, scalable information management.[17] Overall, databases enable scalability, data integrity, and security, making them indispensable for information management across sectors including business, research, and public administration.[12][14][17]
History
1960s: Navigational databases
In the 1960s, the first computerized database management systems emerged, adopting navigational approaches to data access. These systems required programmers to traverse data structures explicitly by following pointers or links between records, typically processing one record at a time in a "record-at-a-time" manner. This navigational paradigm contrasted with later declarative methods and was implemented in both hierarchical and network models.[18][19] A pioneering effort was Charles W. Bachman's Integrated Data Store (IDS), developed at General Electric. Functional specifications for IDS were completed in early 1962, with a prototype operational by the end of that year and a higher-performance version finished in 1964. IDS introduced the network data model, representing relationships between records as a graph and enabling programmers to navigate these links using commands to retrieve and process individual records. Bachman described the programmer's role as a "navigator" through these interconnected structures, and IDS supported features such as data independence, metadata management, and transaction processing enhancements by 1965.[19][20] Bachman's work heavily influenced the Conference on Data Systems Languages (CODASYL) Database Task Group, which formed to standardize database approaches. The group's 1969 report defined a network model standard, building directly on IDS concepts and including data definition and manipulation languages, schemas, and security features such as privacy locks. The CODASYL network model allowed flexible many-to-many relationships through sets, with records accessed via primary keys, set navigation, or sequential scanning.[19][20] Concurrently, IBM developed the Information Management System (IMS), a hierarchical database. Development began in 1966 in collaboration with Rockwell for NASA's Apollo program needs, with the initial version shipped in 1967 and delivered to NASA in 1968. 
IMS organized data in tree-like structures with parent-child relationships, requiring navigation down the hierarchy to access dependent records. It was designed for high-volume transaction processing and efficient management of complex inventories, such as rocket parts tracking, and was commercially announced in 1968 for System/360 mainframes.[21][19][20]
These hierarchical and network systems represented the primary navigational databases of the 1960s, addressing the shift to random-access disk storage and enabling integrated data management on mainframes.
1970s: Relational model emergence
In the 1970s, the relational model fundamentally transformed database theory and practice, shifting away from the navigational approaches of the previous decade. In 1970, IBM researcher Edgar F. Codd published his landmark paper "A Relational Model of Data for Large Shared Data Banks," which introduced a structured way to organize data into tables (relations) linked by values rather than physical pointers or hierarchies. This approach emphasized data independence, allowing applications to access data without knowledge of its internal storage structure.[22][23]
Building on Codd's theoretical foundation, practical implementations emerged later in the decade. In 1973, IBM launched the System R project to develop an industrial-scale relational database system. Led by researchers including Don Chamberlin and Raymond Boyce, the project demonstrated the model's viability through a working prototype. Chamberlin and Boyce also developed Structured Query Language (SQL) as part of System R, enabling users to express queries at a high level that the system could translate into efficient execution plans.[22]
Concurrently, the University of California, Berkeley, initiated the INGRES project in the early 1970s as a research effort to build another early relational database system. These pioneering systems (System R at IBM and INGRES at Berkeley) validated the relational model's potential for managing large shared data banks, setting the stage for its dominance in database technology.[24][22]
1980s–1990s: Desktop and object-oriented databases
In the 1980s, the proliferation of personal computers drove the development of desktop databases, which brought database management capabilities to individual users and small organizations, extending beyond large-scale enterprise systems. Ashton-Tate's dBASE, initially released in 1980 for CP/M and soon ported to the IBM PC, emerged as a leading product due to its intuitive programming language, runtime interpreter architecture, and support for custom applications, quickly becoming one of the most popular PC database tools.[25] By the mid-1980s, versions such as dBASE III (1984) and dBASE III+ (1986) enhanced usability with improved features and character-based menus, helping dBASE achieve dominant market share in PC databases during this period.[25] Other notable desktop databases included Borland's Paradox, introduced in 1985 for DOS, which provided relational capabilities and a user-friendly interface, and products like R:BASE, reflecting the growing accessibility of database tools on personal computers.[26] In the early 1990s, Microsoft Access, released in November 1992, gained rapid adoption on Windows platforms through its graphical interface, integration with Microsoft Office, and ease of use for non-programmers, contributing to the transition of desktop databases toward modern graphical environments.[26] Concurrently, object-oriented databases arose in the late 1980s and 1990s to manage complex, interconnected data structures—such as those in computer-aided design, multimedia, and engineering applications—where relational models encountered limitations in representing hierarchical or composite objects. 
Early prototypes included GemStone (based on Smalltalk), Vbase (using a CLU-like language), and Orion (based on CLOS), which integrated database features with object-oriented principles to minimize the impedance mismatch between programming languages and persistent storage.[27] Commercial systems like ObjectStore (from Object Design) gained traction by the early 1990s, offering native support for object identity, encapsulation, inheritance, and complex object handling.[27] The 1990 Object-Oriented Database System Manifesto defined essential characteristics of these systems, and standardization efforts followed, though the field faced ongoing debates over models, query languages, and commercial viability compared to established relational approaches.[28][27]
2000s onward: NoSQL, NewSQL, and big data
In the late 2000s, the rapid growth of web-scale applications and internet-generated data exposed limitations in traditional relational databases, which were typically designed for single-server deployments and struggled to scale horizontally under massive workloads.[29] This challenge spurred the rise of NoSQL databases, which prioritized distributed architectures, flexible schemas, and high availability to manage large volumes of structured, semi-structured, or unstructured data efficiently.[30] Pioneering systems such as Google's BigTable (described in a 2006 paper) and Amazon's Dynamo (2007) demonstrated scalable, fault-tolerant designs that could span clusters of commodity servers, often trading strict consistency for partition tolerance and availability to support high-velocity applications at companies like Facebook and Twitter.[29]
The emergence of NoSQL coincided with the big data era, which took shape in the mid-2000s as exploding data volumes, variety, and velocity overwhelmed conventional systems.[31] The introduction of Apache Hadoop in 2005 provided a foundational framework for distributed storage and processing of massive datasets across clusters, reinforcing the need for scalable database solutions that could handle petabyte-scale data without relying on expensive vertical scaling.[31] Big data demands accelerated adoption of distributed database designs, influencing both NoSQL's focus on horizontal scalability and the broader shift toward cloud-centric architectures that emphasized resilience and cost-effective storage.[29]
In the early 2010s, NewSQL databases emerged to address NoSQL's trade-offs, such as limited ACID transaction support and lack of full SQL compatibility, while retaining relational capabilities.[32] The term NewSQL was coined in 2011 by analyst Matthew Aslett to describe systems that combined NoSQL-style scalability for online transaction processing workloads with ACID guarantees and SQL interfaces.[32] Google's Spanner (introduced in a 2012 paper) exemplified this approach by enabling globally distributed, strongly consistent transactions, paving the way for cloud-native NewSQL systems designed for geo-replication, fault tolerance, and elastic scaling in distributed environments.[29] These developments reflected ongoing efforts to balance performance, consistency, and scalability in response to big data and cloud computing demands.[33]
Classification and types
Relational databases
Relational databases organize data into structured tables, where each table consists of rows and columns. Rows represent individual records or tuples, while columns define specific attributes or fields of those records. For example, a customer table might include columns for customer ID, name, and address, with each row containing data for a unique customer.[34][35] Tables in relational databases are linked through keys. A primary key is a unique identifier for each row in a table, ensuring no duplicates and enabling precise record retrieval. A foreign key in one table references the primary key in another table, establishing relationships between tables, such as linking orders to customers. These relationships allow data from multiple tables to be combined efficiently for queries and analysis.[34][36] To reduce data redundancy and enhance integrity, relational databases employ normalization, a systematic design process that organizes data into appropriately structured tables. Normalization minimizes duplication by ensuring that related data is stored in a single place rather than repeated across records, thereby improving consistency and simplifying maintenance.[34] The standard language for interacting with relational databases is Structured Query Language (SQL), a standardized programming language that enables users to create, retrieve, update, delete, and manage data. SQL supports complex operations, such as joining tables based on shared keys to generate reports or insights.[37][34] Most relational database systems adhere to ACID properties to guarantee reliable transaction processing. Atomicity ensures that transactions are completed fully or not at all; consistency maintains data validity according to defined rules; isolation prevents concurrent transactions from interfering with each other; and durability guarantees that committed changes persist even after system failures. 
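The ideas above (tables, primary and foreign keys, joins, and atomic transactions) can be sketched with Python's built-in sqlite3 module. This is a minimal illustration rather than a recommended design: the `customers` and `orders` tables, their columns, and the sample rows are hypothetical, and SQLite is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical schema: each order references exactly one customer.
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
conn.execute(
    """CREATE TABLE orders (
           order_id INTEGER PRIMARY KEY,
           customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
           total REAL NOT NULL
       )"""
)

with conn:  # commits if the block succeeds
    conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
    conn.execute("INSERT INTO customers VALUES (2, 'Grace')")
    conn.execute("INSERT INTO orders VALUES (10, 1, 25.0)")
    conn.execute("INSERT INTO orders VALUES (11, 1, 40.0)")

# Join the two tables on the shared key to summarize orders per customer.
rows = conn.execute(
    """SELECT c.name, COUNT(o.order_id), COALESCE(SUM(o.total), 0)
       FROM customers AS c
       LEFT JOIN orders AS o ON o.customer_id = c.customer_id
       GROUP BY c.customer_id
       ORDER BY c.customer_id"""
).fetchall()

# Atomicity: the second insert violates the foreign key, so the whole
# transaction (including the first, valid insert) is rolled back.
try:
    with conn:
        conn.execute("INSERT INTO orders VALUES (12, 2, 5.0)")
        conn.execute("INSERT INTO orders VALUES (13, 99, 5.0)")
except sqlite3.IntegrityError:
    pass

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

After the failed transaction, `order_count` still reflects only the two orders committed earlier, which is the all-or-nothing behavior that atomicity describes.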
The ACID properties make relational databases suitable for applications requiring high reliability.[34][36][35] Relational databases remain the most common type of database in enterprise and transactional applications due to their structured organization, robust integrity mechanisms, and widespread support.[34]
NoSQL databases
NoSQL databases, also known as non-relational databases, are designed to store and manage data using flexible data models that do not rely on fixed tabular schemas or the relational model. They accommodate unstructured and semi-structured data efficiently, enabling developers to adapt to evolving application requirements without predefined structures. This flexibility supports rapid iteration and handles diverse data types effectively.[30][38][39] NoSQL databases prioritize horizontal scalability, distributing data across multiple servers to manage large volumes and high throughput with minimal downtime, often through techniques such as sharding and replication. They typically follow the BASE principles (basically available, soft state, eventual consistency) rather than strict ACID compliance, though some implementations support ACID transactions. These characteristics make them well-suited for big data, real-time applications, and distributed systems.[30][4] NoSQL databases are commonly classified into four primary categories based on their data models: document, key-value, wide-column (also known as column-family or column-oriented), and graph. Each category optimizes for specific access patterns and workloads.[30][38][39] Document databases store data as semi-structured documents, typically in formats resembling JSON, BSON, or XML, where each document contains fields that can vary across records. This model supports nested structures and hierarchical data, eliminating the need for joins in many cases and enabling efficient retrieval of related information. They excel in scenarios involving content management, user profiles, catalogs, and applications with evolving schemas. Representative examples include MongoDB and Couchbase.[30][38][4] Key-value databases represent the simplest NoSQL model, associating unique keys with arbitrary values that can range from simple strings to complex objects. 
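As a rough illustration of the key-value model combined with the sharding mentioned earlier, the sketch below spreads keys across several in-memory dictionaries. The class name `ShardedKeyValueStore`, the sample keys, and the modulo-of-hash placement rule are assumptions made for this example; production systems generally use more elaborate partitioning and add replication.

```python
import hashlib


class ShardedKeyValueStore:
    """Toy key-value store that partitions keys across in-memory shards."""

    def __init__(self, shard_count=4):
        # Each dict stands in for a separate server (an assumption).
        self.shards = [dict() for _ in range(shard_count)]

    def _shard_for(self, key):
        # Hash the key so placement is stable across runs and processes.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def put(self, key, value):
        self.shards[self._shard_for(key)][key] = value

    def get(self, key, default=None):
        return self.shards[self._shard_for(key)].get(key, default)


store = ShardedKeyValueStore(shard_count=3)
# Values are opaque to the store: a nested structure or a plain string.
store.put("session:42", {"user": "ada", "cart": ["book"]})
store.put("pref:ada", "dark-mode")

session = store.get("session:42")
missing = store.get("no-such-key")
```

Lookups touch exactly one shard because the key alone determines placement, which is one reason this model partitions easily.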
Key-value stores provide high-performance reads and writes, often in-memory, and are highly partitionable for massive scale. This category suits caching, session management, user preferences, and high-throughput applications such as gaming or IoT. Representative examples include Redis and Amazon DynamoDB.[30][38][4]
Wide-column stores organize data into rows and dynamic columns, where each row can contain different columns, allowing sparse data representation and efficient compression. They optimize for analytical queries over large datasets and support high write throughput across distributed clusters. This model is effective for big data applications, time-series data, and scenarios requiring column-level access, such as recommendation systems or fraud detection. Representative examples include Apache Cassandra, Apache HBase, and Google Bigtable.[30][39][4]
Graph databases represent data as nodes (entities) connected by edges (relationships), with properties attached to both, making them ideal for highly interconnected data where relationships are central. They enable efficient traversal of complex networks for queries involving social connections, recommendations, or fraud patterns. Representative examples include Neo4j and Amazon Neptune.[30][38][39]
In many scenarios, NoSQL databases complement relational databases rather than replacing them, particularly when applications prioritize scalability, flexible data handling, or performance for specific workloads over complex transactional consistency.[38][30]
Other specialized types
Beyond the relational and NoSQL categories, several specialized database types address distinct data characteristics and workload requirements. Object-oriented databases store data as objects that encapsulate both attributes and behaviors, aligning closely with object-oriented programming paradigms. These databases represent information using classes, inheritance, polymorphism, and encapsulation, allowing complex data structures to be managed without the impedance mismatch often encountered when mapping application objects to relational tables. This design facilitates seamless integration with object-oriented languages and supports efficient handling of intricate relationships and hierarchies. They are particularly suited to applications involving complex data models, such as computer-aided design (CAD), multimedia systems, telecommunications networks, and scientific research domains like bioinformatics. Examples include systems like ObjectDB and GemStone/S.[40][41] Time-series databases are optimized for storing, retrieving, and analyzing time-stamped data points, such as metrics, sensor readings, events, and logs, where time is a primary dimension. They provide high ingestion throughput for sequential data streams, specialized compression tailored to timestamp patterns, lifecycle management for retention and downsampling, and efficient queries over time ranges or aggregations. These capabilities enable fast scans of large chronological datasets and support real-time monitoring alongside historical analysis. Common applications include infrastructure and application performance monitoring, Internet of Things (IoT) deployments, financial market tracking, energy grid management, and operational analytics. 
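Two of the time-series operations described above, time-range queries and downsampling, can be sketched in a few lines of Python. The function names `range_query` and `downsample`, the integer-second timestamps, and the sample readings are hypothetical; real time-series databases implement these operations with specialized storage and indexing.

```python
from collections import defaultdict
from statistics import mean


def range_query(points, start, end):
    """Return the points whose timestamp falls in the half-open range [start, end)."""
    return [(ts, value) for ts, value in points if start <= ts < end]


def downsample(points, bucket_seconds):
    """Average the values inside fixed-width time buckets."""
    buckets = defaultdict(list)
    for ts, value in points:
        bucket_start = ts - ts % bucket_seconds
        buckets[bucket_start].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Hypothetical sensor readings as (seconds since start, temperature) pairs.
readings = [(0, 20.0), (30, 22.0), (60, 21.0), (90, 23.0), (120, 25.0)]

window = range_query(readings, 30, 91)
per_minute = downsample(readings, 60)
```

Downsampling trades resolution for space: five raw readings become three one-minute averages, which is the kind of lifecycle reduction applied to older data.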
Prominent time-series databases include InfluxDB, TimescaleDB, and Prometheus.[42][41]
Other specialized types, such as spatial databases for geographic and location-based data, serve niche domains but are less commonly categorized separately from the primary paradigms.[43]
Database models
Hierarchical and network models
The hierarchical and network models represent early navigational database approaches that dominated mainframe computing in the 1960s and 1970s, prior to the widespread adoption of relational systems. These models organize data using pointers and links for traversal, requiring applications to navigate predefined paths to access records. The hierarchical model structures data in a tree-like format, with a single root node at the top and parent-child relationships extending downward. Each child record links to exactly one parent, supporting one-to-many associations, while parents can have multiple children. This rigid structure simplifies representation of nested or subordinate data, such as organizational charts or parts inventories, but limits flexibility for complex interconnections. Data access occurs by following hierarchical paths through pointers, making retrieval efficient for anticipated queries yet requiring code changes when relationships evolve. The most influential implementation was IBM's Information Management System (IMS), developed starting in the mid-1960s to track rocket parts and manage inventory for NASA's Apollo program. IMS, with its first operational message in 1968, combined a hierarchical database with high-volume transaction processing capabilities, becoming a commercial standard for applications like banking, airline reservations, and manufacturing.[21][3][44] The network model builds on hierarchical principles but introduces greater flexibility by permitting many-to-many relationships. Records (or sets) can connect to multiple parents, forming a graph-like structure rather than a strict tree. This allows more natural representation of real-world associations, such as students enrolled in multiple courses or parts supplied by multiple vendors. Like the hierarchical model, navigation relies on pointers between records, but the additional links enable more direct access paths. 
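The difference between the two navigational structures can be sketched with linked records in Python. The `Record` class, the `link` and `walk` helpers, and the sample data are hypothetical and only mimic pointer-following; they do not reproduce the IMS or CODASYL interfaces.

```python
class Record:
    """A record that is reached by following links rather than by query."""

    def __init__(self, name):
        self.name = name
        self.parents = []   # a hierarchical record has at most one parent
        self.children = []


def link(parent, child):
    parent.children.append(child)
    child.parents.append(parent)


def walk(record, depth=0):
    """Depth-first, record-at-a-time traversal along child links."""
    yield depth, record.name
    for child in record.children:
        yield from walk(child, depth + 1)


# Hierarchical: a strict tree in which every child has exactly one parent.
rocket, stage, engine = Record("rocket"), Record("stage 1"), Record("engine")
link(rocket, stage)
link(stage, engine)
tree_order = [name for _, name in walk(rocket)]

# Network: a student record can be owned by several course records.
math, physics = Record("math"), Record("physics")
alice, bob = Record("alice"), Record("bob")
link(math, alice)
link(physics, alice)
link(math, bob)
alice_courses = [course.name for course in alice.parents]
```

In both cases the application code decides which links to follow, which is why a change in relationships tends to require a change in the program.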
The network model was formalized through the Conference on Data Systems Languages (CODASYL) Database Task Group, which published initial specifications in 1969 and a major update in 1971.[3]
These navigational models provided reliable, high-performance data management for their era but required intricate programming to handle queries and schema changes. They were largely superseded by relational approaches, which offered greater data independence and ad-hoc query support.
Relational model
The relational model, introduced by Edgar F. Codd in 1970, represents data as relations, which are sets of tuples organized into tables where each row corresponds to a tuple and each column corresponds to an attribute drawn from a domain.[45] Relations have properties including unordered rows, distinct rows, and ordered columns corresponding to domains, though users interact with domain-unordered relationships using role names to distinguish identical domains.[45] This structure provides data independence by shielding users from internal storage representations.[45]
A primary key is a domain or combination of domains whose values uniquely identify each tuple in the relation, and it is nonredundant when no component is superfluous.[45] A foreign key is a domain or combination in one relation whose values match the primary key of another relation, enabling cross-referencing without embedding entity descriptions.[45]
The relational model includes basic operations for manipulating relations. Projection selects specified columns from a relation and removes duplicate rows to produce a new relation.[45] Selection (also called restriction) generates a subset of a relation by applying conditions or restricting it based on another relation, ensuring the result matches specified criteria.[45] Join combines two relations on common domains, with the natural join preserving all information from both while matching values in the shared domain to form a new relation.[45]
In 1985, Codd proposed twelve rules to define what constitutes a truly relational database management system. These rules are:
(1) All information is represented explicitly in tables by values.
(2) Every datum is accessible using table name, primary key, and column name.
(3) Null values represent missing or inapplicable information systematically.
(4) The database description is stored in the relational catalog at the logical level.
(5) A comprehensive data sublanguage supports data definition, manipulation, integrity, authorization, and transactions.
(6) All theoretically updatable views are updatable.
(7) Insert, update, and delete operations apply to sets of rows, not just single rows.
(8) Applications remain unchanged by changes in physical storage or access methods.
(9) Applications remain unchanged by logical changes to base tables that preserve information.
(10) Integrity constraints are defined in the sublanguage and stored in the catalog.
(11) The data language handles distributed data without impacting applications.
(12) Low-level record-at-a-time languages cannot subvert relational integrity constraints.
These criteria have influenced the design of relational systems since their proposal.
Object-oriented and post-relational models
Object-oriented database models represent data as objects that encapsulate both attributes (data) and methods (behavior), directly aligning with object-oriented programming paradigms.[40] Objects are instances of classes, which define shared structures and behaviors, while supporting core principles such as encapsulation, inheritance, and polymorphism.[46] This approach enables a seamless mapping of application-level objects to persistent storage, eliminating much of the impedance mismatch encountered when using relational databases for complex domains.[40] Object-oriented databases emerged in the mid-1980s to address the shortcomings of relational models in managing complex, interrelated data structures found in applications such as computer-aided design, telecommunications, scientific simulations, and multimedia systems.[47] They provide advantages in code reusability through inheritance, reduced maintenance costs, and more intuitive representation of real-world entities compared to flat tabular structures.[46] Post-relational models refer to database approaches that provide a more general data model than the traditional relational model or extend it beyond strict limitations, sometimes termed hybrid or object-enhanced RDBMS. Examples include object-relational databases (which add object features like user-defined types to relational systems), multivalue models (allowing multiple values per field), and graph models. These support more flexible data representation for complex or non-tabular structures. In contrast, NewSQL databases are a class of modern relational systems that combine ACID compliance, strong consistency, and SQL querying with horizontal scalability and distributed architectures for high-throughput OLTP workloads.[48] They address scalability limitations of traditional relational databases while preserving transactional integrity. 
Representative systems include CockroachDB (strongly consistent distributed SQL) and Google Spanner (globally distributed transactions with high availability).[48][49]
Database management systems
Core components and architecture
The core components of a database management system (DBMS) primarily include the query processor and storage manager (also known as the storage engine), which work together to handle data access, consistency, and protection. These components interact in a layered architecture that separates concerns for efficiency and reliability, with the query processor handling high-level operations and the storage manager managing lower-level persistence. Transaction management and security/authorization are integrated functions within these components.[50][51] The query processor is responsible for parsing, validating, optimizing, and coordinating the execution of user queries (typically in SQL or similar languages). It includes subcomponents such as the parser (which checks syntax and semantics), optimizer (which generates efficient execution plans based on statistics and cost models), and executor (which orchestrates dataflow through operators). The query processor relies on metadata from the system catalog, interacts with the storage manager to retrieve or update data, and enforces authorization checks during parsing and execution.[50][51] The storage manager manages the physical organization and access of data on disk or other persistent media. It includes access methods (such as heaps or B+-trees for indexing), buffer management (which caches pages in memory to minimize I/O), and file allocation. The storage manager ensures efficient data retrieval and updates, often using shared buffer pools, and incorporates transaction management functions to maintain consistency during modifications. 
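Buffer management, the part of the storage manager that caches pages in memory, can be sketched as a small page cache. The `BufferPool` class, the dictionary standing in for the disk, and the least-recently-used eviction policy are assumptions chosen for this example; the text above does not say which replacement policy a given DBMS uses.

```python
from collections import OrderedDict


class BufferPool:
    """Toy page cache with least-recently-used (LRU) eviction."""

    def __init__(self, capacity, read_page):
        self.capacity = capacity
        self.read_page = read_page   # callable that simulates a disk read
        self.pages = OrderedDict()   # page_id -> contents, oldest first
        self.hits = 0
        self.misses = 0

    def get(self, page_id):
        if page_id in self.pages:
            self.hits += 1
            self.pages.move_to_end(page_id)   # mark as most recently used
            return self.pages[page_id]
        self.misses += 1
        page = self.read_page(page_id)        # the expensive I/O path
        self.pages[page_id] = page
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)    # evict least recently used
        return page


# A dict stands in for the on-disk pages (hypothetical data).
disk = {n: f"page-{n}" for n in range(5)}
pool = BufferPool(capacity=2, read_page=disk.__getitem__)

for page_id in [0, 1, 0, 2, 1]:
    pool.get(page_id)
```

With room for two pages, the access pattern above produces one hit (the second read of page 0) and four simulated disk reads, leaving pages 2 and 1 cached.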
It also includes subcomponents like the transaction manager (enforcing ACID properties through locking and write-ahead logging) and authorization manager (for access control).[50][51] Transaction management enforces the ACID properties (Atomicity, Consistency, Isolation, Durability) through mechanisms such as locking (via a lock manager) and recovery via write-ahead logging (WAL), where changes are logged before application to data pages. These functions are typically integrated within the storage manager and interact with the query processor to track transaction state and ensure atomic updates and crash recovery.[50][51] Authorization and security functions enforce access control and data protection, primarily integrated into the query processor (validating privileges during query parsing and execution) and storage manager (via authorization metadata in the catalog). These checks occur at connection time for credentials and during operations for privileges such as SELECT or UPDATE, often including row-level enforcement where supported.[50][51] DBMS architectures commonly adopt a client-server model, where clients connect to a database server via protocols like ODBC or JDBC (two-tier), or through intermediate application servers (multi-tier) for added scalability and security. In contrast, embedded architectures integrate the DBMS directly into applications, with no separate server process, offering simplicity for single-user or lightweight scenarios. Many modern DBMSs separate logical and physical data representations to achieve data independence, allowing changes to storage structures without affecting application views.[50][51]Essential functions
A database management system (DBMS) provides essential functions to define, manipulate, and control data within a database, while ensuring reliable multi-user access through concurrency control, recovery mechanisms, and authorization. These functions enable efficient management of structured data across applications and users.[52][53]
Data definition allows the creation and modification of database structures, including schemas, tables, indexes, views, and constraints. This is typically achieved through a Data Definition Language (DDL), such as SQL commands like CREATE, ALTER, and DROP, which define the logical organization of data and enforce integrity rules such as primary keys and foreign keys. The DBMS maintains a metadata catalog (also known as the system catalog or data dictionary) that stores details about these structures for reference and validation.[52][54]
Data manipulation involves operations to insert, update, delete, and retrieve data. A Data Manipulation Language (DML), commonly SQL commands such as INSERT, UPDATE, DELETE, and SELECT, enables users and applications to modify and query database content while preserving consistency. The DBMS optimizes these operations through its storage engine and query processor to handle efficient data access.[52][53][54]
Data control governs access and privileges through a Data Control Language (DCL), such as GRANT and REVOKE commands in SQL, which specify what operations authorized users can perform on specific objects. This function, combined with authorization mechanisms, ensures that permissions are assigned based on roles or users, restricting unauthorized actions.[52][53]
Concurrency control coordinates simultaneous access by multiple users or applications to prevent data inconsistencies or corruption. The DBMS employs techniques like locking (via a lock manager) to manage concurrent transactions, ensuring that operations do not interfere destructively. This function supports reliable multi-user environments and underpins ACID compliance by maintaining isolation and consistency during concurrent operations.[52][53]
Recovery mechanisms protect data integrity after failures such as hardware crashes or software errors. The DBMS uses logging (via a log manager) to record changes, enabling rollback of incomplete transactions and restoration to a consistent state. Backup utilities support full, incremental, or differential backups, while recovery tools allow restoration with minimal data loss.[52][54]
Database languages
SQL
Structured Query Language (SQL) is a standardized language designed for managing and querying data in relational database management systems (RDBMS). It allows users to create, retrieve, update, delete, and control access to structured data organized in tables with rows and columns linked by keys.[55] SQL is declarative, specifying desired results rather than procedural steps, making it accessible for data operations across systems like IBM Db2, PostgreSQL, Oracle Database, and Microsoft SQL Server.[55]
SQL was originally developed in the 1970s by IBM researchers and standardized by the American National Standards Institute (ANSI) in 1986 and the International Organization for Standardization (ISO) in 1987; the standard is maintained today as ISO/IEC 9075.[37] These ANSI/ISO standards define core syntax and commands to promote compatibility across database platforms, though vendors add proprietary extensions (such as T-SQL or PL/SQL) while maintaining compliance with core elements.[55]
SQL commands are categorized into several groups, including Data Definition Language (DDL), Data Manipulation Language (DML), Data Control Language (DCL), and Transaction Control Language (TCL). DDL manages database structures and objects such as tables, views, and indexes, using commands like CREATE, ALTER, DROP, and TRUNCATE. For example:
CREATE TABLE products (
product_id INT PRIMARY KEY,
name VARCHAR(100),
price DECIMAL(10, 2)
);
[55]
DML handles data content within those structures through commands such as INSERT, UPDATE, and DELETE. Examples include:
INSERT INTO customers (name, email, city)
VALUES ('Jane Doe', 'jane.doe@example.com', 'Los Angeles');
UPDATE customers
SET email = 'john.doe@example.com'
WHERE name = 'John Doe';
[55]
DCL governs access and permissions with commands like GRANT and REVOKE, enabling administrators to control user privileges on database objects.[37]
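The effect of GRANT and REVOKE can be pictured with a small in-memory model. The sketch below is not how any particular DBMS stores privileges; the `Privileges` class, its method names, and the `analyst` role are hypothetical and exist only to show that a user's effective privileges are the union of direct grants and grants made to the user's roles.

```python
from collections import defaultdict

class Privileges:
    """Toy privilege catalog (hypothetical, for illustration only)."""

    def __init__(self):
        self._grants = defaultdict(set)  # grantee -> {(privilege, object)}
        self._roles = defaultdict(set)   # user -> {role names}

    def grant(self, privilege, obj, grantee):
        # Mirrors: GRANT <privilege> ON <obj> TO <grantee>
        self._grants[grantee].add((privilege, obj))

    def revoke(self, privilege, obj, grantee):
        # Mirrors: REVOKE <privilege> ON <obj> FROM <grantee>
        self._grants[grantee].discard((privilege, obj))

    def add_role(self, user, role):
        self._roles[user].add(role)

    def allowed(self, user, privilege, obj):
        # A user holds a privilege directly or through any assigned role.
        holders = {user} | self._roles[user]
        return any((privilege, obj) in self._grants[h] for h in holders)

acl = Privileges()
acl.grant("SELECT", "products", "analyst")   # grant to a role
acl.add_role("jane", "analyst")

can_read = acl.allowed("jane", "SELECT", "products")
can_write = acl.allowed("jane", "UPDATE", "products")
acl.revoke("SELECT", "products", "analyst")
can_read_after_revoke = acl.allowed("jane", "SELECT", "products")

print(can_read, can_write, can_read_after_revoke)  # True False False
```

Granting to a role rather than to each user means one REVOKE removes the privilege from every member at once, which is the administrative benefit usually attributed to role-based schemes.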
TCL manages transactions to ensure data consistency, using commands such as COMMIT, ROLLBACK, and SAVEPOINT to confirm or undo changes made by DML operations.[37]
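The behavior of COMMIT, ROLLBACK, and SAVEPOINT can be observed with Python's built-in `sqlite3` module and an in-memory database. This is a minimal sketch: the `accounts` table, the balances, and the `bonus` savepoint name are assumptions made for illustration, and SQLite is used only because it ships with Python.

```python
import sqlite3

# isolation_level=None disables implicit transactions, so the boundaries
# below are exactly the BEGIN/COMMIT/ROLLBACK statements we issue.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 50)")

def balances():
    """Return {account_id: balance} for all accounts."""
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))

# Transaction 1: a half-finished transfer is abandoned with ROLLBACK.
conn.execute("BEGIN")
conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
conn.execute("ROLLBACK")                      # the debit is undone
after_rollback = balances()

# Transaction 2: SAVEPOINT undoes only part of the work before COMMIT.
conn.execute("BEGIN")
conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
conn.execute("SAVEPOINT bonus")
conn.execute("UPDATE accounts SET balance = balance + 999 WHERE id = 2")
conn.execute("ROLLBACK TO SAVEPOINT bonus")   # discard only the bonus update
conn.execute("COMMIT")                        # the transfer becomes permanent
after_commit = balances()

print(after_rollback)  # {1: 100, 2: 50}
print(after_commit)    # {1: 70, 2: 80}
```

The total across both accounts is 150 before and after the committed transfer, which is the invariant a transfer transaction is expected to preserve.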
The primary querying mechanism in SQL is the SELECT statement, which retrieves data from one or more tables, often combined with clauses like WHERE for filtering, ORDER BY for sorting, GROUP BY for grouping, and HAVING for filtering aggregated results. SELECT supports combining data across tables via JOIN operations, which link rows based on related columns. Common join types include INNER JOIN (matching rows only), LEFT JOIN (all rows from the left table, with NULLs where the right table has no match), RIGHT JOIN (the mirror image of LEFT JOIN), and FULL JOIN (all rows from both tables, matched where possible). For example:
SELECT c.name, p.name AS product_name
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN products p ON o.product_id = p.product_id
WHERE c.city = 'New York';
[55]
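The difference between an inner join and a left join is easiest to see on concrete rows. The following sketch runs a two-table version of the query above in an in-memory SQLite database through Python's `sqlite3` module; the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, product_id INTEGER);
INSERT INTO customers VALUES (1, 'Ana', 'New York'), (2, 'Ben', 'New York'), (3, 'Cy', 'Boston');
INSERT INTO orders VALUES (10, 1, 1), (11, 1, 2);
""")

# INNER JOIN: only customers that have at least one matching order.
inner_rows = conn.execute("""
    SELECT c.name, o.order_id
    FROM customers c
    JOIN orders o ON c.customer_id = o.customer_id
    WHERE c.city = 'New York'
    ORDER BY c.name, o.order_id
""").fetchall()

# LEFT JOIN: every New York customer, with NULL (None) where no order exists.
left_rows = conn.execute("""
    SELECT c.name, o.order_id
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    WHERE c.city = 'New York'
    ORDER BY c.name, o.order_id
""").fetchall()

print(inner_rows)  # [('Ana', 10), ('Ana', 11)]
print(left_rows)   # [('Ana', 10), ('Ana', 11), ('Ben', None)]
```

Ben has no orders, so he disappears from the inner join but survives the left join with a NULL order.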
Subqueries (nested queries) return data used by an outer query, appearing in clauses such as WHERE, SELECT, FROM, or HAVING. Types include single-value subqueries (returning one value, often with aggregates), list subqueries (using IN/NOT IN), and existence subqueries (using EXISTS/NOT EXISTS). For example:
SELECT name
FROM customers
WHERE customer_id IN (SELECT customer_id FROM orders WHERE product_id = 1);
Many subqueries can be equivalently expressed as joins, though performance may vary by context.[56]
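As a hedged illustration of that equivalence, the sketch below runs the IN subquery, an EXISTS form, and a join form against the same invented rows in an in-memory SQLite database. It also shows one common pitfall: without DISTINCT, the join repeats a customer once per matching order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, product_id INTEGER);
INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy');
-- Ana ordered product 1 twice; Cy ordered a different product.
INSERT INTO orders VALUES (10, 1, 1), (11, 1, 1), (12, 3, 2);
""")

def names(sql):
    """Run a query and return the first column of each row as a list."""
    return [row[0] for row in conn.execute(sql)]

with_in = names("""
    SELECT name FROM customers
    WHERE customer_id IN (SELECT customer_id FROM orders WHERE product_id = 1)
    ORDER BY name""")

with_exists = names("""
    SELECT name FROM customers c
    WHERE EXISTS (SELECT 1 FROM orders o
                  WHERE o.customer_id = c.customer_id AND o.product_id = 1)
    ORDER BY name""")

# DISTINCT collapses the duplicates a join produces (names are unique here).
with_join = names("""
    SELECT DISTINCT c.name FROM customers c
    JOIN orders o ON o.customer_id = c.customer_id
    WHERE o.product_id = 1
    ORDER BY c.name""")

join_without_distinct = names("""
    SELECT c.name FROM customers c
    JOIN orders o ON o.customer_id = c.customer_id
    WHERE o.product_id = 1
    ORDER BY c.name""")

print(with_in, with_exists, with_join)  # ['Ana'] ['Ana'] ['Ana']
print(join_without_distinct)            # ['Ana', 'Ana']
```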
Aggregation applies functions like COUNT, SUM, AVG, MIN, and MAX to compute summaries over data sets, typically with GROUP BY to partition results. For example:
SELECT city, COUNT(*)
FROM customers
GROUP BY city;
These features enable efficient querying, data combination, and analysis while adhering to SQL's standardized foundation.[55]
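The aggregation query above can be run against sample rows to show how HAVING filters groups after aggregation, whereas WHERE filters rows before it. The data below is invented, and the sketch uses SQLite through Python's `sqlite3` module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (name TEXT, city TEXT);
INSERT INTO customers VALUES
  ('Ana', 'New York'), ('Ben', 'New York'), ('Cy', 'Boston'),
  ('Di', 'Los Angeles'), ('Ed', 'Los Angeles'), ('Flo', 'Los Angeles');
""")

# One output row per city, with the number of customers in it.
per_city = conn.execute(
    "SELECT city, COUNT(*) FROM customers GROUP BY city ORDER BY city"
).fetchall()

# HAVING keeps only the groups whose aggregate meets the condition.
busy_cities = conn.execute("""
    SELECT city, COUNT(*) AS n
    FROM customers
    GROUP BY city
    HAVING COUNT(*) >= 2
    ORDER BY n DESC, city
""").fetchall()

print(per_city)     # [('Boston', 1), ('Los Angeles', 3), ('New York', 2)]
print(busy_cities)  # [('Los Angeles', 3), ('New York', 2)]
```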
Non-SQL query languages
Non-SQL query languages are diverse tools used primarily in non-relational databases to query and manipulate data, tailored to specific data models such as document, graph, key-value, or columnar stores. Unlike SQL, which is standardized for relational databases, these languages vary widely in syntax, expressiveness, and paradigm, reflecting the flexibility and specialization of NoSQL systems.[30][38]
A fundamental distinction among query languages is between declarative and imperative styles. Declarative languages specify what data is desired, leaving the how of execution (such as traversal paths or optimization) to the database engine, enabling better optimization and often simpler queries. Imperative languages specify how to perform the query through explicit step-by-step instructions, offering fine-grained control but requiring deeper knowledge of the underlying implementation and increasing the risk of errors.[57]
Many non-SQL query languages adopt a declarative approach. For document-oriented databases, MongoDB employs the MongoDB Query Language (MQL), which uses JSON-like syntax to filter, project, and aggregate nested documents directly, avoiding the need for joins common in relational systems.[30] For graph databases, Neo4j's Cypher is a declarative language focused on pattern matching over nodes and relationships, allowing concise expressions of complex traversals.[57] Apache TinkerPop's Gremlin supports graph traversal with both declarative and imperative features, enabling procedural control when needed.[57][38]
In contrast, key-value stores typically rely on imperative APIs rather than full query languages. Systems like Amazon DynamoDB or Redis use simple operations such as get(key), put(key, value), or set key value to access or update data by key, providing low-level, procedural interaction suited to high-speed lookups.[38]
Other examples include SPARQL for RDF-based graph data, which is declarative and focuses on triple patterns, and specialized languages like AQL in multi-model systems such as ArangoDB. These alternatives highlight the adaptation of query paradigms to non-tabular data structures, prioritizing performance, scalability, or expressiveness for specific use cases.[57][58]
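The contrast between imperative key-value access and a declarative document filter can be sketched in a few lines. This is a toy, not the API of MongoDB, Redis, or DynamoDB: the `matches` and `find` helpers are hypothetical, and only a small subset of MongoDB-style comparison operators is imitated.

```python
import operator

# Imperative key-value access: the caller names the exact key to read or write.
kv = {}
kv["user:1"] = {"name": "Ana", "city": "New York", "age": 34}  # put(key, value)
kv["user:2"] = {"name": "Ben", "city": "Boston", "age": 27}
kv["user:3"] = {"name": "Cy", "city": "New York", "age": 22}
ana = kv.get("user:1")                                          # get(key)

# Declarative document filter: the caller describes the documents wanted and
# the "engine" (here a plain loop) decides how to find them.
OPERATORS = {"$eq": operator.eq, "$gt": operator.gt, "$lt": operator.lt}

def matches(doc, query):
    """Return True if doc satisfies every field condition in query."""
    for field, condition in query.items():
        if field not in doc:
            return False
        if isinstance(condition, dict):
            # e.g. {"age": {"$gt": 30}}: every operator must hold.
            if not all(OPERATORS[op](doc[field], arg) for op, arg in condition.items()):
                return False
        elif doc[field] != condition:
            # e.g. {"city": "New York"}: plain equality.
            return False
    return True

def find(documents, query):
    return [doc for doc in documents if matches(doc, query)]

older_new_yorkers = find(kv.values(), {"city": "New York", "age": {"$gt": 30}})
print(ana["name"])                                 # Ana
print([doc["name"] for doc in older_new_yorkers])  # ['Ana']
```

A real engine would be free to answer the same filter with an index instead of a scan, which is the optimization freedom that declarative queries leave to the database.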
Storage and implementation
Physical storage structures
In database systems, data is persistently stored on secondary storage devices, such as hard disks or solid-state drives. The fundamental unit of storage and data transfer between disk and main memory is a fixed-size block or page, typically a few kilobytes in size (e.g., 4 KB to 16 KB, depending on the system).[59][60] Database files are organized as collections of these pages, managed by the DBMS storage manager, which handles allocation, free space tracking, and mapping of logical page identifiers to physical disk locations. Pages may contain tuples (records), metadata, or index entries, often using a slotted page layout where a header and slot array track record locations and free space to support variable-length records.[59][61]
A common physical organization is the heap file, in which records are stored in no particular order, typically appended to available space in pages or inserted into free slots within existing pages. Heap files support fast insertions but require full scans or auxiliary structures for efficient retrieval, as records have no inherent ordering.[59][62][60] Other file organizations include sequential files, where records are maintained in sorted order based on a key, and hash-based files, where a hash function determines the target page or bucket for each record to enable direct access via equality lookups.[60][62]
To enable fast access without scanning entire files, databases use indexing structures that map key values to record locations. Common structures include B-trees (often B+ trees), which maintain sorted keys in a balanced tree for efficient equality and range queries, and hash indexes, which apply a hash function to distribute keys and provide constant-time average-case lookups for equality conditions. Indexes are stored separately or integrated with the data and reference records via physical pointers (such as page identifiers and slot numbers).[59][60][63]
Advanced features (replication, materialized views)
Advanced features in database storage and implementation include replication and materialized views, which address challenges in availability, scalability, fault tolerance, and query performance.
Database replication maintains multiple copies of data across servers to enhance high availability and scalability.[64] It ensures continued operation if a server fails, as replicas can serve requests, and distributes read workloads to improve throughput.[65] Replication supports fault tolerance and disaster recovery by providing redundant copies, reducing downtime and data loss risks.[65] Common models include master-slave replication, where a primary server handles writes and propagates changes to secondary servers, and multi-master replication, allowing writes across nodes with mechanisms for conflict resolution.[64] Synchronous replication commits transactions only after updates propagate to replicas, ensuring strong consistency at the cost of potential latency, while asynchronous replication prioritizes performance but may introduce temporary inconsistencies.[64]
Materialized views store precomputed query results as physical tables to accelerate access for complex or repeated queries.[66] Unlike standard views, which compute results dynamically from source tables, materialized views hold data directly, reducing execution time for joins, aggregations, or transformations.[67] They improve performance in reporting, dashboards, or analytical workloads by eliminating repeated computations and enabling indexing on the stored data.[67] Maintenance involves refreshing the view (fully, incrementally, or on demand) when underlying data changes, trading some overhead for faster queries.[66] This approach suits scenarios where source data is not optimally formatted or where queries are resource-intensive.[67]
Transactions and concurrency
ACID properties
ACID (Atomicity, Consistency, Isolation, Durability) refers to a set of four key properties that database transactions must satisfy to ensure reliable processing, particularly in the presence of errors, system failures, or concurrent operations. These properties guarantee that database operations maintain data integrity and validity even under challenging conditions.[68][69]
Atomicity ensures that a transaction is treated as an indivisible unit: either all operations within the transaction are successfully completed and applied, or none are applied at all. If any part of the transaction fails, the entire transaction is rolled back, leaving the database unchanged. This "all-or-nothing" property prevents partial updates that could leave the system in an inconsistent state. For example, in a bank fund transfer, both the debit from one account and the credit to another must occur together, or neither occurs.[69][70]
Consistency guarantees that a transaction brings the database from one valid state to another valid state, preserving all predefined rules, constraints, triggers, and data integrity requirements. The transaction must adhere to the database's integrity constraints, ensuring that only legal changes are applied. In the bank transfer example, consistency ensures that the total sum of funds across accounts remains unchanged before and after the transaction.[68][70]
Isolation ensures that concurrent transactions do not interfere with each other. Each transaction executes as if it were the only one running, even when multiple transactions occur simultaneously. Intermediate states of a transaction are not visible to others, making concurrent transactions appear to execute serially. In the fund transfer scenario, another transaction would see funds either fully in the source account or fully in the destination account, but never in an intermediate state.[68][69]
Durability guarantees that once a transaction is committed, its changes are permanently saved and survive subsequent system failures, such as crashes or power outages. Committed data is persisted to non-volatile storage, typically using techniques like write-ahead logging, so that changes are recoverable upon restart. In the transfer example, once committed, the updated account balances persist even if the system fails immediately afterward.[68][70]
These properties, pioneered by Jim Gray in his foundational work on transaction processing and later named with the ACID acronym (coined by Theo Härder and Andreas Reuter in 1983), collectively ensure that database transactions remain reliable, protecting against data corruption and loss in mission-critical applications.[68]
Concurrency control mechanisms
Concurrency control mechanisms enable multiple transactions to access shared data simultaneously while preserving consistency and preventing interference that could violate isolation guarantees. The primary approaches are pessimistic concurrency control, which assumes conflicts are likely and prevents them proactively, and optimistic concurrency control, which assumes conflicts are rare and detects them reactively.
Pessimistic concurrency control relies on locking: transactions acquire locks on data items before accessing them. Shared locks (read locks) allow multiple transactions to read but not modify the data, while exclusive locks (write locks) grant sole access for modification. Two-phase locking (2PL) is a common protocol, requiring transactions to acquire all locks before releasing any (growing phase followed by shrinking phase), ensuring conflict-serializable schedules. Strict 2PL, a common variant, holds a transaction's exclusive locks until commit or abort, preventing cascading aborts; rigorous 2PL holds all locks, shared and exclusive, until then.[71]
Multi-version concurrency control (MVCC), often used in optimistic approaches, maintains multiple versions of data items. Each transaction reads from a consistent snapshot based on its start time or a global timestamp, avoiding many read-write conflicts. Writers create new versions without blocking readers, who access prior committed versions. This enhances concurrency in read-heavy workloads, though writers may still use locking or validation. Databases like PostgreSQL implement MVCC to support isolation levels with minimal blocking, using snapshots for consistency and predicate locking in stricter modes to detect dependencies without blocking.[72]
Isolation levels define the trade-off between concurrency and protection against anomalies like dirty reads (reading uncommitted data), non-repeatable reads (re-reading changed committed data), and phantom reads (seeing new rows from concurrent inserts). The ANSI/ISO SQL standard defines four levels:
- Read Uncommitted: Allows dirty reads; none of the three anomalies is prevented.
- Read Committed: Prevents dirty reads but allows non-repeatable reads and phantom reads.
- Repeatable Read: Prevents dirty reads and non-repeatable reads but allows phantom reads in the standard (though some implementations like PostgreSQL prevent them via MVCC).
- Serializable: Prevents all three anomalies and serialization anomalies, ensuring equivalent serial execution.
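The snapshot reads described for MVCC above can be sketched with a toy version store. This is a simplified model under stated assumptions: the `VersionedStore` class is hypothetical, every write commits immediately as its own transaction, and there is no garbage collection or write-conflict detection as a real engine would need.

```python
import itertools

class VersionedStore:
    """Toy multi-version store: each committed write adds a new version."""

    def __init__(self):
        self._clock = itertools.count(1)  # source of commit timestamps
        self._versions = {}               # key -> [(commit_ts, value), ...]
        self._last_commit = 0

    def begin(self):
        """Start a reader; its snapshot is the latest commit so far."""
        return self._last_commit

    def commit_write(self, key, value):
        """Commit a single-key write as a new version and return its timestamp."""
        ts = next(self._clock)
        self._versions.setdefault(key, []).append((ts, value))
        self._last_commit = ts
        return ts

    def read(self, key, snapshot_ts):
        """Return the newest value committed at or before snapshot_ts."""
        visible = [v for ts, v in self._versions.get(key, []) if ts <= snapshot_ts]
        return visible[-1] if visible else None

db = VersionedStore()
db.commit_write("balance", 100)           # version at timestamp 1
reader_snapshot = db.begin()              # reader sees the state as of 1
db.commit_write("balance", 70)            # a later writer adds version 2

first_read = db.read("balance", reader_snapshot)
second_read = db.read("balance", reader_snapshot)   # repeatable: still 100
fresh_read = db.read("balance", db.begin())         # a new snapshot sees 70

print(first_read, second_read, fresh_read)  # 100 100 70
```

The writer never waits for the reader and the reader never sees the newer version, which is why snapshot-based schemes avoid non-repeatable reads without read locks.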
Security
Access control and authentication
Access control in databases consists of authentication and authorization mechanisms to regulate who can connect to the system and what operations they may perform. Authentication verifies the identity of a user attempting to access the database, while authorization determines the specific privileges granted to that user after successful authentication.[75]
Common authentication methods include username and password combinations, where the database management system verifies credentials against stored values.[76] More advanced approaches integrate with external directory services, such as Active Directory for Windows environments or LDAP for cross-platform centralized management.[76] Single sign-on (SSO) protocols, such as SAML or OpenID Connect, enable users to authenticate once across multiple systems, reducing credential management overhead.[76] Many modern databases support or recommend multi-factor authentication (MFA), requiring additional verification factors like one-time codes or biometrics to strengthen security against credential compromise.[76]
Authorization in databases is most commonly implemented through role-based access control (RBAC), where privileges are assigned to roles rather than directly to individual users, and users are then assigned to those roles.[77] This approach simplifies management by allowing privilege inheritance through role hierarchies and supports scalable administration in large environments. Some systems also incorporate discretionary access control (DAC), in which the owner of an object can grant or revoke access to it.[77]
In relational databases adhering to SQL standards, authorization is managed using the GRANT and REVOKE statements. The GRANT command assigns specific privileges (such as SELECT, INSERT, UPDATE, or DELETE) on securable objects like tables, views, or schemas to users or roles. For example, GRANT SELECT ON table TO user allows the specified user to read from that table, while WITH GRANT OPTION permits the grantee to further delegate the privilege.[78] The REVOKE statement removes previously granted privileges, such as REVOKE SELECT ON table FROM user, ensuring fine-grained control over access.[79] Privileges can be managed at various scopes, and in many implementations, only object owners or users with elevated privileges (such as those holding CONTROL or MANAGE GRANTS) can execute these commands.[78][77]
Encryption and protection measures
Databases employ multiple layers of protection to safeguard sensitive data against unauthorized access, breaches, and misuse, complementing access control mechanisms by focusing on encryption, masking, auditing, and network-level defenses.
Encryption protects data at rest by converting stored database files and logs into an unreadable format that requires decryption keys for access, rendering the data unusable if storage media is compromised or stolen. Techniques such as Transparent Data Encryption (TDE) encrypt entire database files at the storage level with minimal performance overhead, while column-level encryption targets specific sensitive columns using symmetric keys.[80][81][82]
Encryption in transit secures data during transmission between clients and the database server using protocols like Transport Layer Security (TLS), ensuring that intercepted communications remain confidential and tamper-resistant. This typically requires TLS 1.2 or higher for connections, preventing eavesdropping or man-in-the-middle attacks over networks.[80][81]
Data masking obscures sensitive information without altering the underlying data structure, enabling safe use in non-production environments such as testing or development. Techniques include substitution, redaction, or dynamic masking that hides data from non-privileged users during queries, reducing exposure risks while maintaining functional usability of datasets.[81][82][80]
Auditing monitors and logs database activities, including user actions, queries, and access attempts, to detect anomalies, support compliance, and enable forensic analysis. Audit logs capture events in real time or are stored centrally for review, with configurable policies for alerting on suspicious behavior such as unauthorized modifications.[83][82][80]
Database firewalls filter and monitor traffic to the database, blocking malicious SQL statements like injection attempts and enforcing policies on allowed queries or connections. These firewalls operate at the network or application layer, inspecting SQL traffic in real time to prevent exploits and unauthorized commands while logging violations for further investigation.[82][81][83]
Design and modeling
Data modeling approaches
Data modeling approaches provide structured methodologies for representing the structure, relationships, and constraints of data during the conceptual and logical design phases of database development. These approaches help translate real-world requirements into organized data structures while promoting data integrity, minimizing redundancy, and facilitating future modifications.
The Entity-Relationship (ER) model, introduced by Peter Chen in 1976, is a foundational conceptual data modeling technique. It represents real-world objects as entities, their properties as attributes, and associations among them as relationships, incorporating semantic information to create a unified view of data independent of implementation details. This model supports high-level design by capturing key semantic aspects of the domain, making it widely used for initial database conceptualization.[84][85]
The Unified Modeling Language (UML), particularly through extensions to class diagrams, offers another approach for data modeling, often applied to relational databases. In UML data modeling profiles, classes represent tables, attributes represent columns, and stereotypes such as <<PK>> and <<FK>> mark primary-key and foreign-key columns.
Normalization refines a relational schema through a sequence of normal forms, each building on the previous one:
- First normal form (1NF) requires that every attribute hold a single atomic value, with no repeating groups.
- Second normal form (2NF) builds on 1NF by removing partial dependencies, where non-key attributes depend on only part of a composite primary key.
- Third normal form (3NF) eliminates transitive dependencies, ensuring non-key attributes depend only on the primary key.
- Boyce-Codd normal form (BCNF) strengthens 3NF by requiring that every determinant (attribute on which another depends) is a candidate key.
- Fourth normal form (4NF) resolves multivalued dependencies, preventing independent multiple associations in a relation.
- Fifth normal form (5NF) addresses join dependencies, ensuring lossless decomposition and no spurious tuples after joins.
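The removal of a transitive dependency, as in the step to 3NF above, can be illustrated with plain Python data. In this sketch the relation and its rows are invented: `customer_city` depends on `customer_id`, which in turn depends on the key `order_id`, so the city is repeated for every order by the same customer.

```python
# Flat relation with a transitive dependency:
# order_id -> customer_id -> customer_city
orders_flat = [
    # (order_id, customer_id, customer_city)
    (10, 1, "New York"),
    (11, 1, "New York"),
    (12, 2, "Boston"),
]

# Decompose into two relations so each city is stored once per customer.
orders = sorted({(order_id, customer_id) for order_id, customer_id, _ in orders_flat})
customers = sorted({(customer_id, city) for _, customer_id, city in orders_flat})

# Lossless check: joining on customer_id rebuilds the original rows exactly,
# with no missing and no spurious tuples.
city_of = dict(customers)
rejoined = [(order_id, customer_id, city_of[customer_id])
            for order_id, customer_id in orders]

print(orders)     # [(10, 1), (11, 1), (12, 2)]
print(customers)  # [(1, 'New York'), (2, 'Boston')]
print(rejoined == sorted(orders_flat))  # True
```

After the decomposition, changing a customer's city touches one row in `customers` instead of every matching order, which removes the update anomaly that the flat layout allowed.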
