Data store
A data store is a repository for persistently storing and managing collections of data, a term that covers not only repositories such as databases but also simpler store types such as simple files and emails.[1]
A database is a collection of data that is managed by a database management system (DBMS), though the term can sometimes refer more generally to any collection of data that is stored and accessed electronically. A file is a series of bytes that is managed by a file system. Thus, any database or file is a series of bytes that, once stored, constitutes a data store.
MATLAB[2] and cloud storage systems such as VMware[3] and Firefox OS[4] use datastore as a term for abstracting collections of data inside their respective applications.
Types
The term data store can refer to a broad class of storage systems, including:
- Paper files
- Simple files like a spreadsheet
- File systems
- Email storage systems (both server and client systems)
- Databases
- Relational databases, based on the relational model of data
- Object-oriented databases, which can store the objects of an object-oriented design
- NoSQL databases
- Distributed data stores
- Directory services
- VMware uses the term datastore to refer to the storage location for a virtual machine's files[5]
References
[edit]- ^ "Glossary D: data store". Information Management. Archived from the original on 2013-01-14. Retrieved 2011-04-04.
A place where data is stored; data at rest. A generic term that includes databases and flat files.
- ^ "Datastore - MATLAB & Simulink". in.mathworks.com. Retrieved 2016-01-11.[permanent dead link]
- ^ VMware (2016-01-11). "Managed Object - Datastore". VMware. Archived from the original on 2016-03-04. Retrieved 2016-01-11.
- ^ "Data Store API". Mozilla Developer Network. Archived from the original on 2014-12-23. Retrieved 2016-01-11.
- ^ "Managed Object Description". Pubs.vmware.com. Archived from the original on 2019-07-08. Retrieved 2019-07-02.
Data store
Fundamentals
Definition and Scope
A data store is a repository for persistently storing, retrieving, and managing collections of data in structured or unstructured formats.[1] It functions as a digital storehouse that retains data across system restarts or power interruptions, contrasting with transient storage like RAM, which loses information upon shutdown.[5] This persistence ensures data availability for ongoing operations in computing environments.[6] The scope of data stores extends beyond simple hardware to managed collections, encompassing databases, file systems, object stores, and archives such as email systems.[7] These systems organize raw bytes into logical units like records, files, or objects to facilitate efficient access and manipulation.[8] For example, MATLAB's datastore offers an abstract interface for treating large, distributed datasets (spanning disks, remote locations, or databases) as a single, cohesive entity.[9] In information systems, data stores play a central role by enabling the preservation and utilization of data sets for organizational purposes, including analysis and decision-making.[7] They include diverse forms, such as relational and non-relational variants, to accommodate varying data management requirements.[6]

Key Characteristics
Data stores are designed to ensure durability, which refers to the ability to preserve data integrity and availability even in the face of hardware failures, power outages, or other disruptions. This is typically achieved through mechanisms such as data replication, where copies of data are maintained across multiple storage nodes to prevent loss, and regular backups that create point-in-time snapshots for recovery. For instance, replication can be synchronous or asynchronous, ensuring that data remains intact and recoverable without corruption.[10][11]

Scalability is a core attribute allowing data stores to handle growing volumes of data and user demands efficiently. Vertical scaling involves upgrading the resources of a single server, such as adding more CPU or memory, to improve capacity, while horizontal scaling distributes the load across multiple nodes, often using techniques like sharding to partition data into subsets stored on different servers. Sharding enhances horizontal scalability by enabling linear growth in storage and processing power as shards are added.[12][13]

Accessibility in data stores is facilitated through support for fundamental CRUD operations (Create, Read, Update, and Delete), which allow users or applications to interact with stored data programmatically. These operations are exposed via APIs, such as RESTful interfaces, or query languages like SQL, enabling seamless data manipulation from remote or local clients. This design ensures that data can be retrieved, modified, or inserted reliably across distributed environments.[14][1]

Security features are integral to protecting data from unauthorized access and breaches. Encryption at rest safeguards stored data by rendering it unreadable without decryption keys, while encryption in transit protects data during transmission over networks using protocols like TLS. Access controls, such as role-based access control (RBAC), limit permissions to authorized users, and auditing mechanisms log all data interactions to detect and investigate potential violations.[15][16]

Performance in data stores is evaluated through metrics like latency, which measures the time to respond to requests, and throughput, which indicates the volume of operations processed per unit time. These are influenced by consistency models, where strong consistency ensures all reads reflect the most recent writes across replicas, providing immediate accuracy but potentially at the cost of availability. In contrast, eventual consistency allows temporary discrepancies, with replicas converging over time, often prioritizing higher throughput in distributed systems. The CAP theorem formalizes trade-offs in distributed data stores, stating that only two of three properties (consistency, availability, and partition tolerance) can be guaranteed simultaneously during network partitions.[1][17][18]
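As a minimal illustration of the CRUD operations described above, the following SQL sketch walks through creating, reading, updating, and deleting a single record. The user_profiles table and its columns are hypothetical names chosen only for this example, not drawn from any particular system.

-- Hypothetical table used only for this illustration
CREATE TABLE user_profiles (
    id    INTEGER PRIMARY KEY,
    name  VARCHAR(100),
    email VARCHAR(255)
);

-- Create: insert a new record
INSERT INTO user_profiles (id, name, email) VALUES (1, 'Ada', 'ada@example.com');

-- Read: retrieve the record
SELECT name, email FROM user_profiles WHERE id = 1;

-- Update: modify the stored record
UPDATE user_profiles SET email = 'ada@example.org' WHERE id = 1;

-- Delete: remove the record
DELETE FROM user_profiles WHERE id = 1;

Whether such statements run against a single server or a replicated, sharded cluster, the data store is responsible for applying them durably under its chosen consistency model.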
Historical Development
Origins in Computing
The concept of organized data storage predates digital computing, with manual ledgers and filing systems serving as foundational analogs for structuring and retrieving information. In ancient Mesopotamia around 4000 BCE, clay tablets were used for recording transactions, evolving into paper-based ledgers during the Renaissance, where double-entry bookkeeping, formalized by Luca Pacioli in 1494, enabled systematic tracking of financial data.[19] By the 19th and early 20th centuries, filing cabinets emerged as a key infrastructure for document management in offices and bureaucracies, allowing hierarchical organization of records by category or date to facilitate access and maintenance.[20]

The advent of electronic computers in the 1940s introduced the first digital mechanisms for data persistence, building on these analog precedents. The ENIAC, completed in 1945, relied on punch cards for input and limited internal storage via vacuum tubes and function tables, marking an initial shift from manual to machine-readable data handling.[21] In the early 1950s, the UNIVAC I, delivered in 1951, advanced this further by incorporating magnetic tapes as a primary storage medium, enabling sequential data access at speeds far exceeding punch cards and supporting commercial data processing for the U.S. Census Bureau.[22] These tapes, 0.5 inches wide and coated with iron oxide, stored up to 2 million characters per reel, replacing bulky card stacks and laying groundwork for scalable data retention.[23]

By the 1960s, operating systems began integrating structured file management, with Multics, initiated in 1965 by MIT, Bell Labs, and General Electric, pioneering the first hierarchical file system. This tree-like structure organized files into directories of unlimited depth, allowing users to navigate data via paths rather than flat lists, influencing subsequent systems like Unix.[24] Concurrently, Charles Bachman's Integrated Data Store (IDS), developed at General Electric starting in 1960, represented one of the earliest database models, employing a navigational approach with linked records for direct-access storage on disk, which earned Bachman the 1973 Turing Award for its innovations in data management.[25] Key milestones included IBM's Information Management System (IMS) in 1968, a hierarchical database designed for the Apollo program, which structured data as parent-child trees to handle complex relationships efficiently on System/360 mainframes.[26] The CODASYL Data Base Task Group, formed in the late 1960s, further standardized network databases through its 1971 report, extending Bachman's IDS concepts to allow many-to-many record linkages via pointers, promoting interoperability across systems.[27] These developments set the stage for the relational model introduced in the 1970s.

Evolution to Modern Systems
The evolution of data stores from the 1970s marked a shift toward structured, scalable systems driven by the need for efficient data management in growing computational environments. In 1970, E.F. Codd introduced the relational model in his seminal paper, proposing a data structure based on relations (tables) with keys to ensure integrity and enable declarative querying, which laid the foundation for modern relational database management systems (RDBMS). This model addressed limitations of earlier hierarchical and network models by emphasizing data independence and normalization. By 1974, IBM researchers Donald D. Chamberlin and Raymond F. Boyce developed SEQUEL (later SQL), a structured English query language for accessing relational data, which became the standard for database interactions. The commercial viability of these innovations emerged in 1979 with the release of Oracle, the first commercially available SQL-based RDBMS, enabling widespread adoption in enterprise settings.

The 1980s and 1990s saw data stores adapt to distributed computing and analytical needs, transitioning from mainframe-centric systems to more flexible architectures. The rise of personal computers spurred client-server architectures in the 1980s, where database servers handled storage and processing while clients managed user interfaces, improving scalability and accessibility over monolithic systems.[28] Concurrently, object-oriented database management systems (OODBMS) emerged in the late 1980s to bridge relational rigidity with object-oriented programming paradigms, supporting complex data types like multimedia and hierarchies directly in the database, as exemplified by systems like GemStone. Into the 1990s, data warehousing gained prominence with the introduction of online analytical processing (OLAP) by E.F. Codd in 1993, enabling multidimensional data analysis for business intelligence through cube structures and aggregation, which complemented transactional OLTP systems.[29]

The 2000s ushered in the big data era, propelled by internet-scale applications and the limitations of traditional RDBMS in handling volume, velocity, and variety. In 2006, Google published the Bigtable paper, describing a distributed, scalable NoSQL storage system built on columnar data for managing petabyte-scale datasets across commodity hardware. That same year, Amazon introduced Dynamo, a highly available key-value store emphasizing eventual consistency and fault tolerance for e-commerce workloads, influencing subsequent distributed systems. Also in 2006, the Apache Hadoop framework was released, providing an open-source implementation of MapReduce for parallel processing and HDFS for fault-tolerant storage, democratizing big data handling beyond proprietary solutions. Complementing these, Amazon Simple Storage Service (S3) launched in 2006 as a cloud-native object store, offering durable, scalable storage for unstructured data without managing infrastructure.

From the 2010s to the 2020s, data stores evolved toward cloud-native, polyglot, and AI-integrated designs to meet demands for elasticity, versatility, and intelligence. Serverless architectures gained traction in the mid-2010s, with offerings like Amazon Aurora Serverless in 2017 automating scaling and provisioning for relational workloads, reducing operational overhead in dynamic environments.
Multi-model databases emerged around 2012, supporting diverse models (e.g., relational, document, graph) within a unified backend to simplify polyglot persistence, as surveyed in works on handling data variety.[30] In the 2020s, integration with AI and machine learning accelerated, particularly through vector databases optimized for similarity search on embeddings, rising post-2020 to power generative AI applications like retrieval-augmented generation. As of 2025, advancements include enhanced security features in cloud data platforms.

Classification and Types
Relational and SQL-Based Stores
Relational data stores, also known as relational database management systems (RDBMS), organize data into structured tables consisting of rows (tuples) and columns (attributes), where each row represents an entity and columns define its properties. This tabular model, introduced by Edgar F. Codd in 1970, allows for the representation of complex relationships between data entities through the use of keys. A primary key uniquely identifies each row in a table, while a foreign key in one table references the primary key in another, establishing links that maintain referential integrity across the database.[31]

To minimize data redundancy and ensure consistency, relational stores employ normalization, a process that structures data according to specific normal forms. First Normal Form (1NF) requires that all attributes contain atomic values, eliminating repeating groups and ensuring each table row is unique. Second Normal Form (2NF) builds on 1NF by removing partial dependencies, so that non-key attributes depend on the entire primary key rather than only part of it. Third Normal Form (3NF) further eliminates transitive dependencies, ensuring non-key attributes depend solely on the primary key and not on other non-key attributes. These forms, formalized by Codd in 1972, reduce anomalies during data operations like insertions, updates, or deletions.[32]

The primary query language for relational stores is Structured Query Language (SQL), a declarative language developed by IBM researchers in the 1970s for the System R prototype and standardized by ANSI in 1986. SQL enables users to retrieve and manipulate data without specifying how to perform operations. For example, a basic SELECT statement retrieves specific columns from a table:

SELECT column1, column2
FROM table_name
WHERE condition;
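The inner join example below combines data from two related tables. As context, here is a minimal sketch of how those tables might be defined, reusing the table and column names already referenced in the query; the column types and constraints are illustrative assumptions.

-- Parent table: each customer is identified by a primary key
CREATE TABLE customers (
    id   INTEGER PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);

-- Child table: each order references its customer through a foreign key,
-- keeping customer details in a single place in line with normalization
CREATE TABLE orders (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    amount      DECIMAL(10, 2)
);

With these definitions in place, the join below pairs each order's amount with the matching customer's name by equating orders.customer_id with customers.id: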
SELECT customers.name, orders.amount
FROM customers
INNER JOIN orders ON customers.id = orders.customer_id;
Aggregate functions such as COUNT, combined with GROUP BY, summarize rows that share a value; for example, counting employees per department:

SELECT department, COUNT(*) as employee_count
FROM employees
GROUP BY department;
