Data architecture
Data architecture consists of models, policies, rules, and standards that govern which data is collected and how it is stored, arranged, integrated, and put to use in data systems and in organizations.[1] Data is usually one of several architecture domains that form the pillars of an enterprise architecture or solution architecture.[2]
Overview
A data architecture aims to set data standards for all its data systems as a vision or a model of the eventual interactions between those data systems. Data integration, for example, should be dependent upon data architecture standards since data integration requires data interactions between two or more data systems. A data architecture, in part, describes the data structures used by a business and its computer applications software. Data architectures address data in storage, data in use, and data in motion; descriptions of data stores, data groups, and data items; and mappings of those data artifacts to data qualities, applications, locations, etc.
Essential to realizing the target state, data architecture describes how data is processed, stored, and used in an information system. It provides criteria for data processing operations to make it possible to design data flows and also control the flow of data in the system.
The data architect is typically responsible for defining the target state, aligning during development and then following up to ensure enhancements are done in the spirit of the original blueprint.
During the definition of the target state, the data architecture breaks a subject down to the atomic level and then builds it back up to the desired form. The data architect breaks the subject down by going through three traditional architectural stages:
- Conceptual - represents all business entities.
- Logical - represents the logic of how entities are related.
- Physical - the realization of the data mechanisms for a specific type of functionality.
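The three stages can be illustrated with a small sketch. Everything named here (the `Customer` and `Order` entities, their attributes, and the choice of SQLite for the physical stage) is a hypothetical example rather than part of any standard; it only shows how the same subject gains detail at each stage.

```python
import sqlite3
from dataclasses import dataclass

# Conceptual stage: business entities only, no attributes or technology.
CONCEPTUAL_ENTITIES = {"Customer", "Order"}


# Logical stage: attributes, keys, and relationships, independent of any DBMS.
@dataclass(frozen=True)
class Customer:
    customer_id: int  # primary key
    name: str


@dataclass(frozen=True)
class Order:
    order_id: int     # primary key
    customer_id: int  # foreign key -> Customer.customer_id
    total_cents: int


# Physical stage: the realization in one specific technology (SQLite here).
PHYSICAL_DDL = """
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE customer_order (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    total_cents INTEGER NOT NULL
);
CREATE INDEX idx_order_customer ON customer_order(customer_id);
"""


def build_database() -> sqlite3.Connection:
    """Create an in-memory database from the physical model."""
    conn = sqlite3.connect(":memory:")
    # SQLite enforces foreign keys only when this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(PHYSICAL_DDL)
    return conn


conn = build_database()
conn.execute("INSERT INTO customer VALUES (1, 'Ada')")
conn.execute("INSERT INTO customer_order VALUES (10, 1, 2500)")
tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
```

With foreign keys enabled, inserting an order for a customer that does not exist raises `sqlite3.IntegrityError`, so the relationship defined at the logical stage is enforced by the physical one.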
The "data" column of the Zachman Framework for enterprise architecture:
| Layer | View | Data (what) | Stakeholder |
|---|---|---|---|
| 1 | Scope/contextual | List of things and architectural standards[3] important to the business | Planner |
| 2 | Business model/conceptual | Semantic model or conceptual/enterprise data model | Owner |
| 3 | System model/logical | Enterprise/logical data model | Designer |
| 4 | Technology model/physical | Physical data model | Builder |
| 5 | Detailed representations | Actual databases | Developer |
In a broader sense, data architecture includes a complete analysis of the relationships among an organization's functions, available technologies, and data types.
Data architecture should be defined in the planning phase of the design of a new data processing and storage system. The major types and sources of data necessary to support an enterprise should be identified in a manner that is complete, consistent, and understandable. The primary requirement at this stage is to define all of the relevant data entities, not to specify computer hardware items. A data entity is any real or abstract thing about which an organization or individual wishes to store data.
Physical data architecture
Physical data architecture of an information system is part of a technology plan. The technology plan is focused on the actual tangible elements to be used in the implementation of the data architecture design. Physical data architecture encompasses database architecture. Database architecture is a schema of the actual database technology that would support the designed data architecture.
Elements of data architecture
Certain elements must be defined during the design phase of the data architecture schema. For example, an administrative structure that is to be established in order to manage the data resources must be described. Also, the methodologies that are to be employed to store the data must be defined. In addition, a description of the database technology to be employed must be generated, as well as a description of the processes that are to manipulate the data. It is also important to design interfaces to the data by other systems, as well as a design for the infrastructure that is to support common data operations (e.g. emergency procedures, data imports, data backups, external transfers of data).
Without the guidance of a properly implemented data architecture design, common data operations might be implemented in different ways, rendering it difficult to understand and control the flow of data within such systems. This sort of fragmentation is undesirable due to the potential increased cost and the data disconnects involved. These sorts of difficulties may be encountered with rapidly growing enterprises and also enterprises that service different lines of business.
Properly executed, the data architecture phase of information system planning forces an organization to specify and describe both internal and external information flows. These are patterns that the organization may not have previously taken the time to conceptualize. It is therefore possible at this stage to identify costly information shortfalls, disconnects between departments, and disconnects between organizational systems that may not have been evident before the data architecture analysis.[4]
Constraints and influences
Various constraints and influences will have an effect on data architecture design. These include enterprise requirements, technology drivers, economics, business policies and data processing needs.
- Enterprise requirements
- These generally include such elements as economical and effective system expansion, acceptable performance levels (especially system access speed), transaction reliability, and transparent data management. In addition, the conversion of raw data such as transaction records and image files into more useful information forms through such features as data warehouses is also a common organizational requirement, since this enables managerial decision making and other organizational processes. One of the architecture techniques is the split between managing transaction data and (master) reference data. Another is splitting data capture systems from data retrieval systems (as done in a data warehouse).
- Technology drivers
- These are usually suggested by the completed data architecture and database architecture designs. In addition, some technology drivers will derive from existing organizational integration frameworks and standards, organizational economics, and existing site resources (e.g. previously purchased software licensing). In many cases, the integration of multiple legacy systems requires the use of data virtualization technologies.
- Economics
- These are also important factors that must be considered during the data architecture phase. It is possible that some solutions, while optimal in principle, may not be potential candidates due to their cost. External factors such as the business cycle, interest rates, market conditions, and legal considerations could all have an effect on decisions relevant to data architecture.
- Business policies
- Business policies that also drive data architecture design include internal organizational policies, rules of regulatory bodies, professional standards, and governmental laws that can vary by agency. These policies and rules describe the manner in which the enterprise wishes to process its data.
- Data processing needs
- These include accurate and reproducible transactions performed in high volumes, data warehousing for the support of management information systems (and potential data mining), repetitive periodic reporting, ad hoc reporting, and support of various organizational initiatives as required (e.g. annual budgets, new product development).
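Two of the techniques named under enterprise requirements, separating transaction data from master (reference) data and separating data capture from data retrieval, can be sketched in a few lines. The `Product` and `Sale` types, the sample records, and the `build_report` function are all hypothetical illustrations, not a prescribed design.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Product:
    """Master (reference) data: stable and shared across systems."""
    sku: str
    name: str


@dataclass(frozen=True)
class Sale:
    """Transaction data: one record per business event."""
    sale_id: int
    sku: str       # refers to master data by key instead of copying it
    quantity: int


# Master data is maintained once and looked up by key.
PRODUCTS = {p.sku: p for p in [Product("A1", "Widget"), Product("B2", "Gadget")]}

# Capture side: transactions are appended as they happen.
SALES = [Sale(1, "A1", 2), Sale(2, "B2", 1), Sale(3, "A1", 5)]


def build_report(sales, products):
    """Retrieval side: aggregate captured transactions into a summary."""
    totals = defaultdict(int)
    for sale in sales:
        totals[products[sale.sku].name] += sale.quantity
    return dict(totals)


report = build_report(SALES, PRODUCTS)
```

Because each sale stores only the product key, renaming a product is a single change to the master record, and the reporting side can be rebuilt from the captured transactions at any time.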
See also
- Controlled vocabulary
- Data mesh, a domain-oriented data architecture
- Disparate system
- Enterprise information security architecture (EISA) – positions data security in the enterprise information framework.
- FDIC Enterprise Architecture Framework
- Information silo
- TOGAF
References
- Business Dictionary - Data Architecture, archived 2013-03-30 at the Wayback Machine; TOGAF 9.1 - Phase C: Information Systems Architectures - Data Architecture
- What is data architecture, GeekInterview, 2008-01-28, accessed 2011-04-28
- Data Architecture Standards
- Mittal, Prashant (2009). Author. pg 256: Global India Publications. p. 314. ISBN 978-93-8022-820-4.
Further reading
- Bass, L.; John, B.; & Kates, J. (2001). Achieving Usability Through Software Architecture, Carnegie Mellon University.
- Lewis, G.; Comella-Dorda, S.; Place, P.; Plakosh, D.; & Seacord, R., (2001). Enterprise Information System Data Architecture Guide Carnegie Mellon University.
- Adleman, S.; Moss, L.; Abai, M. (2005). Data Strategy Addison-Wesley Professional.
External links
- Achieving Usability Through Software Architecture, sei.cmu.edu 2001
- The Logical Data Architecture, by Nirmal Baid
- Building a modern data and analytics architecture
- The “Right to Repair” Data Architecture with DataOps, the DataOps Blog
- TOGAF 9: Preparation Process
Data architecture
Fundamentals
Definition and Scope
Data architecture is the practice of designing, creating, deploying, and managing an organization's data assets to meet current and future business requirements, encompassing the structures, models, processes, and standards that govern data storage, access, integration, and utilization.[6] It provides a blueprint for how data is collected, organized, processed, and consumed to support operational efficiency and strategic objectives.[7] This discipline ensures that data flows seamlessly across systems while maintaining quality, security, and compliance.[1]
The scope of data architecture extends across the entire data lifecycle, from initial creation and collection through processing, storage, usage, and eventual archival or disposal, distinguishing it from narrower fields like database design, which focuses primarily on the implementation details of specific data storage solutions such as schema creation and query optimization.[8][9] Unlike enterprise architecture, which addresses the broader integration of IT systems, applications, and business processes, data architecture specifically targets the data layer to align with organizational goals without encompassing non-data elements like hardware infrastructure or application logic.[10][11]
Central to data architecture is the recognition of data as a strategic asset, treated with the same rigor as financial or physical resources to maximize its value and minimize risks.[12] It emphasizes alignment with business strategy, ensuring that data practices enable advanced analytics, informed decision-making, and competitive advantage by providing reliable, accessible information for stakeholders.[4]
Key foundational terms include data domains, which categorize information by business function; for instance, master data refers to core, stable entities such as customers or products that provide context for operations, while transactional data captures dynamic records of business events like orders or payments.[13] These concepts lay the groundwork for higher-level architectural approaches, including conceptual, logical, and physical views of data.[6]
Historical Development
The development of data architecture began in the 1960s with the advent of mainframe computing, where early database systems focused on hierarchical structures to manage complex data for large-scale projects. IBM's Information Management System (IMS), initially designed in 1966 as part of the Apollo space program in collaboration with NASA, represented a pivotal milestone as one of the first hierarchical database management systems, organizing data in tree-like parent-child relationships to support transaction processing.[14] Released in 1968 and renamed IMS/360 in 1969, it enabled efficient navigation of structured data but was tightly coupled to application programs, limiting flexibility.[15]
The 1970s marked a paradigm shift from hierarchical and file-based systems to the relational model, fundamentally altering data organization and access. In 1970, Edgar F. Codd, an IBM researcher, published "A Relational Model of Data for Large Shared Data Banks" in Communications of the ACM, introducing tables (relations) connected via keys, relational algebra for operations, and normalization techniques to minimize redundancy and ensure data integrity.[16] This model decoupled data from applications, promoting independence and scalability. Mid-decade, the ANSI/SPARC committee formalized the three-schema architecture in 1975, proposing external (user views), conceptual (logical structure), and internal (physical storage) levels to further enhance data abstraction and portability across systems. By 1985, Codd expanded on relational principles with his 12 rules (including a zeroth rule on foundational support for the relational model), outlined in a Computerworld article, which became benchmarks for evaluating relational database management systems (RDBMS) and drove industry standardization.[17]
The 1980s saw extensions to the relational paradigm with object-oriented approaches, addressing limitations in handling complex, non-tabular data. Object-oriented database management systems (OODBMS) emerged in the mid-1980s, integrating object-oriented programming concepts like encapsulation and inheritance directly into data storage, as seen in early systems like GemStone (started in 1982) and applications in computer-aided design (CAD).[18] The 1990s shifted focus toward integrated analytics, with data warehousing becoming central; Bill Inmon's 1992 book "Building the Data Warehouse" defined it as a subject-oriented, integrated, time-variant, and non-volatile repository for decision support, influencing enterprise architectures for business intelligence.[19]
Entering the 2000s, data architecture evolved to accommodate unstructured and massive-scale data through distributed paradigms, moving beyond centralized relational systems. The rise of XML, standardized by the W3C in 1998, facilitated interoperable data exchange with its extensible markup for semi-structured information. Complementing this, Tim Berners-Lee's 2001 Scientific American article envisioned the Semantic Web, layering RDF and ontologies atop XML to enable machine-interpretable data semantics for the evolving web. Concurrently, NoSQL databases and big data frameworks addressed scalability limits of traditional models; Hadoop, developed by Doug Cutting and Mike Cafarella and released as an Apache project in 2006, drew from Google's MapReduce and GFS papers to support fault-tolerant, distributed processing of petabyte-scale data across clusters.[20] These advancements transitioned architectures from rigid, hierarchical roots to flexible, cloud-native designs capable of handling diverse, high-volume data flows.
Importance and Applications
Data architecture plays a pivotal role in enabling organizations to leverage data as a strategic asset, fostering data-driven decision-making by providing structured access to reliable information across business units. This capability allows executives to base strategies on real-time insights rather than intuition, leading to more accurate forecasting and resource allocation. For instance, robust data architectures support operational efficiency by streamlining data flows and reducing processing times, which can accelerate time-to-market by 30% through modular designs.[21] Additionally, it ensures compliance with regulations such as the General Data Protection Regulation (GDPR), applicable since 2018, by incorporating governance frameworks, data masking, and audit trails to protect sensitive information and mitigate legal risks. Furthermore, scalable data architectures accommodate organizational growth by handling increasing data volumes via elastic cloud-based platforms, enabling seamless expansion without proportional cost increases.[22][21]
In business applications, data architecture underpins key functions like customer relationship management (CRM), where integrated data platforms enable real-time personalization, such as targeted offers based on customer behavior, improving engagement and retention. It also optimizes supply chains by integrating sensor data for predictive maintenance, reducing downtime and enhancing logistics efficiency in industries like manufacturing and retail. For financial reporting, standardized data models ensure accurate, timely consolidation of transactions, supporting regulatory filings and internal audits. A notable example is retail analytics, where data architectures power personalized marketing campaigns; companies like Amazon utilize recommendation engines built on collaborative filtering to drive sales, contributing to significant revenue growth through hyper-personalized suggestions.[23][24]
Poorly designed data architecture often results in data silos, where isolated systems hinder collaboration and lead to inefficiencies; studies indicate that data users can spend 30-40% of their time searching for data due to fragmented inventories, effectively reducing overall productivity. Effective architectures counteract this by promoting data integration, yielding ROI through reduced redundancy (potentially saving millions in storage costs) and faster query times that enable quicker insights, with some organizations reporting deployment reductions from months to days. Overall, these improvements can generate substantial value, such as up to $500 million in annual benefits for large banks through enhanced analytics capabilities.[25][23]
Across industries, data architecture delivers transformative applications. In healthcare, it facilitates electronic health record (EHR) integration, allowing seamless data exchange between systems to improve patient outcomes and operational efficiency, with integrated EHRs potentially adding 10-20% to contribution margins per hospital bed through better resource utilization. In finance, it supports risk modeling by providing standardized platforms for aggregating diverse data sources, enabling compliance with standards like BCBS 239 and reducing implementation costs by 20% via flexible architectures. In e-commerce, recommendation engines rely on scalable data architectures to process vast customer interaction datasets, powering personalized experiences that boost conversion rates and customer satisfaction, as demonstrated by platforms handling real-time analytics for dynamic suggestions.[26][21][24]
Architectural Levels
Conceptual Data Architecture
Conceptual data architecture represents the highest level of abstraction in data modeling, providing a business-oriented framework that identifies and defines the essential data elements required to support organizational objectives, independent of any specific technology or implementation details. It emphasizes "what data is needed" to fulfill business requirements, such as capturing core concepts like entities and their interrelationships, rather than detailing storage mechanisms or processing methods. This approach ensures that data strategies align closely with enterprise goals, facilitating communication between business stakeholders and technical teams.[27][28]
At its core, conceptual data architecture relies on entity-relationship (ER) modeling conducted at a business level, as originally proposed by Peter Chen, to represent real-world objects of interest (termed entities) along with their attributes and associations. For instance, in a retail context, entities might include customer and product, with relationships defining how purchases link them, thereby modeling the semantic structure of business data without delving into technical specifications. The primary purpose is to establish a unified view of data that supports decision-making, process optimization, and strategic planning by abstracting away implementation complexities.[29][30]
Key artifacts in conceptual data architecture include conceptual data models, often visualized as ER diagrams that illustrate entities, attributes, and relationships in a simplified, high-level format. Complementary to these are business glossaries, which provide standardized definitions for data terms, and detailed data definitions that clarify the meaning and context of each element to prevent ambiguity across the organization. These artifacts serve as foundational references, enabling stakeholders to validate that the data scope adequately addresses business needs.[27][28]
The development process begins with requirements gathering from diverse stakeholders, including business analysts, domain experts, and executives, to elicit critical data needs through workshops, interviews, and use case analysis. This is followed by identifying key data entities, such as customer, product, or order, and mapping their relationships to ensure comprehensive coverage of business processes. Throughout, the focus remains on aligning the model with broader enterprise goals, such as improving operational efficiency or enabling analytics, while iterating based on feedback to refine the abstract representation.[28][31]
One major advantage of conceptual data architecture is its role as a blueprint that promotes data consistency across initiatives, reducing redundancy and misinterpretation in downstream designs. It also enhances scalability by establishing flexible structures that can adapt to evolving business demands without necessitating rework. Furthermore, by remaining technology-agnostic, it avoids vendor or platform lock-in, allowing organizations to select implementation options that best fit current and future needs. This conceptual framework transitions into logical data architecture by adding implementation-independent details like data types and normalization.[32][33][34]
Logical Data Architecture
Logical data architecture serves as the bridge between the conceptual and physical layers of data design, providing an implementation-independent blueprint that specifies data types, relationships, and business rules without reference to storage mechanisms or hardware. It translates high-level conceptual entities into detailed, structured representations suitable for relational or other data models, ensuring that the logical structure aligns with organizational needs while remaining vendor-neutral. This layer focuses on defining how data elements interconnect logically to support queries, transactions, and analysis, thereby facilitating consistent data usage across applications.[35][33]
Core elements of logical data architecture include logical data models, such as relational schemas comprising tables, primary and foreign keys, and constraints like cardinality and data types. These models organize data into relations where each table represents an entity with attributes, and keys enforce uniqueness and linkages between tables. Normalization processes are integral to refining these schemas, progressing from first normal form (1NF), which eliminates repeating groups by ensuring atomic values in each cell and unique records via primary keys, to second normal form (2NF), which removes partial dependencies by ensuring non-key attributes depend fully on the entire primary key. Further advancement to third normal form (3NF) eliminates transitive dependencies, so that non-key attributes depend only on the primary key and not on other non-key attributes, and Boyce-Codd normal form (BCNF) strengthens this by requiring every determinant to be a candidate key, thus minimizing redundancy and anomalies. These normalization steps, originally proposed by E.F. Codd, ensure relational integrity and efficiency in data representation.[36][37]
Key techniques in logical data architecture encompass data mapping to align source and target structures, integrity rules to maintain consistency, and abstract handling of data flows. Data mapping involves transforming conceptual elements, such as entities and relationships, into logical constructs like tables and joins, preserving semantics without physical details. Integrity rules, including referential integrity, require that each foreign key value in one table either references a valid primary key in another or is null, preventing orphaned records and ensuring relational consistency as defined in relational database principles. At a logical level, ETL (Extract, Transform, Load) processes outline data flows by specifying extraction from heterogeneous sources, logical transformations like aggregation or filtering, and loading into target models, modeled conceptually to support integration without implementation specifics.[38][39][40]
Practical examples illustrate these concepts: converting an entity-relationship (ER) diagram to relational tables might map a "Customer" entity with attributes like ID and Name to a table with a primary key on ID, while a one-to-many relationship to "Orders" creates a separate table with a foreign key referencing Customer ID. To address data quality issues like duplicates, unique identifiers such as composite keys or unique constraints are applied during normalization, ensuring each record's distinctiveness without relying on physical deduplication methods. These approaches build on conceptual entities by adding precise logical rules for robust data handling.[41]
Physical Data Architecture
Physical data architecture encompasses the tangible implementation of data storage, retrieval, and management using specific hardware, software, and network configurations to realize the logical data model in a deployable system. It focuses on translating abstract logical structures into concrete physical entities, such as tables, files, and indexes within a database management system (DBMS), with primary objectives of optimizing query performance, ensuring scalability for growing data volumes, and controlling operational costs through efficient resource allocation. This layer addresses how data is physically organized on storage media to minimize access times and maximize throughput while accommodating hardware constraints.[42][43]
Key aspects of physical data architecture include database design techniques like indexing and partitioning, which directly influence data access efficiency. Indexing strategies, such as clustered indexes that reorder physical data rows based on index keys or non-clustered indexes that maintain separate structures pointing to data locations, accelerate search operations by reducing the need for full table scans; for instance, a clustered index on a frequently queried column can improve range query performance by up to several orders of magnitude in relational databases. Partitioning divides large datasets into smaller, independent subsets, such as horizontal partitioning by row ranges or hash-based sharding, enabling parallel processing and easier maintenance, which is essential for handling terabyte-scale tables without proportional increases in query latency.
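The two horizontal partitioning schemes mentioned above, by row ranges and by hash, can be sketched as simple routing functions. This is a minimal illustration under stated assumptions: the function names, the `BOUNDS` list, and the use of SHA-256 as the stable hash are choices made for this example, not features of any particular DBMS.

```python
import bisect
import hashlib
from typing import Sequence


def range_partition(key: int, upper_bounds: Sequence[int]) -> int:
    """Return the index of the first partition whose upper bound exceeds key.

    Keys at or above the last bound fall into one extra overflow partition.
    """
    return bisect.bisect_right(upper_bounds, key)


def hash_partition(key: str, num_partitions: int) -> int:
    """Map a key to a partition using a stable hash.

    Python's built-in hash() is salted per process for strings, so a
    cryptographic digest is used here to keep placement reproducible.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


# Hypothetical bounds: partitions hold <1000, 1000-1999, 2000-2999, >=3000.
BOUNDS = [1000, 2000, 3000]

first = range_partition(999, BOUNDS)     # lands in partition 0
second = range_partition(1000, BOUNDS)   # lands in partition 1
overflow = range_partition(3500, BOUNDS) # lands in partition 3
shard = hash_partition("user-42", 4)     # some value in 0..3
```

Range partitioning keeps neighbouring keys together, which suits range scans but can concentrate load when new keys always fall in the last range; hash partitioning spreads keys more evenly at the cost of scattering range queries across partitions.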
Storage choices further tailor the architecture to data characteristics: structured data suits relational SQL databases like PostgreSQL with rigid schemas for ACID compliance, whereas NoSQL databases like MongoDB excel for unstructured data, storing documents in flexible BSON format to support variable schemas and high ingestion rates for sources like logs or multimedia.[44][45][46][47]
Implementation details extend to hardware and network considerations that underpin reliable data distribution and access. Solid-state drives (SSDs) outperform hard disk drives (HDDs) in database environments due to their lower read/write latencies (typically 40-100 microseconds versus milliseconds for HDDs) and higher IOPS (up to 200,000 for enterprise SSDs), making them preferable for random access patterns in transactional workloads despite higher per-gigabyte costs. Network topologies in distributed systems, such as fully connected mesh for low-latency inter-node communication or hierarchical star configurations for scalable data replication, determine how data shards are distributed across clusters to balance load and fault tolerance; for example, a mesh topology minimizes communication overhead in small-scale distributed databases but scales poorly beyond dozens of nodes. Query optimization techniques, including join algorithms like hash joins for equi-joins on large datasets or nested-loop joins for small result sets, are selected by the DBMS optimizer to minimize CPU and I/O costs, with hash joins typically running in time roughly linear in the combined size of both inputs by partitioning rows into buckets. The physical architecture draws from logical schemas to define these elements, ensuring alignment with intended data flows.[48][49][50][51]
Performance metrics in physical data architecture emphasize tuning for low latency (e.g., sub-millisecond query response times) and high throughput (e.g., millions of transactions per second), often measured via benchmarks like TPC-C for OLTP systems.
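The hash join mentioned above can be sketched in memory to show why it suits equi-joins: one input is hashed into buckets once, and each row of the other input then probes only its matching bucket. This is a simplified illustration, not how any specific DBMS implements the operator; the `hash_join` function and the sample rows are hypothetical.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Equi-join two lists of dicts by hashing one side and probing it."""
    # Build phase: group the (ideally smaller) input by its join key.
    buckets = defaultdict(list)
    for row in build_rows:
        buckets[row[build_key]].append(row)

    # Probe phase: each probe row looks up only its matching bucket.
    joined = []
    for row in probe_rows:
        for match in buckets.get(row[probe_key], []):
            joined.append({**match, **row})
    return joined


customers = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Grace"},
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 1},
    {"order_id": 12, "customer_id": 3},  # no matching customer, dropped
]

result = hash_join(customers, orders, "customer_id", "customer_id")
```

Each input is scanned once, whereas a naive nested-loop join compares every pair of rows; that difference is the main reason an optimizer tends to prefer hash joins when both inputs are large.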
Sharding exemplifies these optimizations in distributed setups: by horizontally partitioning data across nodes, such as range-based sharding on user IDs in a social media database, it enables parallel query execution, boosting throughput by factors of 10-100 while keeping per-shard latency stable, though it requires careful key selection to avoid hotspots. These metrics guide iterative refinements, such as index rebuilds or partition adjustments, to sustain scalability as data volumes grow.[52][53][54]
Centralized data architectures
Centralized data architectures consolidate data management, storage, integration, governance, and access into a single, unified platform or team-managed environment. This approach establishes a "single source of truth," ensuring high consistency, simplified governance, and easier compliance, which is particularly valuable in regulated industries like banking, healthcare, or government. Key architectural models and patterns that support a centralized data estate include:
- Traditional centralized (monolithic or hub-and-spoke model): A single central team manages the entire data lifecycle, from ingestion to serving. Data flows into one repository for unified modeling. Benefits include high control and reduced duplication; drawbacks include scalability bottlenecks.
- Data warehouse architecture: A structured central repository for ETL-processed data, creating enterprise-wide conformed entities. Supports consistent analytics with layers like staging, integration, and consumption.
- Centralized data lake: A scalable repository for raw data in native formats, managed centrally for broad exploration and governance.
- Data lakehouse architecture: Unifies data lake flexibility with warehouse reliability (ACID, schema enforcement) in a single platform, often using open formats like Delta Lake or Iceberg. Provides centralized management for diverse workloads including AI/ML.
- Centralized Master Data Management (MDM): Designates a hub as the system of record for master entities, with patterns like consolidation or transactional styles ensuring enterprise consistency.
- Layered or hub-based architectures: Use central hubs with layers (raw, curated, consumption) for aggregation and governed access.
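The layered pattern in the list above (raw, curated, consumption) can be sketched as a small in-memory pipeline. The record fields, the validation rules, and the function names `curate` and `consume` are assumptions made for illustration; a real platform would apply the same idea to files or tables rather than Python lists.

```python
from collections import defaultdict

# Raw layer: records kept exactly as received, including bad values.
RAW = [
    {"order_id": "10", "customer": " Ada ", "amount": "25.00"},
    {"order_id": "11", "customer": "Grace", "amount": "not-a-number"},
    {"order_id": "12", "customer": "Ada", "amount": "5.50"},
]


def curate(raw_records):
    """Curated layer: enforce types and trim text; set aside invalid rows."""
    curated, rejected = [], []
    for rec in raw_records:
        try:
            curated.append({
                "order_id": int(rec["order_id"]),
                "customer": rec["customer"].strip(),
                "amount": float(rec["amount"]),
            })
        except (KeyError, ValueError):
            rejected.append(rec)  # kept for inspection, not silently dropped
    return curated, rejected


def consume(curated_records):
    """Consumption layer: an aggregate shaped for reporting."""
    totals = defaultdict(float)
    for rec in curated_records:
        totals[rec["customer"]] += rec["amount"]
    return dict(totals)


curated, rejected = curate(RAW)
spend_by_customer = consume(curated)
```

Keeping the raw layer untouched means the curated and consumption layers can be rebuilt whenever validation rules change, which is one reason centrally governed platforms tend to separate these layers.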