Federated database system
A federated database system (FDBS) is a type of meta-database management system (DBMS) that transparently maps multiple autonomous database systems into a single federated database. The constituent databases are interconnected via a computer network and may be geographically decentralized. Because the constituent database systems remain autonomous, a federated database system is an alternative to the (sometimes daunting) task of merging several disparate databases. A federated database, or virtual database, is a composite of all constituent databases in a federated database system. Data federation does not physically integrate the data held in the constituent databases.
Through data abstraction, federated database systems can provide a uniform user interface, enabling users and clients to store and retrieve data from multiple noncontiguous databases with a single query, even if the constituent databases are heterogeneous. To this end, a federated database system must be able to decompose the query into subqueries for submission to the relevant constituent DBMSs, after which the system must combine the result sets of the subqueries. Because various database management systems employ different query languages, federated database systems can apply wrappers to the subqueries to translate them into the appropriate query languages.
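The decompose, translate and combine steps described above can be sketched as follows. This is a minimal illustration, not a real federation engine: two in-memory SQLite databases stand in for autonomous constituent DBMSs, and the table names, column names and the `federated_customers_in` function are all hypothetical.

```python
import sqlite3


def make_source(ddl, insert_sql, rows):
    """Create an in-memory database that plays the role of one constituent DBMS."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    return conn


# Hypothetical sources whose schemas differ on purpose.
sales_db = make_source(
    "CREATE TABLE customers (cust_name TEXT, city TEXT)",
    "INSERT INTO customers VALUES (?, ?)",
    [("Ada", "London"), ("Grace", "New York")],
)
support_db = make_source(
    "CREATE TABLE clients (customerName TEXT, town TEXT)",
    "INSERT INTO clients VALUES (?, ?)",
    [("Edsger", "London"), ("Barbara", "Boston")],
)

# Each wrapper pairs a source with the subquery that expresses the
# federated request "customers in <city>" in that source's own schema.
WRAPPERS = [
    (sales_db, "SELECT cust_name FROM customers WHERE city = ?"),
    (support_db, "SELECT customerName FROM clients WHERE town = ?"),
]


def federated_customers_in(city):
    """Send one subquery per source, then combine the result sets."""
    names = []
    for conn, subquery in WRAPPERS:
        names.extend(row[0] for row in conn.execute(subquery, (city,)))
    return sorted(names)


print(federated_customers_in("London"))  # ['Ada', 'Edsger']
```

The caller sees one function and one result list; which sources were contacted, and how each names its columns, stays hidden behind the wrappers.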
Definition
McLeod and Heimbigner[1] were among the first to define a federated database system in the mid-1980s.
An FDBS is one which "define[s] the architecture and interconnect[s] databases that minimize central authority yet support partial sharing and coordination among database systems".[1] This description might not accurately reflect the McLeod/Heimbigner[1] definition of a federated database. Rather, this description fits what McLeod/Heimbigner called a composite database. McLeod/Heimbigner's federated database is a collection of autonomous components that make their data available to other members of the federation through the publication of an export schema and access operations; there is no unified, central schema that encompasses the information available from the members of the federation.
Among other surveys,[2] practitioners define a Federated Database as a collection of cooperating component systems which are autonomous and are possibly heterogeneous.
The three important components of an FDBS are autonomy, heterogeneity and distribution.[2] Other dimensions that have been considered are the networking environment (e.g., many DBSs over a LAN or many DBSs over a WAN) and the update-related functions of participating DBSs (e.g., no updates, nonatomic transitions, atomic updates).
FDBS architecture
A DBMS can be classified as either centralized or distributed. A centralized system manages a single database, while a distributed system manages multiple databases. A component DBS in a multidatabase system (MDBS) may be centralized or distributed. An MDBS can be classified into two types, federated and nonfederated, depending on the autonomy of its component DBSs. A nonfederated database system is an integration of component DBMSs that are not autonomous. A federated database system consists of component DBSs that are autonomous yet participate in a federation to allow partial and controlled sharing of their data.
Federated architectures differ based on levels of integration with the component database systems and the extent of services offered by the federation. An FDBS can be categorized as loosely or tightly coupled.
- Loosely coupled systems require component databases to construct their own federated schema. A user typically accesses other component database systems through a multidatabase language, but this removes any level of location transparency and forces the user to have direct knowledge of the federated schema. A user imports the data they require from other component databases and integrates it with their own to form a federated schema.
- Tightly coupled systems consist of component systems that use independent processes to construct and publicize an integrated federated schema.
Multidatabase systems, of which FDBSs are a specific type, can be characterized along three dimensions: distribution, heterogeneity and autonomy. Another characterization could be based on the dimension of networking, for example single databases or multiple databases in a LAN or WAN.
Distribution
Distribution of data in an FDBS is due to the existence of multiple DBSs before the FDBS is built. Data can be distributed among multiple databases, which may be stored on a single computer or on multiple computers. These computers may be located in different places but are interconnected by a network. Data distribution increases availability and reliability and improves access times.
Heterogeneity
Heterogeneities in databases arise from differences in structure, semantics of data, supported constraints or query language. Differences in structure occur when two data models provide different primitives, such as object-oriented (OO) models that support specialization and inheritance and relational models that do not. Differences due to constraints occur when two models support different constraints. For example, the set type in a CODASYL schema may be partially modeled as a referential integrity constraint in a relational schema, but CODASYL also supports insertion and retention constraints that are not captured by referential integrity alone. The query language supported by a DBMS can also contribute to heterogeneity among component DBMSs. For example, different query languages for the same data model, or different versions of the same query language, could contribute to heterogeneity.
Semantic heterogeneities arise when there is a disagreement about the meaning, interpretation or intended use of data. At the schema and data level, possible heterogeneities include:
- Naming conflicts, e.g. databases using different names to represent the same concept.
- Domain conflicts or data representation conflicts, e.g. databases using different values to represent the same concept.
- Precision conflicts, e.g. databases using the same data values from domains of different cardinalities for the same data.
- Metadata conflicts, e.g. the same concepts are represented at the schema level in one database and at the instance level in another.
- Data conflicts, e.g. missing attributes.
- Schema conflicts, e.g. table versus table conflicts, which include naming conflicts, data conflicts etc.
In creating a federated schema, one has to resolve such heterogeneities before integrating the component DB schemas.
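As an illustration of resolving the first two kinds of conflict in the list above, the sketch below maps records from two component databases onto one federated representation. The sources, attribute names and value encodings are invented for the example, and real systems usually express such mappings declaratively rather than as Python functions.

```python
# Hypothetical rows from two component databases describing employees.
hr_rows = [{"emp_name": "Ada", "gender": "F", "salary_usd": 5000}]
payroll_rows = [{"employeeName": "Alan", "sex": 1, "salary_cents": 450000}]

# One converter per federated attribute and source.
# Naming conflict: emp_name vs employeeName, gender vs sex.
# Domain conflict: "F"/"M" vs 0/1, dollars vs cents.
MAPPINGS = {
    "hr": {
        "name": lambda r: r["emp_name"],
        "gender": lambda r: r["gender"],
        "salary_usd": lambda r: float(r["salary_usd"]),
    },
    "payroll": {
        "name": lambda r: r["employeeName"],
        "gender": lambda r: {0: "F", 1: "M"}[r["sex"]],
        "salary_usd": lambda r: r["salary_cents"] / 100,
    },
}


def to_federated(source, row):
    """Rewrite one source row using the federated attribute names and domains."""
    return {attr: convert(row) for attr, convert in MAPPINGS[source].items()}


federated_rows = [to_federated("hr", r) for r in hr_rows]
federated_rows += [to_federated("payroll", r) for r in payroll_rows]
print(federated_rows)
```

After conversion both rows share the attribute names `name`, `gender` and `salary_usd`, so they can be compared or aggregated without knowing which source they came from.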
Schema matching, schema mapping
Dealing with incompatible data types or query syntax is not the only obstacle to a concrete implementation of an FDBS. In systems that are not planned top-down, a generic problem lies in matching semantically equivalent but differently named parts (tables, attributes) of different schemas (data models). A pairwise mapping between n attributes would result in n(n−1)/2 mapping rules (given equivalence mappings), a number that quickly gets too large for practical purposes. A common way out is to provide a global schema that comprises the relevant parts of all member schemas and to provide mappings in the form of database views. Two principal approaches depend on the direction of the mapping:
- Global as View (GaV): the global schema is defined in terms of the underlying schemas
- Local as View (LaV): the local schemas are defined in terms of the global schema
Both are examples of data integration, called the schema matching problem.
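A minimal sketch of the Global as View direction, assuming two tables in a single in-memory SQLite database stand in for the schemas of two autonomous sources; the table names, column names and the `global_customer` view are hypothetical. The small helper at the top shows how quickly the pairwise alternative grows.

```python
import sqlite3


def pairwise_rules(n):
    """Number of equivalence mappings needed to relate n attributes pairwise."""
    return n * (n - 1) // 2


print(pairwise_rules(10), pairwise_rules(100))  # 45 4950

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Stand-ins for the local schemas of two sources.
    CREATE TABLE src_a_customers (cust_name TEXT, city TEXT);
    CREATE TABLE src_b_clients (customerName TEXT, town TEXT);
    INSERT INTO src_a_customers VALUES ('Ada', 'London');
    INSERT INTO src_b_clients VALUES ('Edsger', 'Austin');

    -- GaV: the global relation is defined in terms of the underlying schemas.
    CREATE VIEW global_customer AS
        SELECT cust_name AS name, city AS city FROM src_a_customers
        UNION ALL
        SELECT customerName AS name, town AS city FROM src_b_clients;
    """
)

rows = conn.execute(
    "SELECT name, city FROM global_customer ORDER BY name"
).fetchall()
print(rows)  # [('Ada', 'London'), ('Edsger', 'Austin')]
```

With a global schema, each source attribute is mapped once to the global view instead of once to every other attribute. Under LaV the direction is reversed: each source would be described as a view over `global_customer`, which makes adding a source easier but answering queries harder.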
Autonomy
Fundamental to the difference between an MDBS and an FDBS is the concept of autonomy. It is important to understand the aspects of autonomy for component databases and how they can be addressed when a component DBS participates in an FDBS. Four kinds of autonomy are addressed:
- Design autonomy refers to the ability of a component DBS to choose its own design with respect to data, query language, conceptualization and functionality of the system implementation. Heterogeneities in an FDBS are primarily due to design autonomy.
- Communication autonomy refers to the ability of a DBMS to decide whether or not to communicate with other DBMSs.
- Execution autonomy allows a component DBMS to control the execution of operations requested locally or externally.
- Association autonomy gives a component DBS the power to disassociate itself from a federation, which means the FDBS can operate independently of any single DBS.
The ANSI/X3/SPARC Study Group outlined a three-level data description architecture, the components of which are the conceptual schema, internal schema and external schema of databases. The three-level architecture is, however, inadequate for describing the architecture of an FDBS. It was therefore extended to support the three dimensions of the FDBS, namely distribution, autonomy and heterogeneity. The resulting five-level schema architecture is explained below.
Concurrency control
The heterogeneity and autonomy requirements pose special challenges concerning concurrency control in an FDBS, which is crucial for the correct execution of its concurrent transactions (see also Global concurrency control). Achieving global serializability, the major correctness criterion, under these requirements has been characterized as very difficult and unsolved.[2]
Five level schema architecture for FDBSs
The five level schema architecture includes the following:
- Local Schema is basically the conceptual model of a component database expressed in a native data model.[3]
- Component schema is the subset of the local schema that the owner organisation is willing to share with other users of the FDBS and it is translated into a common data model.[3]
- Export Schema represents a subset of a component schema that is available to a particular federation.[3] It may include access control information regarding its use by a specific federation user. The export schema helps in managing flow of control of data.
- Federated Schema is an integration of multiple export schemas. It includes information on data distribution that is generated when integrating export schemas.[3]
- External schema is extracted from a federated schema, and is defined for the users/applications of a particular federation.[3]
While accurately representing the state of the art in data integration, the Five Level Schema Architecture above does suffer from a major drawback, namely IT imposed look and feel. Modern data users demand control over how data is presented; their needs are somewhat in conflict with such bottom-up approaches to data integration.
References
- McLeod and Heimbigner (1985). "A Federated Architecture for information management". ACM Transactions on Information Systems, Volume 3, Issue 3. pp. 253–278.
- Sheth and Larson (1990). "Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases". ACM Computing Surveys, Vol. 22, No. 3. pp. 183–236.
- Masood, Nayyer; Eaglestone, Barry (December 2003). "Component and Federation Concept Models in a Federated Database System" (PDF). Malaysian Journal of Computer Science. 16 (2): 47–57. Archived from the original (PDF) on 2016-03-07. Retrieved 2016-03-03.
External links
- DB2 and Federated Databases
- Issues of where to perform the join aka "pushdown" and other performance characteristics
- Worked example federating Oracle, Informix, DB2, and Excel
- Freitas, André, Edward Curry, João Gabriel Oliveira, and Sean O’Riain. 2012. “Querying Heterogeneous Datasets on the Linked Data Web: Challenges, Approaches, and Trends.” IEEE Internet Computing 16 (1): 24–33.
- IBM Gaian Database: A dynamic Distributed Federated Database
- Federated system and methods and mechanisms of implementing and using such a system
Federated database system
Introduction
Definition
A federated database system (FDBS) is a meta-database management system that integrates multiple pre-existing, autonomous database systems into a single virtual database, without physically moving or copying data from the source systems.[3] This approach allows for controlled sharing of data across distributed environments while maintaining the independence of each participating database.[3]
The primary purpose of an FDBS is to enable users to access and query heterogeneous, distributed data sources through a uniform interface, thereby facilitating global applications without disrupting local operations or requiring data centralization.[3] By providing this virtual integration, FDBS supports scenarios where organizations need to collaborate on information sharing while preserving the autonomy, heterogeneity, and distribution of their underlying databases.[3]
Key components of an FDBS include wrappers, which serve as transforming processors to translate queries and data between different schemas and database dialects; export schemas, which define the portions of local data made available to the federation; and the federated schema, which virtually integrates the exported data into a cohesive global view.[3] Unlike centralized database systems, an FDBS avoids data replication or relocation, instead relying on these components to enable seamless access across autonomous sources.[3]
Historical Development
The concept of a federated database system (FDBS) was introduced by Michael Hammer and Dennis McLeod in 1979, who coined the term in the context of database decentralization and information sharing.[1] This idea was further developed through early prototypes in the 1980s, such as Multibase and Mermaid, which demonstrated practical integration of heterogeneous databases.[1] The architecture was formalized in 1985 by Dennis Heimbigner and Dennis McLeod, who described it as a federated architecture uniting a collection of independent database management systems (DBMSs) into a loosely coupled federation, where cooperating systems share data through export schemas while maintaining autonomy.[4] This early vision emphasized the need for integration without centralization, addressing limitations in traditional distributed databases that required tight coupling and loss of local control.[4]
Key advancements in the late 1980s and early 1990s built on this foundation through influential publications. Amit Sheth and James Larson, in their 1990 survey, provided a comprehensive reference architecture for FDBSs, focusing on federated schemas that enable the management of distributed, heterogeneous, and autonomous databases by mapping multiple local schemas to a shared global view.[3] Concurrently, Witold Litwin, Leo Mark, and Nick Roussopoulos outlined schema architectures for multidatabase interoperability in 1990, introducing models that support schema translation and integration across autonomous systems to facilitate query processing without full data replication.[5]
During the 1990s and 2000s, FDBS concepts gained prominence amid the expansion of the internet and growing enterprise demands for integrating disparate data sources, such as in e-commerce and supply chain management, where virtual data views allowed real-time access without physical consolidation. This period also saw domain-specific approaches emerge, including the Open Geospatial Consortium's (OGC) specifications in the late 1990s and early 2000s, such as the Web Map Service (WMS) and Web Feature Service (WFS), which facilitated federated access to heterogeneous spatial data across distributed environments.[6][7]
Post-2010 developments have been shaped by the rise of NoSQL databases and cloud computing, which introduced new challenges in heterogeneity and scalability, prompting extensions to FDBS for handling unstructured data and elastic resources in environments like multi-cloud setups; however, full standardization remains ongoing, with efforts focusing on semantic interoperability rather than rigid protocols.[2]
Core Principles
Distribution
In a federated database system (FDBS), distribution refers to the storage of data across multiple networked, independent locations, such as autonomous database systems at different sites, without requiring centralization or physical relocation of the data. This approach allows participating databases to remain at their original sites while enabling controlled sharing through a virtual integration layer. Unlike traditional centralized systems, distribution in FDBS emphasizes logical federation over physical consolidation, interconnecting sites via communication networks to support collaborative access.[3][8]
The primary benefits of this distribution include enhanced availability and fault tolerance, as data replication or partitioning across sites ensures continued access even if individual nodes fail. Local data placement also reduces latency by minimizing the distance queries must travel, improving overall performance for geographically dispersed users. These advantages make FDBS particularly suitable for environments requiring high reliability, such as enterprise networks spanning multiple organizations.[9][10][8]
Access mechanisms in distributed FDBS rely on network protocols for data retrieval, such as standardized communication standards that facilitate request translation and response aggregation without moving data between sites. A federal controller or mediator layer coordinates these interactions, using mappings to route queries to the appropriate local databases while preserving site autonomy. This virtual federation approach ensures that data remains stationary, with only metadata and results transmitted over the network.[3][10][9]
At a high level, distribution introduces challenges like network overhead from inter-site communications, which can degrade performance for frequent remote accesses, and partial failures where one site's outage affects global operations without immediate recovery. These issues arise due to the inherent reliance on distributed infrastructure, though they are mitigated through design choices in federation protocols.[8][10][9]
Heterogeneity
Heterogeneity in federated database systems (FDBS) refers to the differences in data representation, structure, and meaning across interconnected, autonomous databases, arising primarily from the design autonomy of component systems. These variations encompass structural, syntactic, and semantic dimensions, each posing unique challenges to seamless integration and unified access. Early conceptualizations of FDBS, developed in the late 1980s and early 1990s, primarily addressed heterogeneity among relational and network database models, such as CODASYL, where differences in schema organization complicated data sharing. Over time, as data management evolved, heterogeneity expanded to include unstructured and semi-structured formats like XML documents and NoSQL stores, reflecting the growing diversity of information sources in modern environments.[1][11]
Structural heterogeneity involves disparities in the underlying data models and organization, such as integrating relational database management systems (RDBMS) with XML-based stores or NoSQL key-value databases. For instance, a relational table might represent customer information in normalized rows and columns, while an XML document stores it hierarchically with nested elements, and a NoSQL document store like MongoDB uses JSON-like structures for flexible schemas. These differences lead to mismatches in data types, such as treating an address as a composite attribute in one system versus a separate entity in another, requiring mappings to align representations. The impact is significant: direct querying across such systems becomes infeasible without translation layers, as structural incompatibilities can result in incomplete or erroneous data retrieval, necessitating intermediate common data models for integration.[1][11]
Syntactic heterogeneity manifests in variations of query languages and data formats, exemplified by the use of SQL in relational systems versus SPARQL for RDF/XML data in semantic web contexts or MongoDB's query API for NoSQL. Translating a SQL join operation to a SPARQL graph pattern or a NoSQL aggregation involves rewriting commands to account for differing syntax and operators, often introducing performance overhead. This type of heterogeneity hinders interoperability by demanding protocol wrappers or adapters at the federation layer, as seen in efforts to federate RDBMS with graph databases like Neo4j.[1][11]
Semantic heterogeneity arises from conflicting interpretations or terminologies of data elements, such as one database using "customer" for retail clients while another employs "client" for service subscribers, potentially excluding relevant records in federated queries. Classic examples include discrepancies in attribute meanings, like "meal cost" including tax in one source but excluding it in another, or grading scales differing between "A-F" and "0-10" systems, which affect data comparability and aggregation. These issues are particularly acute in integrating legacy relational systems with modern NoSQL stores handling unstructured data, where implicit assumptions about data semantics can lead to inconsistencies. The overall impact is a barrier to meaningful data fusion, often requiring ontology-based resolution or manual reconciliation to ensure accurate federated operations, though such efforts tie into broader schema mapping strategies.[12][1]
Autonomy
In a federated database system (FDBS), autonomy denotes the degree of independent control that component database systems (DBSs) maintain over their local operations, data, and policies while participating in the federation. This independence allows local DBSs to function without external interference from the federated layer or other participants, preserving their operational integrity and enabling voluntary cooperation.[1]
Sheth and Larson identify four primary types of autonomy in FDBSs, each addressing a distinct aspect of local control. Design autonomy permits component DBSs to independently select their data models, representations, semantics, constraints, functionality, associations, and implementations, such as using relational or hierarchical schemas without federation-wide standardization.[1] Communication autonomy enables DBSs to decide whether to communicate with others and to choose their protocols and interfaces for such interactions, ensuring that local communication policies remain intact.[1] Execution autonomy grants DBSs the right to perform local operations, including query execution and optimization, free from external directives, while also controlling the sequencing of any federated operations at their site.[1] Finally, association autonomy allows DBSs to determine the extent of resource and functionality sharing with the federation, including the freedom to join, leave, or adjust participation levels at will, thereby making membership voluntary.[1]
These autonomies collectively balance the benefits of global data access against the need for local sovereignty, fostering an environment where heterogeneous DBSs can interoperate without surrendering control. For instance, association autonomy underscores the voluntary nature of federation participation, allowing DBSs to export only selected data or schemas while retaining full authority over non-shared elements.[1] This setup supports controlled data sharing and accommodates diverse organizational policies, but it also introduces challenges in achieving seamless integration.
A key trade-off arises from varying degrees of autonomy: higher levels enhance flexibility and preserve site-specific optimizations, yet they complicate global consistency, query optimization, and transaction management by introducing uncertainties in costs, constraints, and execution behaviors.[1] To mitigate these issues, full autonomy is often partially relaxed through predefined agreements, such as notifying the federation of local execution orders or limiting certain concurrency controls, without fully compromising independence. Autonomy thus influences concurrency mechanisms by requiring adaptive protocols that respect local policies, though detailed implementations vary by system design.[1]
Architectural Framework
Schema Integration and Mapping
Schema integration in federated database systems (FDBS) involves the process of combining schemas from multiple autonomous and heterogeneous local databases into a unified federated virtual schema, enabling transparent access to distributed data without requiring physical data movement or replication. This integration addresses semantic, structural, and syntactic differences arising from the design autonomy of component databases, allowing users to query the federated schema as if it were a single coherent database. The primary goal is to establish correspondences between elements of local schemas, such as tables, attributes, and relationships, to support unified querying and data sharing across the federation.[1]
Two fundamental approaches to schema integration are the Global-as-View (GaV) and Local-as-View (LaV) paradigms. In the GaV approach, the global schema is defined as a set of views over the local schemas, meaning the federated virtual schema is constructed by specifying how global relations are derived from unions or joins of local data sources; this facilitates easier query reformulation from the global to local level but can complicate maintenance when local schemas evolve. Conversely, the LaV approach treats each local schema as a view over the global schema, allowing for more flexible handling of source updates and incompleteness, though it poses greater challenges for query answering due to the need for containment mappings. These paradigms provide the foundational mapping strategies for resolving heterogeneity in FDBS, with GaV often preferred for warehouse-like scenarios and LaV for mediator-based architectures.[13]
Schema matching and mapping constitute the core techniques for achieving integration. Schema matching identifies potential correspondences between schema elements, such as determining that an attribute "cust_name" in one local schema aligns with "customerName" in another, using methods like linguistic analysis of names, structural comparisons of keys and foreign keys, or instance-based analysis of data values. Once matches are identified, schema mapping defines the transformations needed to align them, including simple rules like attribute renaming or more complex ones such as value conversions (e.g., date formats) and aggregation functions. Attribute equivalence rules, for instance, might specify that two attributes represent the same real-world entity if their domains overlap and they share similar naming patterns or data distributions.[14]
To enhance efficiency, especially in large-scale federations, semi-automated tools leveraging machine learning have been developed for schema matching. Systems like Cupid employ probabilistic models to select relevant features (e.g., attribute names, types, and instance values) and predict matches with high accuracy in various benchmarks. These methods reduce manual effort while handling the combinatorial explosion of potential mappings in heterogeneous environments, though human validation remains essential for semantic nuances. Recent advancements incorporate federated learning for privacy-preserving schema matching using hybrid feature sets, improving scalability and accuracy in distributed settings as of 2025.[14][15]
Five-Level Schema Architecture
The five-level schema architecture for federated database systems (FDBS) extends the ANSI/SPARC three-schema architecture, which originally comprised internal, conceptual, and external levels, to accommodate the challenges of distribution, heterogeneity, and autonomy in multi-database environments.[3] This extension, proposed by Sheth and Larson in 1990, introduces additional layers to facilitate controlled data sharing among autonomous component database systems (DBSs) without requiring their full restructuring.[3] The architecture consists of five distinct schema levels, each serving a specific purpose in bridging local database structures to a unified global view:
- Local Schema: This represents the conceptual schema of an individual component DBS, defined in its native data model and tailored to the specific DBMS implementation. It captures the internal structure and semantics of the local data, managed entirely by the component DBA.[3]
- Component Schema: Derived from the local schema through translation into a canonical data model (CDM), this level abstracts DBMS-specific details and incorporates additional semantics to support integration and negotiation with other systems. It enables heterogeneity management by standardizing representations while preserving the original data's fidelity.[3]
- Export Schema: A controlled subset of the component schema, this defines the data and operations that the component DBS is willing to share with the federation. It includes access controls and constraints to enforce autonomy, allowing component DBAs to limit exposure without altering underlying local structures.[3]
- Federated Schema: This integrates multiple export schemas from participating component DBSs into a cohesive virtual view, incorporating distribution information such as data locations and mappings. Managed by the federation DBA, it provides a global perspective while maintaining loose coupling among components.[3]
- External Schema: Tailored subsets or views of the federated schema, these are customized for specific users or applications, applying additional constraints, access controls, and presentations to meet diverse needs. Multiple external schemas can coexist over a single federated schema, enhancing flexibility.[3]
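The way each level is derived from the one below it can be sketched as a chain of small transformations. This is an illustrative toy, not the reference architecture itself: the schema contents, the type translation table, and the rule of integrating attributes by lowercased name are all assumptions made for the example.

```python
# Local schema: one component DBS's schema in its native data model
# (hypothetical table, columns and native type names).
local_schema = {
    "EMP": {"ENAME": "VARCHAR2(40)", "SAL": "NUMBER(8,2)", "SSN": "CHAR(9)"},
}

# Assumed translation from native types to a canonical data model (CDM).
NATIVE_TO_CDM = {"VARCHAR2": "string", "NUMBER": "decimal", "CHAR": "string"}


def to_component(local):
    """Component schema: the local schema translated into the CDM."""
    return {
        table: {col: NATIVE_TO_CDM[native.split("(")[0]] for col, native in cols.items()}
        for table, cols in local.items()
    }


def to_export(component, shared):
    """Export schema: only the columns the component DBA agrees to share."""
    return {
        table: {col: typ for col, typ in cols.items() if col in shared[table]}
        for table, cols in component.items()
        if table in shared
    }


def to_federated(exports):
    """Federated schema: integrated exports plus where each attribute lives."""
    federated = {}
    for site, export in exports.items():
        for table, cols in export.items():
            for col, typ in cols.items():
                entry = federated.setdefault(col.lower(), {"type": typ, "sites": []})
                entry["sites"].append(f"{site}.{table}.{col}")
    return federated


def to_external(federated, visible):
    """External schema: the slice of the federated schema one user group sees."""
    return {attr: info["type"] for attr, info in federated.items() if attr in visible}


component = to_component(local_schema)
export = to_export(component, {"EMP": {"ENAME", "SAL"}})  # SSN is withheld
federated = to_federated({"site1": export})
external = to_external(federated, {"ename"})
print(external)  # {'ename': 'string'}
```

The withheld `SSN` column never leaves the component level, which is the point of the export schema: autonomy is enforced before integration rather than by filtering afterwards.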
