Corosync Cluster Engine
| Corosync Cluster Engine | |
|---|---|
| Developer | The Corosync Development Community |
| Initial release | 2008 |
| Stable release | 3.1.10[1] |
| Repository | |
| Written in | C |
| Operating system | Cross-platform |
| Type | Group Communication System |
| License | New BSD License |
| Website | corosync |
The Corosync Cluster Engine is an open source implementation of the Totem Single Ring Ordering and Membership protocol. It was originally derived from the OpenAIS project and licensed under the new BSD License. The mission of the Corosync effort is to develop, release, and support a community-defined, open source cluster.
Features
The Corosync Cluster Engine is a group communication system with additional features for implementing high availability within applications.
The project provides four C application programming interface (API) features:
- A closed process group communication model with virtual synchrony guarantees for creating replicated state machines.
- A simple availability manager that restarts the application process when it has failed.
- An in-memory configuration and statistics database that lets applications set and retrieve values and receive change notifications.
- A quorum system that notifies applications when quorum is achieved or lost.
The software is designed to operate on UDP/IP and InfiniBand networks.
Architecture
The software is composed of an executive binary that uses a client-server communication model between libraries and service engines. Loadable modules, called service engines, are loaded into the Corosync Cluster Engine and use the services provided by the Corosync Service Engine internal API.
The services provided by the Corosync Service Engine internal API are:
- An implementation of the Totem Single Ring Ordering and Membership[2] protocol providing the Extended Virtual Synchrony model[3] for messaging and membership.
- The coroipc high performance shared memory IPC system.[4]
- An object database that implements the in-memory database model.
- Systems to route IPC and Totem messages to the correct service engines.
Additionally, Corosync provides several default service engines that are used via C APIs:
- cpg - Closed Process Group
- sam - Simple Availability Manager
- confdb - Configuration and Statistics database
- quorum - Provides notifications of gain or loss of quorum
History
The project was formally announced in July 2008 via a conference paper at the Ottawa Linux Symposium.[5] The source code of OpenAIS was refactored such that the core infrastructure components were placed into Corosync and the SA Forum APIs were kept in OpenAIS.
In the second version of Corosync, published in 2012, the quorum subsystem was redesigned and integrated into the daemon.[6] This version has been available since Fedora 17 and RHEL 7.[7]
Development of the Flatiron branch (1.4.x) ended with the 1.4.10 release.[8] The Needle branch was declared stable with the 2.0.0 release on 10 April 2012.[9][10] Development of that branch stopped with the 2.4.6 release on 9 November 2022, because the 3.x branch (Camelback) was considered stable after almost four years of work.[9]
References
1. "Release 3.1.10". 15 November 2025. Retrieved 17 November 2025.
2. Amir, Y.; Moser, L.E.; Melliar-Smith, P.M.; Agarwal, D.A.; Ciarfella, P. (November 1995). "The Totem Single Ring Ordering and Membership Protocol". ACM Transactions on Computer Systems. 13 (4): 311–342. doi:10.1145/210223.210224.
3. Moser, L.E.; Amir, Y.; Melliar-Smith, P.M.; Agarwal, D.A. (November 1995). "Extended Virtual Synchrony". ACM Transactions on Computer Systems. 13 (4): 311–342. doi:10.1145/210223.210224. Also in Proceedings of DCS, pp. 56–65, 1994.
4. Dake, S. (July 2009). "The Corosync High Performance Shared Memory IPC Reusable C Library" (PDF). Proceedings of the Linux Symposium: 61–68.
5. Dake, S.; Caulfield, C.; Beekhof, A. (July 2008). "The Corosync Cluster Engine" (PDF). Proceedings of the Linux Symposium: 85–99.
6. Caulfield, Christine. "New quorum features in Corosync 2". 2012–2016.
7. "Linux Cluster next generation". LVEE, 2013.
8. "Releases v1.4.10". GitHub. Retrieved 23 November 2022.
9. "Releases v2.4.6". GitHub. Retrieved 23 November 2022.
10. "Releases v2.0.0". GitHub. Retrieved 23 November 2022.
External links
- Official website
- "The Totem Single-Ring Ordering and Membership Protocol". CiteSeerX 10.1.1.37.767.
- "Totem: A Reliable Ordered Delivery Protocol for Interconnected Local-Area Networks". CiteSeerX 10.1.1.52.4028.
- "Extended Virtual Synchrony model". CiteSeerX 10.1.1.55.8677.
- Corosync High Performance Shared Memory IPC Reusable C Library
Overview
Definition and Purpose
The Corosync Cluster Engine is an open-source Group Communication System that implements the Totem Single Ring Ordering and Membership protocol to facilitate reliable messaging in clustered environments.[6][7] This protocol uses a token-passing mechanism over a logical ring topology, ensuring deterministic ordering of messages across nodes.[8] Its primary purpose is to provide foundational primitives for fault-tolerant group messaging, membership tracking, and high availability in Linux-based clusters, allowing applications to detect and respond to failures without data loss or inconsistency.[6] By separating core communication infrastructure from higher-level services, Corosync supports interoperability and scalability in distributed systems.[7]
Key use cases include enabling applications to maintain consistency during node failures, network partitions, or merges in demanding environments such as server farms for web services, shared storage systems for data redundancy, and cloud infrastructures for virtualized workloads.[9][10][11] For instance, it underpins high-availability setups where rapid failover is critical to minimize downtime.[12]
Corosync operates on a group communication model featuring closed process groups, where members deliver messages with total order guarantees and reach agreement on membership changes to ensure a consistent view across the cluster.[13] This model, rooted in extended virtual synchrony, delivers messages and configuration updates in a system-wide consistent sequence, even amid partitions or restarts.[8][7]
Licensing and Development
Corosync Cluster Engine is released under the 3-clause BSD License, a permissive open-source license that allows redistribution and modification for both commercial and non-commercial purposes, provided the copyright notice, conditions, and disclaimer are retained in all copies or substantial portions of the software.[14] This licensing model, originally associated with copyrights held by entities like MontaVista Software and Red Hat, Inc., facilitates widespread adoption by minimizing legal barriers while prohibiting the use of contributor names for endorsement without permission.[14]
The software is developed collaboratively by the Corosync Development Community, with its source code repository hosted on GitHub, enabling contributions from a diverse group of developers and organizations.[15] Key contributors include teams from Linux distributions such as Red Hat, which has held significant copyright interests since 2005, SUSE, which integrates and maintains Corosync in its high availability extensions, and Proxmox Server Solutions, which relies on it as a core component for clustering in its virtualization platform.[14][16] This community-driven approach ensures ongoing enhancements through pull requests, issue tracking, and collaborative releases.
Corosync is implemented primarily in the C programming language to optimize performance in cluster environments, leveraging low-level system calls for efficient operation.[7] It employs POSIX-compliant APIs to enhance portability, targeting Unix-like operating systems, primarily Linux distributions, with support for multiple hardware architectures including x86, ARM, and PowerPC.[17][18] As of 2025, the project maintains active development under the Camelback branch (the 3.x series), with regular maintenance releases that address security vulnerabilities and introduce stability improvements; the latest stable version, 3.1.10, released on November 15, 2025, is distributed through major Linux repositories.[3]
Key Features
Communication Protocols
The Corosync Cluster Engine primarily relies on the Totem Single Ring Ordering and Membership Protocol to facilitate reliable group communication among cluster nodes. This protocol employs a single-ring token-passing mechanism over a broadcast domain, such as Ethernet, where a logical token circulates among nodes to enable multicast messaging. Each node appends messages to the token before passing it to the next node, ensuring that all messages are delivered in a total order determined by the token's sequence number. This approach guarantees stable delivery, meaning messages are safely ordered and acknowledged by all nodes in the current configuration before being processed, even in the face of node failures or network partitions.[8][7]
At the core of Totem's reliability is the Extended Virtual Synchrony (EVS) model, which extends traditional virtual synchrony to handle partitionable environments and node restarts. EVS provides key guarantees including agreement, where all non-faulty nodes deliver the same set of messages in the same order; integrity, ensuring each message is delivered exactly once with a unique identifier; and virtual synchrony, which maintains consistent views of message delivery and membership changes across partitions. These properties enable event delivery to applications in a system-wide consistent manner, allowing distributed systems to coordinate actions reliably despite transient failures.[8][7]
Corosync supports multiple network layers through Totem, including UDP/IP for unicast and multicast over IPv4 and IPv6, as well as InfiniBand for low-latency, high-throughput communication in high-performance computing environments. Redundancy is achieved via the Totem Redundant Ring Protocol, which operates multiple independent rings over separate network interfaces to tolerate link or interface failures without disrupting the cluster.
Fault tolerance is further enhanced by automatic membership changes triggered by the protocol's membership service, which detects partitions through heartbeat timeouts (typically monitoring token circulation delays) and employs merge algorithms to resolve network splits by selecting a primary partition and reintegrating others based on quorum or configuration rules.[7][19] Specific protocol mechanisms include a default token rotation interval of 1000 milliseconds, which balances responsiveness with network overhead by setting the expected time for the token to complete a full ring cycle. To handle large payloads, Totem supports message fragmentation, breaking messages into smaller segments that are reassembled at the receiver while preserving order. Flow control is integrated through the token-passing discipline and acknowledgment-based recovery, preventing congestion by limiting outstanding messages and retransmitting lost fragments during ring reconfiguration.[7][8]
Application Programming Interfaces
The Corosync Cluster Engine provides several C-based application programming interfaces (APIs) that enable developers to build cluster-aware applications with high availability features. These APIs abstract the underlying group communication and membership protocols, offering guarantees for message delivery, state consistency, and fault tolerance. The primary APIs include the Closed Process Group (CPG) for multicast messaging, the Simple Availability Manager (SAM) for process monitoring and recovery, the Configuration Database (ConfDB) for state management, and the Quorum API for cluster health assessment.[20][21]
The Closed Process Group (CPG) API facilitates multicast messaging within dynamically formed groups of processes across cluster nodes. It supports functionalities such as joining or leaving groups, sending messages to group members, delivering configuration changes, and iterating over group membership. The API ensures extended virtual synchrony guarantees, including self-delivery, causal ordering (where messages from the same sender are delivered in send order), and total (agreed) ordering for multicast messages, which is essential for implementing replicated state machines without complex synchronization logic.[22][13][7] Developers initialize a CPG connection using cpg_model_initialize, specifying callbacks for message delivery (cpg_deliver_fn_t) and configuration changes (cpg_confchg_fn_t), along with a context for application data. For example, a basic setup might involve creating a handle with model CPG_MODEL_V1, registering callbacks to process incoming messages or membership updates, and dispatching events via cpg_dispatch in a loop; error handling includes checking return codes like CS_ERR_TRY_AGAIN for transient failures or CS_ERR_BAD_HANDLE for disconnections, prompting reconnection attempts.[23][24]
The Simple Availability Manager (SAM) API manages the health and availability of application processes, particularly during cluster membership changes. It performs periodic health checks—either application-driven or event-driven via registered callbacks—and restarts unresponsive processes by sending signals (default SIGTERM, escalating to SIGKILL if necessary). SAM integrates with cluster events to provide availability notifications and supports resource fencing by enforcing recovery policies that prevent split operations during node failures or partitions, configurable through restart counters and intervals.[25][26] Initialization occurs via sam_initialize, followed by sam_register to monitor a process, with optional sam_hc_callback_register for custom health checks; errors such as failed restarts are handled by querying restart counts or adjusting recovery policies.
The Configuration Database (ConfDB) API offers access to an in-memory, hierarchical database for storing and retrieving cluster state, configuration parameters, and statistics. It allows applications to set key-value pairs, query object hierarchies (e.g., by parent and object handles), and receive notifications of changes through dispatch mechanisms. This API ensures consistent state propagation across nodes, supporting reliable data access during runtime updates without persistent storage overhead.[27][20] Connections are established with confdb_initialize, providing a callback (confdb_change_notify_fn_t) for updates on keys like object names and values; developers dispatch changes via confdb_dispatch and handle errors such as CONFDB_ERR_NOT_FOUND for missing objects.[27]
The Quorum API enables applications to monitor cluster health and make decisions based on majority voting to prevent split-brain scenarios, where partitioned subsets might act independently. It provides queries for current quorum status (e.g., whether the cluster has a majority) and notifications for state transitions, such as quorum gain or loss, often tied to node membership changes. This helps applications pause operations or failover resources when quorum is lost, ensuring data integrity.[28][20] Usage involves initializing a handle with quorum_initialize, registering for events via callbacks, and dispatching with quorum_dispatch to process flags like CS_DISPATCH_ALL; error handling includes verifying quorum before critical actions to avoid operations in minority partitions.[29]
System Architecture
Core Components
The Corosync Cluster Engine employs a client-server architecture, where the executive binary, corosync, operates as the central server daemon responsible for managing all cluster logic, including communication protocols, service orchestration, and state synchronization across nodes.[7] This executive handles incoming requests from client processes via a thin inter-process communication (IPC) layer, ensuring efficient and secure interaction without direct access to internal components.[7] Client libraries, such as those providing SA Forum Application Interface Specification (AIS) APIs, allow third-party applications to connect to the executive and access cluster services, using file descriptors for request-response exchanges.[7][19]
CoroIPC serves as the shared memory-based IPC mechanism facilitating high-performance local messaging between the executive and connected clients or services.[30] It utilizes mmap() for zero-copy communication, mapping shared memory regions that include control buffers, request/response queues, and dispatch channels, with System V semaphores for signaling. Each connection provides two file descriptors—one for blocking synchronous requests and another for non-blocking asynchronous callbacks—enabling thread-safe operations secured by UID/GID checks to prevent unauthorized access.[30] This design achieves low-latency performance, supporting up to 1 million transactions per second in multi-client scenarios on modern hardware.[30]
The Object Database (ObjDB) functions as an in-memory, non-persistent storage system for configuration data and runtime state, organized in a hierarchical tree structure of objects and key-value pairs.[7] Objects act as containers (e.g., logging.logger), while keys store values (e.g., object.key=value), supporting operations like creation, deletion, and validation via callbacks to ensure data integrity.[7][31] Runtime modifications to ObjDB are managed through tools like corosync-objctl, which allow querying, setting, or tracking changes without disrupting cluster operations.[31]
Message routing within Corosync occurs through an internal service manager that forwards IPC requests from clients to the appropriate service engines while delivering multicast messages from the Totem protocol layer across the cluster.[7] This mechanism enforces isolation between services and clients, routing responses back via CoroIPC channels and ensuring secure, ordered delivery without exposing underlying protocol details.[7]
The startup process begins with the executive daemon (corosync) initializing from the configuration file /etc/corosync/corosync.conf, loading ring parameters and authentication keys (e.g., via corosync-keygen).[19] The configuration engine parses and populates the ObjDB, followed by the service manager activating loaded service engines in sequence.[7] Once initialized, the executive establishes Totem protocol connections for cluster membership and begins accepting client connections via CoroIPC.[7][19]
Service Engines
The Corosync Cluster Engine employs loadable modules known as service engines to implement specific cluster functionalities, allowing the core infrastructure to remain modular and extensible without embedding all features directly into the executive. These engines leverage the internal service engine API to interact with the underlying transport and membership layers, enabling developers to build high-availability applications atop a standardized foundation.[7][6]
Corosync includes several default service engines that provide essential capabilities for cluster operations. The Closed Process Group (CPG) engine facilitates group communication with virtual synchrony guarantees, allowing applications to join process groups and multicast messages reliably across nodes using APIs such as cpg_join() and cpg_mcast_joined().[7][6] The Simple Availability Manager (SAM) engine handles availability management by monitoring application processes through health checks and restarting them if they become unresponsive, employing a forked server process to enforce recovery policies like signal-based termination followed by restarts.[25][6] The Configuration Database (ConfDB) engine maintains an in-memory object database for storing and retrieving cluster configuration and statistics, supporting operations even when the engine is offline and providing change notifications via callbacks.[7][6] Finally, the Quorum engine oversees cluster membership and consistency by tracking votes to prevent split-brain scenarios, notifying applications of quorum status changes to ensure safe operations.[6][32]
Service engines are dynamically loaded by the Corosync executive during startup, based on configuration directives, using a Live Component Replacement (LCR) mechanism that injects complete C interfaces into the process address space without restarting the engine. Each engine registers callbacks with the service manager for key events, including initialization, message processing, and membership changes, allowing seamless handling of network partitions or merges.[7][33]
The Quorum engine supports configurable policies to adapt to various cluster topologies, such as a default majority requiring more than 50% of votes (e.g., five votes in an eight-node cluster with one vote per node) or specialized modes like two-node setups where quorum can be achieved with a single node. It integrates with expected votes, which can be statically defined or dynamically adjusted via features like Last Man Standing, enabling clusters to shrink gracefully as nodes fail while maintaining consistency.[32][7]
Corosync's extensibility allows third-party developers to create custom service engines through a plugin API, where modules implement a defined lifecycle including initialization (exec_init_fn), finalization (exec_exit_fn), recovery during partitions (sync_recover_fn), and event processing callbacks. This design supports integration with external tools like Pacemaker without altering the core engine.[33][7]
Engines interact internally via the service engine API, routing requests and events through the service manager, while the Totem protocol layer provides the underlying transport for ordered, reliable message delivery across the cluster. This model ensures that engines like Quorum and CPG can synchronize state changes, such as checkpoints after partitions, using iterative algorithms to maintain consistency without direct peer-to-peer dependencies.[7][6]
Configuration and Deployment
Basic Setup
The installation of Corosync begins with obtaining the package from the distribution's repositories, as it is available in most major Linux distributions. On Red Hat Enterprise Linux and compatible systems such as Fedora or CentOS, enable the High Availability repository and install using dnf install corosync or yum install corosync.[34] On Debian and Ubuntu systems, install via apt install corosync. Corosync depends on libraries such as libqb for inter-process communication and IPC mechanisms, which are typically pulled in automatically by the package manager.[35]
After installation, the core configuration occurs in the /etc/corosync/corosync.conf file, which defines the cluster's communication parameters and node details.[36] This file consists of top-level sections such as totem for protocol settings, nodelist for node specifications, quorum for membership rules, and logging for output control. In the totem section, specify the transport protocol—either knet (the default and recommended for modern setups, supporting multiple redundant rings and encryption) or the legacy udp (using multicast). For knet, define ring interfaces under interface subsections with parameters like linknumber (starting from 0) and bindnetaddr (the network address or subnet for binding, e.g., 192.168.1.0). Node IDs are assigned in the nodelist section using unique 32-bit integers greater than 0 (e.g., nodeid: 1 for the first node), along with each node's IP addresses. Key parameters include token (default 3000 ms, the timeout before declaring a token loss and potential partition) and consensus (default 3600 ms, the time to achieve quorum agreement, minimum 1.2 times the token value).[36]
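Putting those sections together, a minimal two-node knet configuration might look as follows; the cluster name, addresses, and timeout values are illustrative, not defaults to copy blindly:

```
totem {
    version: 2
    cluster_name: examplecluster   # illustrative name
    transport: knet
    token: 3000
    consensus: 3600                # at least 1.2 x token

    interface {
        linknumber: 0
    }
}

nodelist {
    node {
        ring0_addr: 192.168.1.11   # illustrative addresses
        nodeid: 1
    }
    node {
        ring0_addr: 192.168.1.12
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    expected_votes: 2
    two_node: 1
}

logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
}
```

The same file, apart from node-specific details, must be present on every node for the cluster to form.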
To initialize the cluster, copy the configured /etc/corosync/corosync.conf to all nodes, ensuring identical content except for node-specific details like IP addresses. Start the Corosync daemon on each node with systemctl start corosync and enable it for boot with systemctl enable corosync.[37] Verify the cluster formation and ring status using corosync-cfgtool -s, which displays output like "Printing ring status" followed by details on each ring ID, such as active status or faults. For quorum, a basic two-node setup requires an expected_votes value of 2 and two_node: 1 in the quorum section to establish majority.
Basic troubleshooting involves checking logs located in /var/log/cluster/corosync.log for errors, as configured by the logging section's to_logfile directive (default enabled).[36] Common issues include ring faults marked as "FAULTY" in logs, often due to network misconfigurations or interface mismatches across nodes; resolve by verifying bindnetaddr and linknumber consistency in corosync.conf and testing connectivity with ping. Another frequent error is ring ID mismatches during cluster join, caused by differing configuration versions—increment the config_version in totem and synchronize files to correct this.[38] If consensus fails, adjust token and consensus values based on network latency, but avoid values below recommended minimums to prevent false partitions.[36]
For enhanced reliability in two-node setups, configure a quorum device (qdevice) to provide an additional vote and break ties in case of partitions, preventing split-brain issues.[39]
Integration with Cluster Managers
Corosync primarily integrates with the Pacemaker cluster resource manager to form full high-availability (HA) clusters, where Corosync serves as the underlying communication layer responsible for messaging, membership tracking, and quorum determination, while Pacemaker handles resource allocation, monitoring, and failover decisions.[40] This separation allows Corosync to focus on reliable inter-node communication, enabling Pacemaker to detect failures and orchestrate resource movements without direct involvement in low-level networking. In such setups, Corosync's APIs, including Closed Process Groups (CPG) for multicast messaging and the Configuration Database (ConfDB) for storing cluster state, provide the foundational services that Pacemaker relies on for coordinated operations.[7]
Corosync's integration extends to major Linux distributions' HA solutions, enhancing enterprise-grade clustering. In Red Hat Enterprise Linux (RHEL), the High Availability Add-On pairs Corosync with Pacemaker to manage services like databases and web servers across nodes, supporting configurations up to 32 nodes in standard setups.[37][41] Similarly, SUSE Linux Enterprise High Availability (SLE HA) incorporates Corosync as the messaging layer alongside Pacemaker, facilitating active/active and active/passive clusters with up to 32 nodes and features like resource migration.[42] For virtualization, Proxmox Virtual Environment (Proxmox VE) leverages Corosync's cluster engine for node synchronization and HA, enabling live migration of virtual machines in distributed environments.[43]
Introduced in Corosync 3.x, the Kronosnet (knet) layer acts as a modern transport abstraction that enhances integrations by providing built-in redundancy across multiple network links, optional encryption via libraries like NSS or OpenSSL, and compression to optimize bandwidth in HA setups.[44] This layer replaces older transports like UDP, allowing seamless failover and secure communication in Pacemaker-managed clusters without requiring external tools.[45]
The integration yields key benefits for HA ecosystems, including resource fencing to isolate faulty nodes and prevent data corruption, STONITH (Shoot The Other Node In The Head) mechanisms to power off unresponsive nodes via external devices like IPMI, and support for live migration of stateful resources such as virtual machines with minimal downtime.[37] These features ensure cluster integrity during failures, as demonstrated in RHEL and SLE HA deployments where STONITH blocks split-brain scenarios.[42]
A typical workflow for configuring Pacemaker with Corosync involves initializing the cluster via tools like pcs or crm, where Pacemaker subscribes to Corosync's CPG for real-time node membership updates and event notifications, ensuring synchronized actions across nodes. Resource definitions, such as virtual IP addresses or services, are then stored in ConfDB, which Pacemaker queries to enforce constraints like colocation or ordering during failover; for instance, a primitive resource defined in the Cluster Information Base (CIB) references Corosync's configuration for consistent state propagation.[40]
