There are two key elements of keeping distributed systems up and running: redundancy within a site (called high availability or “HA”) and redundancy across multiple sites (called disaster recovery or “DR”). Solace PubSub+ gives you both without buying, deploying, and integrating third-party tools. I recently introduced Solace PubSub+ Event Broker’s High Availability functionality and will now introduce its disaster recovery capabilities.
Introduction to Disaster Recovery
Here is the definition of disaster recovery from Wikipedia: Disaster Recovery involves a set of policies, tools, and procedures to enable the recovery or continuation of vital technology infrastructure and systems following a natural or human-induced disaster.
In simple terms, DR is the procedure an organization follows to keep its business processes and systems up and running when a datacenter goes down due to a natural disaster, network disconnection, or loss of power. Most mission-critical applications in banking, capital markets, and exchanges cannot afford to go down, so it’s important for organizations to invest in a secondary “DR site” that can quickly take over the systems lost in a datacenter-level outage.
Before going into details of how Solace PubSub+ Event Broker achieves disaster recovery, I would like to introduce some of the basic terms that we use in the context of disaster recovery: Recovery Time Objective (RTO) and Recovery Point Objective (RPO). Here is how Wikipedia defines them:
- Recovery Time Objective or RTO: RTO is the targeted duration of time and a service level within which a business process must be restored after a disaster (or disruption) to avoid unacceptable consequences associated with a break in business continuity.
- Recovery Point Objective or RPO: RPO is the maximum targeted period in which data (transactions) might be lost from an IT service due to a major incident.
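To make the two terms concrete, here is a minimal Python sketch that measures one outage against its objectives. The function names and the timestamps are hypothetical and chosen only for illustration: recovery time is compared with the RTO, and the window of unreplicated data is compared with the RPO.

```python
from datetime import datetime, timedelta


def recovery_metrics(outage_start, service_restored, last_replicated):
    """Return (recovery_time, data_loss_window) for a single outage."""
    recovery_time = service_restored - outage_start
    # Data written after the last replicated message and before the outage is lost.
    data_loss_window = max(outage_start - last_replicated, timedelta(0))
    return recovery_time, data_loss_window


def meets_objectives(recovery_time, data_loss_window, rto, rpo):
    """True only if both the RTO and the RPO were honoured."""
    return recovery_time <= rto and data_loss_window <= rpo


# Illustrative numbers only: a 12-minute recovery with 30 seconds of unreplicated data.
outage = datetime(2024, 1, 1, 12, 0, 0)
restored = outage + timedelta(minutes=12)
last_copy = outage - timedelta(seconds=30)

rt, loss = recovery_metrics(outage, restored, last_copy)
print("recovery time:", rt, "| data loss window:", loss)
print("meets RTO=15min, RPO=0:", meets_objectives(rt, loss, timedelta(minutes=15), timedelta(0)))
```

In this example the recovery time is within a 15-minute RTO, but a zero RPO is missed because 30 seconds of data never reached the second site.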
Now that we have some basics of what disaster recovery is, let’s dive into how it is handled in Solace PubSub+ Event Broker.
Disaster Recovery in Solace PubSub+ Event Broker
Advantages of DR in PubSub+
Solace PubSub+ provides DR functionality without using complex external mechanisms like storage replication, mirror gateways, or plugins. It can automatically replicate messages and message delivery state from event brokers in the active site to event brokers in the DR site. Messages can be replicated either synchronously or asynchronously, depending on the significance of the message. For example, messages on a payment topic can be replicated synchronously, while messages on a less critical topic (such as log messages) can be replicated asynchronously. Solace PubSub+ can also propagate all configuration changes from the active site brokers to the DR site brokers, so there is no need to coordinate or port configuration changes among the brokers.
The broker can be set up to achieve an RPO of zero and a minimal RTO.
DR Setup
The figure below shows the high-level setup of two brokers in a pair of linked datacenters.
As you can see, a replication link is set up across the two sites. Replication runs over the WAN link between the two datacenters. Both the message state and the message contents are replicated between the two brokers. Only guaranteed messages are replicated. Direct (QoS 0) messages are not replicated, because their quality of service is “at most once” delivery.
DR Modes of Replication
Replication can be performed in two modes: synchronous replication and asynchronous replication. The mode can be configured on a per-topic basis.
Synchronous Replication
In this mode, the message is replicated to both datacenters before the acknowledgment is sent to the publisher. This ensures that no message is lost in case of disaster. With this kind of replication, an organization can achieve an RPO of zero. However, keep in mind that there is a performance penalty for the publisher in this mode, as the publisher’s message rate depends on the round-trip latency between the two datacenters.
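The effect of round-trip latency on the publisher can be estimated with simple arithmetic. The sketch below assumes a simplified model that is not taken from the Solace documentation: the publisher may have at most `window` unacknowledged messages in flight, and each acknowledgment takes the local persist time plus, for synchronous replication, the WAN round trip. The parameter names and the numbers are illustrative.

```python
def max_publish_rate(window, local_persist_ms, wan_rtt_ms=0.0):
    """Upper bound on messages per second when at most `window`
    messages may be unacknowledged at any time (simplified model)."""
    ack_latency_s = (local_persist_ms + wan_rtt_ms) / 1000.0
    return window / ack_latency_s


# Illustrative numbers: 1 ms to persist locally, 19 ms round trip between sites.
async_rate = max_publish_rate(window=10, local_persist_ms=1.0)
sync_rate = max_publish_rate(window=10, local_persist_ms=1.0, wan_rtt_ms=19.0)
print(f"async bound: {async_rate:.0f} msg/s, sync bound: {sync_rate:.0f} msg/s")
```

The model ignores bandwidth limits and batching, so treat the result as a rough ceiling. It does show why a larger in-flight window or a shorter inter-site link reduces the cost of synchronous replication.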
The message flow for this kind of replication is shown in the figure below:
Asynchronous Replication
In this mode, the acknowledgment is sent to the publisher as soon as the message is persisted on the primary site. The message is still sent to the DR datacenter, but asynchronously: the acknowledgment to the publisher does not wait for the acknowledgment from the DR site. Thus, at the moment the publisher receives its acknowledgment, the message is guaranteed to be persisted only on the primary site. This mode carries some risk of message loss, but it improves the publisher’s message rate.
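The difference between the two modes comes down to which messages exist only on the primary site at the moment of a failure. The following toy model is a sketch, not the broker's implementation: the class and method names are hypothetical, and message ordering across modes is not modelled.

```python
from collections import deque


class ActiveSite:
    """Toy model of an active site replicating to a DR site."""

    def __init__(self):
        self.spool = []          # messages persisted on the primary site
        self.pending = deque()   # acknowledged to the publisher, not yet on the DR site
        self.dr_spool = []       # stands in for the DR site's copy

    def publish(self, msg, mode):
        self.spool.append(msg)
        if mode == "sync":
            self.dr_spool.append(msg)   # DR copy exists before the publisher is acked
        else:
            self.pending.append(msg)    # ack now, replicate later
        return "ack"

    def drain(self, n=1):
        """Replicate up to n queued asynchronous messages to the DR site."""
        for _ in range(min(n, len(self.pending))):
            self.dr_spool.append(self.pending.popleft())

    def lost_if_site_fails_now(self):
        """Messages that would be missing on the DR site after a failure."""
        return list(self.pending)


site = ActiveSite()
site.publish("payment-1", "sync")
site.publish("log-1", "async")
site.publish("log-2", "async")
print("at risk:", site.lost_if_site_fails_now())
site.drain(1)
print("at risk after one message replicates:", site.lost_if_site_fails_now())
```

The synchronously replicated message is never in the at-risk list, which is the zero-RPO property. The asynchronous messages are at risk only for the short time they wait to be replicated.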
The message flow for this kind of replication is shown in the figure below:
Decision Table for Mode of Replication
It can be hard to decide whether to use sync or async replication. The good news is that the mode can be configured on a per-topic basis. This granularity lets you replicate messages on important topics synchronously, while less important messages that can tolerate some loss are replicated asynchronously.
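A per-topic choice can be pictured as an ordered list of topic patterns, each mapped to a mode. The sketch below is illustrative only: the rule table, the function names, and the simplified wildcard matching (`*` as a trailing wildcard within one level, `>` as the last level matching one or more levels) are assumptions for this example and are not the broker's configuration interface.

```python
def topic_matches(pattern, topic):
    """Simplified hierarchical topic matching.
    '*' at the end of a level matches the rest of that level;
    '>' as the final level matches one or more remaining levels."""
    p_levels, t_levels = pattern.split("/"), topic.split("/")
    for i, p in enumerate(p_levels):
        if p == ">" and i == len(p_levels) - 1:
            return len(t_levels) > i
        if i >= len(t_levels):
            return False
        if p.endswith("*"):
            if not t_levels[i].startswith(p[:-1]):
                return False
        elif p != t_levels[i]:
            return False
    return len(p_levels) == len(t_levels)


# Hypothetical rule table: first matching pattern wins.
REPLICATED_TOPICS = [
    ("payments/>", "sync"),   # cannot afford loss
    ("logs/>", "async"),      # can tolerate some loss
]


def replication_mode(topic, rules=REPLICATED_TOPICS):
    """Return the mode for a topic, or None if this sketch has no rule for it."""
    for pattern, mode in rules:
        if topic_matches(pattern, topic):
            return mode
    return None


for t in ("payments/eu/card", "logs/app1/debug", "metrics/cpu"):
    print(t, "->", replication_mode(t))
```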
The following table will help you to decide which mode of replication should be used for which topics.
Downgrading to Asynchronous When Sync Ineligible
With synchronous replication, if the replication bridge connection is very slow or goes down, the processing of replicated messages and transactions would stop. To keep messages and transactions flowing, the broker by default switches to asynchronous replication when the standby site is unreachable or slow. This behavior avoids a business interruption while the standby site is temporarily unreachable.
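The downgrade can be thought of as a small piece of state attached to the replication link. The sketch below is a hypothetical model, not the broker's logic: the class name, the acknowledgment-delay threshold used to represent "slow", and the return to synchronous mode as soon as the link looks healthy are all assumptions made for illustration.

```python
class ReplicationLink:
    """Tracks whether synchronous replication is currently possible."""

    def __init__(self, max_ack_delay_ms):
        self.max_ack_delay_ms = max_ack_delay_ms  # hypothetical "too slow" threshold
        self.sync_eligible = True

    def observe(self, connected, ack_delay_ms=None):
        """Record the latest health sample for the link to the standby site."""
        self.sync_eligible = bool(
            connected
            and ack_delay_ms is not None
            and ack_delay_ms <= self.max_ack_delay_ms
        )

    def effective_mode(self, configured_mode):
        """Downgrade sync to async while the link is down or too slow."""
        if configured_mode == "sync" and not self.sync_eligible:
            return "async"
        return configured_mode


link = ReplicationLink(max_ack_delay_ms=50)
link.observe(connected=False)
print("standby unreachable:", link.effective_mode("sync"))
link.observe(connected=True, ack_delay_ms=10)
print("standby healthy:", link.effective_mode("sync"))
```

The trade-off is the one described above: while downgraded, publishers keep running, but messages on synchronous topics temporarily carry the same loss risk as asynchronous ones.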
Solace DR at the Message VPN Level
A message VPN is a Solace PubSub+ Event Broker concept that allows many separate applications to share a single Solace PubSub+ Event Broker while remaining independent and separated. Message VPNs enable the virtualization of an event broker into many individual virtual event brokers.
Message VPNs allow for the segregation of topic space and messaging space by creating entirely separate messaging domains. Message VPNs also group clients connecting to a network of Solace PubSub+ event brokers so that messages published within a particular group are only visible to clients that belong to that group. Each client connection is associated with a single Message VPN.
The explanation of Solace DR would not be complete without details of how it works at the Message VPN level. Some of the details are as below:
- Disaster recovery is set up on a per-Message-VPN basis.
- There will be an active message VPN on the primary site. The same message VPN will be in Standby mode in the DR site. Note: This shouldn’t be confused with Primary and Backup roles in High Availability Configuration.
- The Standby message VPN on the DR site will have the same message state as the Active message VPN on the primary site (fully up to date for synchronously replicated messages, possibly slightly behind for asynchronously replicated ones).
- The broker will not allow any clients to connect to the Standby message VPN.
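The points above can be summarised in a small model. This is a sketch with hypothetical class and method names, not a Solace API: each Message VPN exists on both sites, one copy is active, the other is standby, and the standby copy refuses client connections until it is promoted.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    ACTIVE = "active"
    STANDBY = "standby"


@dataclass
class MessageVpn:
    name: str
    site: str
    role: Role

    def accept_client(self, client_id):
        """Clients may only connect to the active copy of the VPN."""
        if self.role is Role.STANDBY:
            raise ConnectionRefusedError(f"{self.name}@{self.site} is standby")
        return f"{client_id} connected to {self.name}@{self.site}"

    def promote(self):
        """Switch this copy to active, as would happen during a DR failover."""
        self.role = Role.ACTIVE


primary = MessageVpn("vpn-payments", "site1", Role.ACTIVE)
dr_copy = MessageVpn("vpn-payments", "site2", Role.STANDBY)

print(primary.accept_client("app-1"))
try:
    dr_copy.accept_client("app-1")
except ConnectionRefusedError as exc:
    print("refused:", exc)
```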
At a very high level, the deployment will be as below:
Is it possible to have Multi-Site Active-Active in PubSub+ Event Broker?
More and more organizations are moving towards a multi-DC Active-Active setup. This has many advantages. For example, the DR datacenter (Site 2) does not sit idle waiting for a complete disaster, which is an infrequent occurrence. Because DR is used so rarely, there have been cases where the team was not well versed in the DR failover procedure, and the operations team was unable to restore the applications within the RTO specified in the SLA.
So, is it possible to have an Active-Active DR setup on Solace PubSub+ Event Broker? Yes, it is! Many Solace customers use this active-active multi-DC setup for mission-critical applications.
Let’s see in detail how the Active-Active configuration works in PubSub+ Event Broker.
The concept of a Message VPN was introduced in the previous section. In the Active-Active scenario, some message VPNs are active on Site 1 and others are active on Site 2. Each site replicates the message state of its active VPNs to the other site. This is depicted in the figure below:
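The cross-replication arrangement can also be written down as data. The following sketch uses hypothetical VPN and site names: it derives, for each message VPN, where it is active, where it is standby, and in which direction it replicates, and then shows which site hosts every VPN after one site is lost.

```python
SITES = ("site1", "site2")

# Hypothetical assignment: each message VPN is active on exactly one site.
ACTIVE_SITE = {"vpn-orders": "site1", "vpn-pricing": "site2"}


def replication_plan(active_site, sites=SITES):
    """For each VPN, derive the standby site and the replication direction."""
    plan = {}
    for vpn, active in active_site.items():
        standby = next(s for s in sites if s != active)
        plan[vpn] = {
            "active": active,
            "standby": standby,
            "replicates": f"{active} -> {standby}",
        }
    return plan


def after_site_loss(active_site, failed, sites=SITES):
    """After failing over, every VPN is active on the surviving site."""
    survivor = next(s for s in sites if s != failed)
    return {vpn: survivor for vpn in active_site}


for vpn, entry in replication_plan(ACTIVE_SITE).items():
    print(vpn, entry)
print("after losing site1:", after_site_loss(ACTIVE_SITE, "site1"))
```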
If we add HA into this mix, the overall diagram will be as below:
DR Failover
The failover to the DR site is often an action that cannot be performed only at the messaging layer. Typically, servers, critical applications, and other infrastructure must be switched as part of the failover. Usually, the failover is a coordinated operation that must be performed by network administrators. It does not happen automatically.
However, in the rare case where the failover must happen automatically, the logic for detecting the failover conditions and for performing the broker’s actual failover can be built into a script.
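As a rough illustration of what such a script might contain, the sketch below triggers a failover only after several consecutive failed health probes, so that a single dropped probe does not cause a site switch. Everything here is hypothetical: `probe` and `trigger_failover` are placeholders for whatever health check and broker action an organization actually uses, and the threshold of three is an arbitrary example.

```python
def should_fail_over(health_samples, required_failures=3):
    """True once the most recent `required_failures` probes all failed."""
    if len(health_samples) < required_failures:
        return False
    return not any(health_samples[-required_failures:])


def monitor(probe, trigger_failover, max_probes, required_failures=3):
    """Run up to `max_probes` probes and fail over at most once.
    A real script would also wait between probes."""
    samples = []
    for _ in range(max_probes):
        samples.append(bool(probe()))
        if should_fail_over(samples, required_failures):
            trigger_failover()
            return True
    return False


# Simulated probe results: healthy once, then three failures in a row.
results = iter([True, False, False, False])
actions = []
failed_over = monitor(lambda: next(results), lambda: actions.append("failover"), max_probes=4)
print("failed over:", failed_over, "| actions:", actions)
```

Because a DR failover usually involves more than the messaging layer, a script like this would normally be one step in a wider, coordinated runbook.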
The figure below shows the details of DR failover:
As the figure shows, because of message replication, all the spooled messages will be available on Site 2. Failover can be manual or can be scripted for automation. The client application is configured with the IP addresses of both sites. Once a failure is detected, the application automatically connects to the IP address of the alternate site, based on the reconnection configuration provided in the application logic.
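The client-side behaviour can be sketched independently of any messaging library. In the example below, `connect` is a hypothetical callable standing in for the real connection call, and the host names are placeholders. The client walks an ordered host list, retries each site a few times, and moves on to the next site when the current one stays unreachable.

```python
import time


def connect_with_failover(hosts, connect, retries_per_host=3, delay_s=0.0):
    """Try each site in order; return (host, session) for the first success."""
    last_error = None
    for host in hosts:
        for _attempt in range(retries_per_host):
            try:
                return host, connect(host)
            except OSError as exc:  # includes ConnectionRefusedError and timeouts
                last_error = exc
                time.sleep(delay_s)
    raise ConnectionError(f"no site reachable: {last_error}")


def fake_connect(host):
    """Stand-in for a real connection call; site 1 is simulated as down."""
    if host == "site1.example.com":
        raise ConnectionRefusedError("site 1 is down")
    return f"session to {host}"


host, session = connect_with_failover(
    ["site1.example.com", "site2.example.com"], fake_connect, retries_per_host=2
)
print("connected via", host, "->", session)
```

Keeping both sites in one ordered list means the application needs no code change or redeployment at failover time; only the retry count and delay need tuning to the expected failover duration.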
Resources
Overview
Setup
- Configuring System-Level Replication Settings
- Configuring VPN-Level Replication Settings
- Configuring Replication with a DMR Network
Management
Best Practices
Conclusion
I hope this post has helped you understand how we do disaster recovery in Solace PubSub+ Event Broker and how you can achieve a multi-DC active-active setup. You may want to check out our Datacenter Replication Overview for more details and configuration steps. If you have any questions, post them to the Solace Developer Community.