There are two key elements of keeping distributed systems up and running: redundancy within a site (called high availability or “HA”) and redundancy across multiple sites (called disaster recovery or “DR”). Solace PubSub+ gives you both without buying, deploying, and integrating third-party tools. I recently introduced Solace PubSub+ Event Broker’s High Availability functionality and will now introduce its disaster recovery capabilities.
Here is the definition of disaster recovery from Wikipedia: Disaster Recovery involves a set of policies, tools, and procedures to enable the recovery or continuation of vital technology infrastructure and systems following a natural or human-induced disaster.
In simple terms, DR is the procedure an organization follows to keep its business processes and systems running when a datacenter goes down because of a natural disaster, network disconnection, or loss of power. Most mission-critical applications in banking, capital markets, and exchanges cannot afford downtime, so it’s important for organizations to invest in a secondary “DR site” that can quickly take over from systems lost in a datacenter-level outage.
Before going into the details of how Solace PubSub+ Event Broker achieves disaster recovery, I would like to introduce two basic terms used in this context: Recovery Time Objective (RTO) and Recovery Point Objective (RPO). In short, RTO is the maximum acceptable time to restore a service after a disruption, and RPO is the maximum acceptable amount of data loss, expressed as a period of time before the disruption.
Now that we have some basics of what disaster recovery is, let’s dive into how it is handled in Solace PubSub+ Event Broker.
Solace PubSub+ provides DR functionality without using complex external mechanisms like storage replication, mirror gateways, or plugins. It can automatically replicate messages and message delivery state from event brokers in the active site to event brokers in the DR site. Messages can be replicated either synchronously or asynchronously, depending on how important they are. For example, messages on a payment topic can be replicated synchronously, while messages on a less critical topic (like log messages) can be replicated asynchronously. Solace PubSub+ can also propagate all configuration changes from the active site brokers to the DR site brokers, so there is no need to coordinate or port configuration changes between the brokers by hand.
The broker can be configured to achieve zero RPO and minimal RTO.
The figure below shows the high-level setup of two brokers in a pair of linked datacenters.
As you can see, a replication link is set up across the two sites. Replication runs over the WAN link between the two datacenters. Both the message state and the message contents are replicated between the two brokers. Only guaranteed messages are replicated. Direct (QoS 0) messages are not replicated, because their quality of service is “at most once” delivery.
Replication can be performed in two modes: synchronous replication and asynchronous replication. The mode can be configured on a per-topic basis.
In synchronous mode, the message is replicated to both datacenters before the acknowledgment is sent to the publisher. This ensures that no message is lost in a disaster, so an organization can achieve zero RPO. However, keep in mind that there is a performance penalty for the publisher in this mode, because the publisher’s message rate depends on the round-trip latency between the two datacenters.
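To make the latency penalty concrete, here is a rough back-of-the-envelope sketch. It assumes a simplified model that is not taken from the Solace documentation: each acknowledgment costs one inter-site round trip plus a local persist time, and the publisher keeps at most `window` unacknowledged messages in flight. The function name and parameters are hypothetical.

```python
def max_sync_publish_rate(rtt_ms: float, persist_ms: float = 0.0, window: int = 1) -> float:
    """Rough upper bound (messages/second) for a publisher limited by ack latency.

    Simplified model (an assumption for illustration): every acknowledgment
    waits for one inter-site round trip plus the local persist time, and the
    publisher may have at most `window` messages awaiting acknowledgment.
    """
    if rtt_ms < 0 or persist_ms < 0 or window < 1:
        raise ValueError("latencies must be non-negative and window must be >= 1")
    ack_latency_s = (rtt_ms + persist_ms) / 1000.0
    if ack_latency_s == 0:
        return float("inf")  # no latency means no latency-imposed bound
    return window / ack_latency_s


if __name__ == "__main__":
    # A 20 ms round trip with one message in flight versus fifty in flight.
    print(max_sync_publish_rate(20.0))
    print(max_sync_publish_rate(20.0, window=50))
```

Under this model, a publisher that waits for every acknowledgment is capped by the round-trip time, while allowing more messages in flight raises the bound proportionally.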
The message flow for this kind of replication is shown in the figure below:
In asynchronous mode, the acknowledgment is sent to the publisher as soon as the message is persisted on the primary site. The message is still sent to the DR datacenter, but the acknowledgment to the publisher does not wait for the acknowledgment from the DR site. As a result, at the time the publisher is acknowledged, the message is guaranteed to be persisted only on the primary site. This mode carries some risk of message loss, but it improves the publisher’s message rate.
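The difference between the two modes comes down to where the publisher acknowledgment sits in the sequence of steps. The toy model below only orders the steps described above; it is an illustration, not broker code, and all names in it are hypothetical.

```python
from enum import Enum
from typing import List


class Mode(Enum):
    SYNC = "sync"
    ASYNC = "async"


ACK_PUBLISHER = "primary acknowledges publisher"
DR_ACK = "DR site acknowledges"


def publish_steps(mode: Mode) -> List[str]:
    """Return the ordered steps for one guaranteed message in the given mode."""
    steps = ["publisher sends message", "primary site persists message"]
    if mode is Mode.SYNC:
        # The publisher is acknowledged only after the DR site has the message.
        steps += ["primary replicates to DR site", DR_ACK, ACK_PUBLISHER]
    else:
        # The publisher is acknowledged first; replication completes afterwards.
        steps += [ACK_PUBLISHER, "primary replicates to DR site", DR_ACK]
    return steps


def acked_before_replicated(mode: Mode) -> bool:
    """True if the publisher can be acknowledged before the DR site has the message."""
    steps = publish_steps(mode)
    return steps.index(ACK_PUBLISHER) < steps.index(DR_ACK)


if __name__ == "__main__":
    for mode in Mode:
        print(mode.value, "->", " | ".join(publish_steps(mode)))
```

The window between the publisher acknowledgment and the DR acknowledgment in asynchronous mode is exactly where a message could be lost if the primary site fails.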
The message flow for this kind of replication is shown in the figure below:
It can be hard to decide whether to use sync or async replication. The good news is that the mode can be configured on a per-topic basis. This granularity lets you replicate messages on important topics synchronously, while less important messages that can tolerate some loss are replicated asynchronously.
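As a rough illustration of per-topic selection, the sketch below maps topic patterns to a replication mode. The policy table, topic names, and function names are hypothetical, and the matcher is a simplified take on Solace-style wildcards (a trailing `*` within a level, and `>` as the last level); it is not the broker's implementation or its configuration syntax.

```python
from typing import List, Optional, Tuple


def topic_matches(pattern: str, topic: str) -> bool:
    """Simplified wildcard match on '/'-separated topic levels.

    Assumptions for this sketch: a level ending in '*' matches any level
    starting with the preceding text, and '>' as the final level matches
    one or more remaining levels.
    """
    p_levels = pattern.split("/")
    t_levels = topic.split("/")
    for i, p in enumerate(p_levels):
        if p == ">" and i == len(p_levels) - 1:
            return len(t_levels) > i  # at least one level must remain
        if i >= len(t_levels):
            return False
        if p.endswith("*"):
            if not t_levels[i].startswith(p[:-1]):
                return False
        elif p != t_levels[i]:
            return False
    return len(p_levels) == len(t_levels)


# Hypothetical policy: first matching pattern wins.
POLICY: List[Tuple[str, str]] = [
    ("payments/>", "sync"),   # cannot afford loss
    ("logs/>", "async"),      # some loss is tolerable
]


def replication_mode(topic: str, policy: List[Tuple[str, str]] = POLICY) -> Optional[str]:
    """Return the mode of the first matching rule, or None if no rule matches."""
    for pattern, mode in policy:
        if topic_matches(pattern, topic):
            return mode
    return None


if __name__ == "__main__":
    for t in ("payments/eu/card", "logs/app1/debug", "metrics/cpu"):
        print(t, "->", replication_mode(t))
```

Keeping the rules in an ordered list makes the precedence explicit, which helps when a critical topic is nested under a broader, less critical pattern.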
The following table will help you to decide which mode of replication should be used for which topics.
With synchronous replication, if the replication bridge connection is very slow or goes down, the processing of replicated messages and transactions would stop. To allow messages and transactions to continue to be processed, the broker by default switches to asynchronous replication when the standby site is unreachable or slow. This behavior helps avoid a business interruption when the standby site is temporarily unreachable.
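The fallback behavior can be pictured as a small decision function. This is a minimal sketch of the idea only: the threshold value, parameter names, and the way slowness is measured are assumptions for illustration and do not reflect how the broker detects a slow or unreachable standby.

```python
from typing import Optional


def effective_mode(
    configured_mode: str,
    standby_reachable: bool,
    ack_latency_ms: Optional[float],
    max_ack_latency_ms: float = 500.0,  # hypothetical threshold
) -> str:
    """Return the replication mode actually in effect for a topic.

    A topic configured as 'sync' is treated as 'async' while the standby
    site is unreachable or too slow to acknowledge, so that publishers
    are not blocked. A topic configured as 'async' is never upgraded.
    """
    if configured_mode not in ("sync", "async"):
        raise ValueError("configured_mode must be 'sync' or 'async'")
    if configured_mode == "async":
        return "async"
    if not standby_reachable:
        return "async"
    if ack_latency_ms is None or ack_latency_ms > max_ack_latency_ms:
        return "async"
    return "sync"


if __name__ == "__main__":
    print(effective_mode("sync", True, 40.0))    # healthy link
    print(effective_mode("sync", False, None))   # standby unreachable
    print(effective_mode("sync", True, 900.0))   # standby too slow
```

Note the trade-off this implies: while a topic is running in the degraded asynchronous state, the zero-RPO guarantee of synchronous replication does not hold for messages published during that period.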
A message VPN is a Solace PubSub+ Event Broker concept that allows many separate applications to share a single Solace PubSub+ Event Broker while remaining independent and separated. Message VPNs enable the virtualization of an event broker into many individual virtual event brokers.
Message VPNs allow for the segregation of topic space and messaging space by creating entirely separate messaging domains. Message VPNs also group clients connecting to a network of Solace PubSub+ event brokers so that messages published within a particular group are only visible to clients that belong to that group. Each client connection is associated with a single Message VPN.
The explanation of Solace DR would not be complete without details of how it works at the Message VPN level. Some of the details are as below:
At a very high level, the deployment will be as below:
More and more organizations are moving towards a multi-DC Active-Active setup, which has several advantages. For example, the DR DC (or Site 2) does not sit idle while waiting for a complete disaster, which is an infrequent occurrence. Because a passive DR site is used so rarely, there have been cases where the team was not well-versed in the DR failover procedure, and the operations team could not restore the applications within the RTO specified in the SLA.
So, is it possible to have an Active-Active DR setup on Solace PubSub+ Event Broker? Yes, it is! And many Solace clients use this active-active multi-DC setup for mission-critical applications.
Let’s see in detail how the Active-Active configuration works in PubSub+ Event Broker.
The concept of a Message VPN was introduced in the previous section. In the Active-Active scenario, some Message VPNs are active on Site 1 and others are active on Site 2. Each site replicates the message state of its active Message VPNs to the other site. This is depicted in the figure below:
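One way to picture this layout is as a table that records which site is active for each Message VPN. The sketch below uses made-up VPN and site names and is only a planning aid, not broker configuration.

```python
from typing import Dict, List

SITES = ("site1", "site2")

# Hypothetical assignment: each Message VPN is active on exactly one site
# and replicates to the other.
ACTIVE_SITE: Dict[str, str] = {
    "vpn_payments": "site1",
    "vpn_orders": "site2",
}


def standby_site(vpn: str) -> str:
    """Return the site that receives replicated state for this VPN."""
    active = ACTIVE_SITE[vpn]
    return SITES[1] if active == SITES[0] else SITES[0]


def vpns_to_activate(failed_site: str) -> List[str]:
    """List the VPNs that must be activated on the surviving site."""
    if failed_site not in SITES:
        raise ValueError(f"unknown site: {failed_site}")
    return sorted(v for v, s in ACTIVE_SITE.items() if s == failed_site)


if __name__ == "__main__":
    for vpn in ACTIVE_SITE:
        print(f"{vpn}: {ACTIVE_SITE[vpn]} -> {standby_site(vpn)}")
    print("activate after losing site1:", vpns_to_activate("site1"))
```

Because each site is active for some VPNs, a site failure only requires failing over the VPNs that were active there; the others keep running untouched.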
If we add HA into this mix, the overall diagram will be as below:
The failover to the DR site is often an action that cannot be performed only at the messaging layer. Typically, servers, critical applications, and other infrastructure must be switched as part of the failover. Usually, the failover is a coordinated operation that must be performed by network administrators. It does not happen automatically.
However, in the rare cases where the failover needs to happen automatically, the logic for detecting the failover conditions and triggering the broker’s failover can be built into a script.
The figure below shows the details of DR failover:
As the figure shows, because of message replication, all the spooled messages will be available on Site 2. Failover can be manual or scripted for automation. The client application is configured with the IP addresses of both sites. Once a failure is detected, the application automatically connects to the IP address of the alternate site, based on the reconnection configuration provided in the application logic.
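The client-side behavior can be sketched as a loop over an ordered host list. This is a generic illustration that does not use the Solace client API: the `connect` callable, the host strings, and the retry parameters are all placeholders for whatever the real messaging API and its reconnection settings provide.

```python
import time
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def connect_with_failover(
    hosts: Sequence[str],
    connect: Callable[[str], T],
    retries_per_host: int = 3,
    delay_s: float = 0.0,
) -> Tuple[str, T]:
    """Try each host in order and return (host, session) for the first success.

    `connect` is a placeholder for the real client connect call and is
    expected to raise ConnectionError when a host cannot be reached.
    """
    for host in hosts:
        for _ in range(retries_per_host):
            try:
                return host, connect(host)
            except ConnectionError:
                time.sleep(delay_s)  # back off before the next attempt
    raise ConnectionError("no site reachable: " + ", ".join(hosts))


if __name__ == "__main__":
    def fake_connect(host: str) -> str:
        # Simulate Site 1 being down and Site 2 being available.
        if host.startswith("site1"):
            raise ConnectionError(host)
        return "session@" + host

    print(connect_with_failover(["site1.example:55555", "site2.example:55555"], fake_connect))
```

Listing the primary site first keeps clients on it while it is healthy, and the per-host retry count controls how long a client waits before moving on to the DR site.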
I hope this post has helped you understand how disaster recovery works in Solace PubSub+ Event Broker and how to achieve a multi-DC active-active setup. You may want to check out our Datacenter Replication Overview for more details and configuration steps. If you have any questions, post them to the Solace Developer Community.