Enterprises spend large amounts of money trying to achieve high availability by eliminating single points of failure. A redundant setup of all the critical components in an infrastructure gives peace of mind and can save millions of dollars in the event of a component failure.
Messaging or data distribution is usually the nervous system of the overall infrastructure, so any failure of the messaging system can affect many applications, processes and customers. Many companies therefore invest heavily in making their messaging infrastructure highly available. Redundancy alone is not enough, though. Other factors also need to be taken into consideration, such as:
- Automatic failover
- No data loss during the failover process
- Fast failover
- Minimal (ideally zero!) impact on the downstream applications
In this article, I intend to explain how high availability is achieved in Solace PubSub+ Event Broker.
Advantages of High Availability in Solace PubSub+ Event Broker
Solace PubSub+ Event Broker has patented technology to provide high availability. Before we go into the details of high availability concepts in PubSub+ Event Broker, let us look at some significant advantages of Solace’s approach to high availability:
- You don’t need third-party software (OS clustering, Zookeeper, etc.) to configure high availability
- Message state and data are always in sync across active and standby brokers
- Since the state is maintained in both instances, failover is fast no matter how many messages are in the spool
- Failover is seamless, with no impact on downstream applications
High Availability Configuration
PubSub+ Event Broker comes in two major form factors: Appliance and Software. Both form factors support an active/standby redundancy model for high availability. PubSub+ Event Broker is also available as a cloud-managed service through PubSub+ Cloud, but since we handle all aspects of availability for that service, I won’t cover it here.
PubSub+ Event Broker: Appliance also provides active/active redundancy for the “direct/QoS0/at most once” messaging pattern. Since a large number of mission-critical applications depend on guaranteed or QoS1 messaging, this article focuses on the Solace active/standby redundancy model for guaranteed messaging.
High Availability in PubSub+ Event Broker: Appliance
Two PubSub+ Event Broker: Appliance brokers can be configured as a redundant pair. If one of the two fails or is taken out of service, its mate automatically takes over. This guarantees 99.999% availability. The two appliances are directly connected by an optical mate-link, which keeps message replication latency as low as possible. All configuration changes made on the primary broker are synced to the backup broker so that both brokers remain identical at all times.
This redundancy is transparent to clients and to other PubSub+ brokers in the network. Only the two brokers paired as mates require explicit configuration to take advantage of this feature. No special configuration is needed on the client host computers apart from setting the reconnection parameters. The client connects to a single IP address only. When a failover happens, the standby appliance takes over the primary IP address using VRRP.
Message Flow in Appliance High Availability Setup
The flow of the message from the publisher and subscriber perspective is as shown in the figure below:
- The publisher publishes the message. The active broker receives the message and persists it in non-volatile RAM on the appliance.
- The message is then replicated to the redundant mate (the backup appliance) and stored in its non-volatile RAM.
- Once the redundant mate confirms receipt:
  - An acknowledgement is sent to the publisher that the message has been received.
  - If there is a subscriber for the message, the message is delivered immediately from RAM.
- If the message has a slow subscriber, it is spooled to the SAN disk asynchronously.
- The message is delivered to the slow subscriber once it is back online and able to process messages.
This mechanism lets PubSub+ Event Broker: Appliance achieve high throughput and low latency, because a fast subscriber always gets the message from RAM. There is no disk in the message path.
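The ordering of these steps can be illustrated with a small simulation. This is only a sketch under my own assumptions, not how the appliance is implemented: the `Broker` class, the `publish` function and the step labels are hypothetical names, dictionaries stand in for non-volatile RAM, and the spool write is shown inline although the appliance performs it asynchronously.

```python
from collections import deque


class Broker:
    """Toy stand-in for one appliance in a redundant pair (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.nvram = {}            # message id -> payload, models non-volatile RAM
        self.disk_spool = deque()  # models the SAN spool used for slow subscribers

    def persist(self, msg_id, payload):
        self.nvram[msg_id] = payload
        return True  # confirmation of receipt


def publish(active, standby, msg_id, payload, subscriber_online):
    """Return the ordered steps taken for one guaranteed message."""
    steps = []
    active.persist(msg_id, payload)
    steps.append("persisted on active")
    if not standby.persist(msg_id, payload):
        return steps  # no acknowledgement until the mate confirms receipt
    steps.append("replicated to mate")
    steps.append("ack to publisher")
    if subscriber_online:
        steps.append("delivered from RAM")
    else:
        # Simplification: the real spool write happens asynchronously.
        active.disk_spool.append(msg_id)
        steps.append("spooled to disk")
    return steps


if __name__ == "__main__":
    active, standby = Broker("active"), Broker("standby")
    print(publish(active, standby, 1, "order-created", subscriber_online=True))
    print(publish(active, standby, 2, "order-shipped", subscriber_online=False))
```

The point of the sketch is the ordering: the publisher is acknowledged only after both brokers hold the message, and the disk is touched only for a subscriber that cannot keep up.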
High Availability in PubSub+ Event Broker: Software
PubSub+ Event Broker: Software can be configured as a high availability triplet (three software brokers configured as an active/hot-standby setup). The third node acts as a quorum or monitoring node. If the active node fails for any reason, the backup node becomes active.
All configuration changes made on the primary broker are synced to the standby broker so that the brokers remain identical at all times. This redundancy is transparent to clients and to other PubSub+ Event Brokers in the network. Only the brokers in the high availability group require explicit configuration to take advantage of this feature. No special configuration is needed on the client host computers apart from setting the reconnection parameters. The active and standby nodes have separate IP addresses, and the client's connection parameters include both IP addresses in the host list. This configuration is simple because it doesn’t require any third-party software. This setup guarantees up to 99.995% availability, although the actual availability also depends on the underlying infrastructure.
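To put the 99.999% (appliance) and 99.995% (software) figures in perspective, it helps to convert them into the downtime they allow per year. The helper below is my own illustration of that arithmetic, with a hypothetical function name, and assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def max_downtime_minutes(availability_percent):
    """Minutes of downtime per year permitted by an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for label, pct in [("appliance", 99.999), ("software", 99.995)]:
        print(f"{label}: {pct}% allows about {max_downtime_minutes(pct):.1f} minutes per year")
```

That works out to roughly 5.3 minutes per year at 99.999% and roughly 26.3 minutes per year at 99.995%.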
Message Flow in Software High Availability Setup
The flow of the message from the publisher and subscriber perspective is as shown in the figure below:
- The publisher publishes the message. The active broker receives the message and persists it to its local disk.
- The message and its state are then replicated over the network to the redundant mate, which stores them on its own local disk.
- Once the redundant mate confirms receipt:
  - An acknowledgement is sent to the publisher that the message has been received.
  - If there is a subscriber for the message, the message is delivered to it immediately.
- The message state is in sync between the active and redundant brokers at all times.
Disk writes are optimized so that software brokers can also achieve high throughput.
Client Connection Configuration for High Availability Setup
To connect to brokers in a high availability configuration, the client application provides a host list (multiple comma-separated IP addresses or hostnames) in its connection parameters. The client tries these entries in turn until one of them succeeds.
For an appliance high availability configuration, the client sees only one IP address (since the redundant broker takes over the IP address during failover). In this case, the connecting application has just one entry in its host list.
For a software high availability configuration, the client has to provide the IP addresses of both the active and redundant brokers.
The reconnect and retry intervals can be configured in the application. Architects and developers can read this blog post for more details on this configuration.
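The host list behaviour can be sketched in a few lines. This is a simplified illustration and not the Solace API: `connect_with_hostlist`, `try_connect`, `retries_per_host` and `retry_wait_s` are hypothetical names of my own, and the real client libraries expose their own connection and reconnection properties.

```python
import time


def connect_with_hostlist(host_list, try_connect, retries_per_host=3,
                          retry_wait_s=1.0, sleep=time.sleep):
    """Try each host in order and return the first one that accepts the connection.

    host_list   -- comma-separated hosts, e.g. "broker-a.example.com,broker-b.example.com"
    try_connect -- hypothetical callable: returns True if the host accepted the connection
    """
    hosts = [h.strip() for h in host_list.split(",") if h.strip()]
    for host in hosts:
        for _attempt in range(retries_per_host):
            if try_connect(host):
                return host
            sleep(retry_wait_s)  # wait before the next attempt
    raise ConnectionError(f"no broker reachable in host list: {hosts}")


if __name__ == "__main__":
    # Pretend the first broker is down and the second is the active one.
    is_up = {"broker-a.example.com": False, "broker-b.example.com": True}
    chosen = connect_with_hostlist(
        "broker-a.example.com, broker-b.example.com",
        try_connect=lambda host: is_up[host],
        sleep=lambda seconds: None,  # skip real waiting in the demo
    )
    print("connected to", chosen)
```

With an appliance pair the list would hold a single entry, because the virtual IP address moves to whichever appliance is active. With a software triplet it would hold the active and standby broker addresses.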
Failover Scenario
Appliance
If an appliance fails, its mate detects the failure automatically, within a few seconds. During the failover, the standby broker takes over the primary IP address. The standby broker has exactly the same message state as the active broker, so the failover is seamless to the client application. The API sees only a brief network interruption, reconnects, and message processing continues as normal.
Software
If the primary software broker fails, the quorum node detects the failure and makes the hot standby broker active. Having three brokers working together ensures that there is no split brain during a failover. Bear in mind that at least two of the three nodes must be in contact to provide the failover service (it doesn’t matter which two). If only two nodes are working, the messaging service continues, but no further failover is possible. The standby broker has exactly the same message state as the active broker, so the failover is seamless to the client application. The API sees only a brief network interruption, reconnects, and message processing continues as normal.
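The two-out-of-three rule can be expressed as a short decision function. This is a deliberately simplified model written for illustration, not the broker's actual election logic: the node names and both function names are my own assumptions.

```python
NODES = {"primary", "backup", "monitor"}


def has_quorum(nodes_in_contact):
    """True when at least 2 of the 3 nodes in the group can see each other."""
    return len(NODES & set(nodes_in_contact)) >= 2


def backup_should_take_over(nodes_seen_by_backup):
    """Decide whether the backup may become active.

    nodes_seen_by_backup -- nodes the backup is in contact with, including itself.
    The backup takes over only if the primary is gone AND it still has quorum,
    which is what prevents split brain when the backup is merely isolated.
    """
    seen = set(nodes_seen_by_backup)
    return "primary" not in seen and has_quorum(seen)


if __name__ == "__main__":
    print(backup_should_take_over({"backup", "monitor"}))  # primary lost, quorum held
    print(backup_should_take_over({"backup"}))             # backup isolated
```

In this model, a backup that can still reach the monitoring node takes over when the primary disappears, while a backup that has lost contact with both other nodes stays on standby because it cannot tell a failed primary from its own network isolation.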
Conclusion
In this article, I have explained in reasonable detail how high availability works in Solace PubSub+ Event Broker, and I hope it has helped you understand the approach. You can read our docs for configuration details. If you have any questions, post them to the Solace Developer Community.
In the next post in this series, I will cover how we achieve disaster recovery in Solace PubSub+ Event Broker, the concepts of RTO and RPO, and how to achieve a multi-DC active-active setup.