What Developers Need to Know About Solace High Availability and Disaster Recovery

An incident at one of our investment banking customers prompted me to write this blog post. Solace was recently called in to help when one of their applications silently died. Solace has made High Availability (HA) and Disaster Recovery (DR) simple and built them into the product itself. The objective of this post is to help you understand those capabilities and how to set them up, including the configuration of reconnection attempts and timeouts.

The examples in this post use Java. If you’re using another programming language, refer to the appropriate user manual.

What is HA and DR?

  • High availability (HA) Two Solace message routers can be configured as a redundant pair, so that if one of the two is taken out of service or fails, its mate automatically takes over responsibility for its clients. This redundancy is largely transparent to clients and other Solace message routers in the network. Only the two Solace message routers that are paired as mates require explicit configuration to take advantage of the feature. No special configuration is needed on client host computers, apart from the reconnection parameters explained later in this article.
  • Disaster recovery (DR) lets a Solace-based system recover gracefully from a catastrophic event that renders an entire datacenter unreachable. It does so by replicating messages, their state, and configuration information to a pair of backup message routers. When Solace message routers at two different sites are configured to replicate traffic from one VPN to another and one of the two sites is compromised or completely down, the second site can take over responsibility for the clients associated with the primary site, either manually or via scripting. See the core concepts discussion to learn more.

Host Lists & How to Use Them

The “HOST” property is used by the application to specify the IP address (or host name) of the Solace message router to connect to. A host entry has the following form:

[Protocol:]Host[:Port]

Protocol is the protocol used for the transport channel. The valid values are:

  • tcp Use a TCP channel for communications between the application and its peers. If no protocol is set, tcp is used as a default.
  • tcps Use a TLS channel over TCP for communications between the application and its peers. Encryption with compression is not supported.

JCSMPProperties properties = new JCSMPProperties();
properties.setProperty(JCSMPProperties.HOST, "tcp:10.20.30.1");
…
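
To make the [Protocol:]Host[:Port] form concrete, here is a minimal sketch that splits a single host entry into its parts. HostEntry is a hypothetical helper written for this post, not part of the Solace API, and it assumes the host is a host name or IPv4 address (so it contains no colon).

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of the Solace API) that splits one host entry into its parts. */
final class HostEntry {
    // [Protocol:]Host[:Port], where Protocol is tcp or tcps.
    // Assumption: Host is a host name or IPv4 address, so it contains no colon.
    private static final Pattern FORMAT =
            Pattern.compile("^(?:(tcp|tcps):)?([^:\\s]+)(?::(\\d{1,5}))?$");

    final String protocol;        // "tcp" when the entry omits the protocol
    final String host;
    final Optional<Integer> port; // empty when the entry omits the port

    private HostEntry(String protocol, String host, Optional<Integer> port) {
        this.protocol = protocol;
        this.host = host;
        this.port = port;
    }

    static HostEntry parse(String entry) {
        Matcher m = FORMAT.matcher(entry.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a valid host entry: " + entry);
        }
        // tcp is the default protocol when none is given.
        String protocol = (m.group(1) == null) ? "tcp" : m.group(1);
        Optional<Integer> port = (m.group(3) == null)
                ? Optional.<Integer>empty()
                : Optional.of(Integer.parseInt(m.group(3)));
        return new HostEntry(protocol, m.group(2), port);
    }
}
```

A check like HostEntry.parse(conf.getHost()) at startup would reject a malformed entry before the application ever tries to connect.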

With Solace Guaranteed Messaging, you deploy Solace message routers in HA pairs, which appear as a single HOST to the client applications. You therefore have only one IP address, and only one host entry is required in the session HOST property.

For DR scenarios, the host list feature of the Solace messaging APIs provides messaging clients with the IP addresses or host names of the Solace message routers in both of the Replication sites. This enables clients to successfully fail over to a disaster recovery site. By default, only a Solace message router with Message VPNs that have a Replication active state will allow the clients to connect. So during a temporary loss of connectivity to the routers at one Replication site, client applications won’t inadvertently connect to the routers at the other site as they traverse the host list while attempting to reestablish a connection.

Multiple host entries (up to four) separated by commas are allowed. With multiple entries, each is tried in turn until one succeeds.

JCSMPProperties properties = new JCSMPProperties();
properties.setProperty(JCSMPProperties.HOST, "10.20.30.1:55555,10.20.30.2:55555");
…

When a connection is attempted, the API first attempts to connect to 10.20.30.1. If that connection fails for any reason, it attempts to connect to 10.20.30.2. This process continues until every entry in the host list has been attempted.

After each entry has been attempted, if all fail, the channel properties ConnectRetries, ReconnectRetries, and ReconnectRetryWaitInMillis determine the behavior of the API. If ConnectRetries is anything other than zero, the API waits for the amount of time set for ReconnectRetryWaitInMillis, then starts its connection attempts again from the beginning of the list. When traversing the list, each entry is attempted the number of times set for the ConnectRetriesPerHost property + 1.

If an established session to any host in the list fails, when ReconnectRetries is non-zero, the API automatically attempts to reconnect, starting at the beginning of the list.
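
The traversal order described above can be sketched as a toy model. This is not the API's actual implementation. HostListTraversal and its passes parameter (how many times the whole list is walked) are hypothetical names introduced for illustration, and the waits between attempts are left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the host-list traversal described above; not the API's actual implementation. */
final class HostListTraversal {
    /**
     * @param hosts                 host entries in the order given in the HOST property
     * @param passes                how many times the whole list is walked (hypothetical parameter)
     * @param connectRetriesPerHost retries per host; each host is tried this many times + 1 per pass
     * @return the hosts in the order they would be attempted if every attempt failed
     */
    static List<String> attemptOrder(List<String> hosts, int passes, int connectRetriesPerHost) {
        List<String> attempts = new ArrayList<>();
        for (int pass = 0; pass < passes; pass++) {
            for (String host : hosts) {
                // One initial attempt plus connectRetriesPerHost retries before moving on.
                for (int attempt = 0; attempt <= connectRetriesPerHost; attempt++) {
                    attempts.add(host);
                }
            }
            // Per the text above, the API waits ReconnectRetryWaitInMillis
            // before starting again from the beginning of the list.
        }
        return attempts;
    }
}
```

With the two hosts from the earlier example, two passes, and ConnectRetriesPerHost set to 1, the model tries 10.20.30.1 twice, then 10.20.30.2 twice, then repeats the same sequence.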

Notes:

  • There is a small possibility that under high traffic rates or unfortunate timing of a switch-over to the standby site, some messages could be duplicated following the switch-over. Applications that cannot tolerate duplicate message delivery under any scenario should implement application-layer mechanisms (for example, globally-unique message IDs) to detect duplicate message delivery.
  • When a Message VPN that has a Replication active state is switched to Replication standby, all active clients are disconnected.
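
As an illustration of the application-layer duplicate detection mentioned in the notes, here is a minimal sketch. It assumes the publisher stamps every message with a globally-unique ID that the consumer can read; how that ID is carried in the message is left open. DuplicateDetector is a hypothetical class, and the bounded window size is an assumption you would tune to your traffic.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Remembers the most recent message IDs so redelivered messages can be skipped. */
final class DuplicateDetector {
    private final Map<String, Boolean> seen;

    DuplicateDetector(final int capacity) {
        // Insertion-ordered map that evicts the oldest ID once capacity is exceeded,
        // so memory stays bounded no matter how long the consumer runs.
        this.seen = new LinkedHashMap<String, Boolean>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > capacity;
            }
        };
    }

    /** Returns true the first time an ID is seen, false for a duplicate still inside the window. */
    synchronized boolean firstTimeSeen(String messageId) {
        return seen.put(messageId, Boolean.TRUE) == null;
    }
}
```

A consumer would call firstTimeSeen(id) for each received message and skip processing when it returns false. A duplicate that arrives after its ID has been evicted from the window is not detected, so the capacity should comfortably exceed the number of messages that could be in flight during a switch-over.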

Customizing Reconnection Retries and Timeouts

Before configuring your reconnection and timeout settings, you should have a solid understanding of the JCSMPChannelProperties class, which holds the set of properties required to create a channel connection with Solace routers.

An application must have the following reconnection properties set correctly so that the Solace APIs can automatically reestablish the connection with the Solace message router. It is therefore important to understand how each of these properties is used.

reconnectRetries

The value of this property corresponds to the number of times the APIs should attempt to reconnect to the Solace message router (or the list of Solace message routers) after the initial connected session goes down.

The default value for this property is 3, which means the APIs will automatically attempt to reconnect 3 times before giving up. Valid values are >= -1. A value of -1 means “retry forever”, which is rarely a good setting because detecting a failure is better than trying to connect indefinitely. A value of 0 means no automatic reconnection retries (that is, try once and give up).

JCSMPProperties properties = new JCSMPProperties();
properties.setProperty(JCSMPProperties.HOST, conf.getHost());
…
// Channel properties
JCSMPChannelProperties cp = (JCSMPChannelProperties) properties
        .getProperty(JCSMPProperties.CLIENT_CHANNEL_PROPERTIES);
…
cp.setReconnectRetries(5);
…

connectRetries

The Connect Retries property sets the number of times to retry to establish an initial connection for a Session to a host router. For example, setting the connect retries value to 3 in the Java API results in a maximum of three connection attempts: the initial attempt and two retries.

Valid values are >= -1. Zero means no automatic connection retries (that is, try once and give up). -1 means “retry forever”.

JCSMPProperties properties = new JCSMPProperties();
properties.setProperty(JCSMPProperties.HOST, conf.getHost());
…
// Channel properties
JCSMPChannelProperties cp = (JCSMPChannelProperties) properties
        .getProperty(JCSMPProperties.CLIENT_CHANNEL_PROPERTIES);
…
cp.setConnectRetries(5);
…

reconnectRetryWaitInMillis

The value of this property corresponds to the number of milliseconds to wait between each attempt to connect or reconnect to a host. If a connect or reconnect attempt to a host is not successful, the API waits for the amount of time set for reconnectRetryWaitInMillis, and then makes another connect or reconnect attempt.

The default value for this property is 3000, which means by default, the APIs will wait for 3 seconds between each attempt to connect/reconnect to a host. Valid values are 0 – 60000.

Note that connectRetriesPerHost sets how many connection or reconnection attempts can be made before moving on to the next host in the list.

JCSMPProperties properties = new JCSMPProperties();
properties.setProperty(JCSMPProperties.HOST, conf.getHost());
…
// Channel properties
JCSMPChannelProperties cp = (JCSMPChannelProperties) properties
        .getProperty(JCSMPProperties.CLIENT_CHANNEL_PROPERTIES);
…
cp.setReconnectRetryWaitInMillis(3000);
…

connectRetriesPerHost

The value of this property corresponds to the number of times a connection or reconnection to a single host will be retried before moving to the next host in the list.

The default value for this property is 0, which means the APIs will make a single connection attempt per host. Valid values are >= -1. A value of -1 means an infinite number of retries, so the API will only ever try to connect or reconnect to the first host listed. Note that this property works in conjunction with the connect and reconnect retries settings; it does not replace them.

JCSMPProperties properties = new JCSMPProperties();
properties.setProperty(JCSMPProperties.HOST, conf.getHost());
…
// Channel properties
JCSMPChannelProperties cp = (JCSMPChannelProperties) properties
        .getProperty(JCSMPProperties.CLIENT_CHANNEL_PROPERTIES);
…
cp.setConnectRetriesPerHost(20); 
…

Reconnection Logic – Beyond the default settings

When using HA redundant Solace message router pairs, a failover from one Solace message router to its mate typically completes within seconds, but applications should keep attempting to reconnect for at least five minutes. To allow for a reconnect duration of five minutes with HA redundant Solace message routers, set the following property values:

JCSMPProperties properties = new JCSMPProperties();
properties.setProperty(JCSMPProperties.HOST, conf.getHost());
…
// Channel properties
JCSMPChannelProperties cp = (JCSMPChannelProperties) properties
        .getProperty(JCSMPProperties.CLIENT_CHANNEL_PROPERTIES);
cp.setConnectRetries(1);
cp.setReconnectRetries(5); 
cp.setReconnectRetryWaitInMillis(3000);
cp.setConnectRetriesPerHost(20);
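
The five-minute figure follows from multiplying the three values above. Here is a back-of-the-envelope sketch of that arithmetic, assuming a single host entry and one wait of ReconnectRetryWaitInMillis per retry, and ignoring the time each connection attempt itself takes. ReconnectWindow is a hypothetical helper, and the API's exact timing may differ from this rough model.

```java
/** Rough estimate of how long the API keeps retrying; not an exact model of the API's timing. */
final class ReconnectWindow {
    /**
     * Assumptions: a single host entry, one wait per retry, and the time spent
     * inside each connection attempt is ignored.
     */
    static long estimateMillis(int reconnectRetries, int connectRetriesPerHost,
                               int reconnectRetryWaitInMillis) {
        // Widen to long before multiplying so large settings cannot overflow an int.
        return (long) reconnectRetries * connectRetriesPerHost * reconnectRetryWaitInMillis;
    }
}
```

For the settings shown, estimateMillis(5, 20, 3000) gives 300000 ms, which is five minutes. Running the same calculation on your own values before deploying is a quick way to check that the retry window is long enough to ride out a failover.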

Summary

In the case I mentioned above, the customer’s application had been configured with incorrect session reconnect properties so the application died silently after just a few reconnection attempts. Unfortunately the application had no logging and no one was monitoring its health, so it went unnoticed. (This highlights the importance of monitoring applications via logs or other mechanisms which you can learn about here.)

If you’re a developer, architect or QA person responsible for leveraging, setting up or testing HA and DR within a Solace environment, I recommend you go through product documentation to fully understand the relevant features and functions. Here are some links to get you started:


Edits

  • 8/23/16: Clarified behaviour of the API during initial connection establishment.