The ability to replay previously-published data to some set of consuming applications is a common requirement of many enterprise messaging and data movement backbones. Often, people think about replay in narrow terms: they look at product capabilities and fit their needs into the functionality offered, instead of thinking holistically about the actual, and frequently multiple, replay requirements of their business.
For example, I have heard discussions around using the Kafka broker as an enterprise message replay server because of its ability to replay messages. Apache Kafka is a write-ahead commit log server that has been successful at aggregating typically low-value data for ingestion into big data solutions. While Kafka technically supports replay, its capabilities cover only a subset of typical needs, and you need to build tooling around it to make it useful.
In this blog I will discuss the many use cases and requirements for replay in an enterprise, and compare Apache Kafka’s capabilities to those of a purpose-built enterprise replay server, specifically Tradeweb’s Replay for Messaging, for which Solace has a tight integration in the Replay for Solace edition.
If you want to “cut to the chase” and just see the comparison, jump to here.
A replay service can be a key component in a messaging architecture where it is not feasible to ask the message source (producer) to re-publish lost or missing messages, so that one or more downstream consumers can back-fill the gaps themselves. Replay servers are also commonly used in QA or UAT to replay production data into a test environment at controlled message rates, as well as in other data test scenarios.
Here are some typical example use cases for message replay:

- Back-filling a consumer that has lost or missed messages, when the producer cannot re-publish them
- Replaying production data into a QA or UAT environment at controlled message rates
- Stress and negative testing of consumer applications using replayed traffic
In all the above high-level cases, the requirements of the replay server seem simple and not too onerous: receive messages from a message broker, persist them to disk, then replay them to clients upon request. But as with most things, the details of the requirements reveal where the architectural trade-offs make a difference.
The primary requirements for data capture are that it be lossless and preserve message order across the defined data sets. Lossless means that the failure of any single component cannot cause message loss. Achieving lossless capture also likely requires the replay server to scale horizontally so it can keep up with the ingress message rate while servicing replay requests. This leads to the second point: the message traffic must be divisible in a flexible enough fashion that you can easily build shards of data that each contain order-dependent messages.
To properly filter and retrieve messages for republishing, the replay server needs to be publish-time and message-ID aware. This allows replay based on something meaningful to the user, such as time frames, for example: replay all messages from the beginning of day until now. A message-ID-aware replay server can also replay specific messages or exclude specific messages. Being able to replay from a certain byte offset in a stream, by contrast, is not particularly useful to users.
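The selection logic described above can be sketched in a few lines. This is a minimal illustration, not any vendor's actual API; the `CapturedMessage` record and `select_for_replay` helper are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical captured-message record; field names are illustrative only.
@dataclass
class CapturedMessage:
    message_id: str
    publish_time: datetime
    payload: bytes

def select_for_replay(messages, start, end, exclude_ids=frozenset()):
    """Select messages published in [start, end), skipping excluded IDs,
    while preserving the original capture (publish) order."""
    return [
        m for m in messages
        if start <= m.publish_time < end and m.message_id not in exclude_ids
    ]

# Example: replay everything from 01:00 to 03:00, excluding message "2".
day = datetime(2024, 1, 2, tzinfo=timezone.utc)
msgs = [CapturedMessage(str(i), day.replace(hour=i), b"") for i in range(3)]
picked = select_for_replay(
    msgs, day.replace(hour=1), day.replace(hour=3), exclude_ids={"2"}
)
```

Note that the selection keys on publish time and message ID, the two attributes the text identifies as meaningful to users, rather than on a byte offset.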
Along with being topic, time and message aware, there is often a need to further inspect messages for header fields that add context to a message, for example: replay all messages related to a specific transaction. This content-aware, application-specific filtering may require “and|or|not” logic, such as: replay all messages related to a series of transactions that were produced by a specific publisher.
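The “and|or|not” logic over header fields can be modelled as composable predicates. This is an illustrative sketch, not any product's filter syntax; the header names (`txn`, `publisher`, `retransmit`) are made up for the example.

```python
# Minimal boolean filter combinators over message header dictionaries.
def field_equals(name, value):
    return lambda headers: headers.get(name) == value

def all_of(*preds):   # "and"
    return lambda headers: all(p(headers) for p in preds)

def any_of(*preds):   # "or"
    return lambda headers: any(p(headers) for p in preds)

def negate(pred):     # "not"
    return lambda headers: not pred(headers)

# Example: replay messages for transactions TX-1 or TX-2 produced by
# publisher "app-A", excluding retransmissions.
wanted = all_of(
    any_of(field_equals("txn", "TX-1"), field_equals("txn", "TX-2")),
    field_equals("publisher", "app-A"),
    negate(field_equals("retransmit", True)),
)
```

A replay server exposing this kind of composition lets users express "a series of transactions from a specific publisher" without writing custom consumer-side filtering code.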
Once the correct replay filtering has been applied, it is often necessary to control replay rates as well as destinations. The overall network, or the consumer applications, might not be engineered for full-rate replay, which could amount to a 2x increase in network traffic, so rate limiting of replays is valuable and helps network engineers understand the total impact of replay traffic during production. Replayed messages are also often directed into a new destination set up specifically for that replay, so they do not interfere with the ongoing original traffic; in these cases there may be a requirement to replay traffic as fast as possible, or at a controlled rate independent of the original rate. Replay for test purposes adds further requirements: replay at the original publish rate, or at 2x or 5x that rate to simulate real-time burst behavior for system stress testing; the ability to corrupt messages to generate negative test conditions in consumer applications; or the ability to delay live traffic by a specific amount of time, or to exclude certain data.
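The rate-control requirement can be sketched as a pacing loop that scales the original inter-message spacing by a speed factor and optionally enforces a hard cap. This is a simplified illustration, assuming timestamps in seconds; a real replay server would also handle pause, cancel and backpressure.

```python
import time

def replay_at_rate(messages, publish, speed=1.0, max_msgs_per_sec=None):
    """Replay (publish_time_seconds, payload) pairs through `publish`.

    speed=2.0 replays at twice the original publish rate; max_msgs_per_sec
    is an optional hard cap so network engineers can bound replay traffic.
    Sketch only: error handling and pause/cancel control are omitted.
    """
    floor_gap = 1.0 / max_msgs_per_sec if max_msgs_per_sec else 0.0
    prev = None
    for t, payload in messages:
        if prev is not None:
            # Preserve original spacing, scaled by speed, but never go
            # faster than the configured per-second cap.
            time.sleep(max((t - prev) / speed, floor_gap))
        publish(payload)
        prev = t

# Example: replay three captured messages at 100x the original rate.
out = []
replay_at_rate([(0.0, b"a"), (0.01, b"b"), (0.02, b"c")],
               out.append, speed=100.0)
```

Decoupling the pacing policy from message selection is what allows the same captured data to serve both a gentle production back-fill and a 5x stress test.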
There are several key resources that need to be monitored within a replay system. Eventing needs to be in place to indicate that the replay server is falling behind real-time traffic before conditions such as publisher backpressure or message loss occur. There also needs to be health monitoring of individual message consumption flows, as well as disk health checks that fire before issues such as disk overflow occur.
There also needs to be replay job monitoring that captures the start and end times of the job, job progress, average message throughput, average publish time, average message size and other relevant statistics, along with pause/restart/cancel control of running replays.
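The job statistics listed above could be tracked with a small accumulator like the following. The field names mirror the metrics discussed in the text; the class itself is an illustrative sketch, not any product's monitoring schema.

```python
from dataclasses import dataclass, field
import time

@dataclass
class ReplayJobStats:
    """Illustrative per-job replay statistics tracker."""
    total_messages: int
    replayed: int = 0
    bytes_sent: int = 0
    start_time: float = field(default_factory=time.monotonic)

    def record(self, message_bytes: int) -> None:
        """Call once per replayed message with its size in bytes."""
        self.replayed += 1
        self.bytes_sent += message_bytes

    @property
    def progress(self) -> float:
        return self.replayed / self.total_messages if self.total_messages else 1.0

    @property
    def avg_message_size(self) -> float:
        return self.bytes_sent / self.replayed if self.replayed else 0.0

    @property
    def avg_rate_msgs_per_sec(self) -> float:
        elapsed = time.monotonic() - self.start_time
        return self.replayed / elapsed if elapsed > 0 else 0.0

# Example: a job of 4 messages, two replayed so far.
stats = ReplayJobStats(total_messages=4)
stats.record(100)
stats.record(300)
```

Exposing these counters through an admin API is what makes the pause/restart/cancel controls actionable: an operator can see progress and throughput before deciding to intervene.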
It must not be possible to defeat access control on real-time data by simply requesting a replay of that data. Unless otherwise defined, the same restrictions that apply to real-time data must apply to the replayed data, so it is important that the replay server be integrated into your overall data access authorization strategy. This means any programmatic requests must come from authenticated users, and all replays must be subject to the messaging system’s access control mechanisms.
The replay server is a message store, so it should assist in actual message management. As previously mentioned, you will likely need to search for messages by time and message ID; the user interface should also allow you to delete or modify specific messages. The ability to view complete messages in the UI also helps when building filters based on additional header properties.
Now that we understand some standard use cases and requirements, let’s look at a feature comparison chart to see how well Tradeweb’s ReplayService for Solace meets these requirements versus Apache Kafka. The reason to do this comparison is that Kafka is popular in the big data messaging space, and developers are looking to this technology to solve messaging problems in the enterprise, including message broker and message replay work.
Legend:

- Supported by the replay server
- Supported with additional tools, or requires custom code written by the customer

| Requirement | Tradeweb ReplayService for Solace | Apache Kafka |
|---|---|---|
| Capture from production, replay into UAT | | |
| Able to efficiently use large amounts of cheap storage | | |
| Lossless data capture | | |
| Data captured in publish order | | |
| Programmatic replay with topic filtering | | |
| Programmatic replay with hierarchical topic filtering | | |
| Programmatic replay with header field filtering | | |
| Programmatic replay with “and\|or\|not” header filtering | | |
| Programmatic replay with publish time filtering | | |
| Programmatic replay to new directed destinations | | |
| Handle multiple replay requests at the same time | | |
| Alert when ingress queues hit warning and critical levels | | |
| Replay job monitoring | | |
| Average replay rate | | |
| Administration UI to inspect messages | | |
| Administration UI to search/remove messages | | |
| Administratively control replay destination | | |
| Administratively control replay rates | | |
| Require authentication for programmatic replay | | |
| Apply access control rules to publish and receipt of replay data | | |
Apache Kafka is popular in use cases where consumers require all the data from the producer, unfiltered, such as big data ingestion. These servers are popular because they make efficient use of disk I/O thanks to the sequential nature of data access. But the requirement for replay in modern data movement systems is much larger than simple log replay, and requires a full set of access controls, filtering and monitoring of data movement similar to what is provided by the Tradeweb ReplayService.
Ken Barr is a Senior Product Integration Architect working with the Solace CTO group. He's focused on exploring areas in which our customers would benefit from Solace innovation, then defining how these new technologies fit into Solace’s product lines.
Prior to joining Solace, Ken was a Technical Lead at Cisco Systems holding several roles in carrier core routing business units, including QA team lead for initial implementations of IPv6 and bringing the next generation IOS, called IOS-XR, to the core routing platforms. Preceding Cisco, Ken was a Communications Electronics Engineering Officer for the Royal Canadian Air Force, responsible for operational management of the National Defence Headquarters Metropolitan Area Network.