Back in the early days of market data systems, 10 MbE networks were the standard. Multicast from the publisher to the subscriber was a clever way to optimize bandwidth to trading desks, because there just wasn’t enough bandwidth to send a separate copy of the data to each trader. The emergence of 100 MbE and 1 GigE networks resolved some of the bandwidth issues, but the delivery bottlenecks just shifted to the software middleware, so multicast lives on.
But multicast has a dark side. If you want to see any Wall Street or internet infrastructure architect get worked up, ask them about multicast storms. Multicast storms happen when application participants request retransmits of information they have missed in the multicast stream. There are two common causes of multicast storms: slow consumers, and speed mismatches between network links.
As market data rates accelerate and trade volumes go through the roof, many people are counting on 10 GigE to bail them out just as things get really hairy. Unfortunately, the migration to 10 GigE won’t be immediate, and it won’t be universal, so the emergence of 10 GigE will actually exacerbate the second common cause of multicast storms. In the context of those increasing market data rates and trade volumes, some predict a “perfect storm” in the world of trading systems.
If you’re pretty sure you grok the problems with multicast, you can skip to the solution below. Otherwise, here’s a little more context on the causes before I discuss the solution.
A slow-consuming application can’t keep up with the incoming stream, so multicast packets are lost before the application consumes them, and it has to request retransmission. This puts extra burden on the publisher, which must keep publishing new data while also retransmitting the missed data. If there are just a few slow consumers, this is usually no big deal.
Most trading floor networks have many layer 2 links between the publisher and subscriber. The network architecture may use 10 GigE networks near the information source (market data feeds), then have a 1 GigE LAN for the trading floor, maybe a 100 MbE LAN for back office systems and a 10 MbE or slower network over the WAN. Whenever a burst of traffic from the 10 GigE source, let’s say 2 gigs of traffic, hits a transition to a slower link, the layer 2 switch has to absorb the burst and feed it onto the slower link as quickly as it can. Generally, this results in a bunch of packet loss because the switch doesn’t have enough buffering capacity. All the subscribers on the other side of that switch will soon recognize that they missed a bunch of traffic and will send retransmit requests back to the publisher. This results in an increase in multicast traffic from the publisher, which causes more bursts, which causes more packet loss, which causes more retransmit requests, which causes more retransmits, and the circle goes on. Soon the majority of traffic is related to retransmissions: a classic multicast storm.
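To put rough numbers on that burst, here’s a back-of-the-envelope sketch in Python. It assumes “2 gigs” means 2 gigabits and that the switch has 16 MB of buffer available for the slower port. That buffer size and the function name `burst_loss_bits` are made up for illustration; they don’t come from any particular switch.

```python
def burst_loss_bits(burst_bits, ingress_bps, egress_bps, buffer_bits):
    """Estimate how many bits are dropped when a burst crosses a
    fast-to-slow link through a switch with a finite buffer.

    Simplified model: the burst arrives at the full ingress rate, the
    egress link drains at its full rate for the duration of the burst,
    and whatever does not fit in the buffer is lost.
    """
    if ingress_bps <= egress_bps:
        return 0  # no speed mismatch, so nothing piles up

    # Bits the slow link manages to forward while the burst is arriving
    # (integer arithmetic keeps the result exact).
    drained_bits = burst_bits * egress_bps // ingress_bps

    # Everything else has to sit in the buffer; the overflow is dropped.
    backlog_bits = burst_bits - drained_bits
    return max(0, backlog_bits - buffer_bits)


# Assumed example: 2 gigabit burst, 10 GigE in, 1 GigE out, 16 MB buffer.
dropped = burst_loss_bits(
    burst_bits=2_000_000_000,
    ingress_bps=10_000_000_000,
    egress_bps=1_000_000_000,
    buffer_bits=16 * 8 * 1_000_000,
)
print(f"dropped {dropped / 1e9:.2f} gigabits")
```

With those assumed numbers, about 1.67 of the 2 gigabits never make it across the link, and every subscriber behind that switch will be asking for them again.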
Adding insult to injury, many trading applications require messages to be received in order, which means if a message is missed, that application has to buffer all other incoming messages until the missing messages can be recovered. Let’s say it takes a few hundred milliseconds to get the retransmitted message. Every client waiting for that missing message then has to buffer all incoming messages. If they can’t, you get a ton of message loss. And even if they can, they have to process potentially thousands of messages before they can catch up to the incoming stream of data. Guess what they call that? Yep, that’s a slow consumer, sometimes a whole network full of slow consumers — which can cause message loss, which causes more retransmit requests, which causes more multicast traffic, which balloons latency, which eventually brings down entire trading floors.
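Here’s a minimal sketch of what that in-order logic looks like on the receiving side, assuming every message carries a sequence number that starts at 1. The class and method names are hypothetical and not taken from any particular messaging API.

```python
class OrderedReceiver:
    """Deliver messages strictly in sequence order, holding back
    anything that arrives after a gap until the gap is filled."""

    def __init__(self, first_seq=1):
        self.next_seq = first_seq  # next sequence number we can deliver
        self.pending = {}          # seq -> payload, held back behind a gap

    def on_message(self, seq, payload):
        """Return (payloads now deliverable in order,
                   sequence numbers currently missing)."""
        if seq < self.next_seq or seq in self.pending:
            return [], []  # duplicate or already delivered: ignore it

        self.pending[seq] = payload

        # Drain everything that is now contiguous with what we delivered.
        delivered = []
        while self.next_seq in self.pending:
            delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1

        # Anything between next_seq and the highest buffered message that
        # we do not hold is a gap we would request a retransmit for.
        highest = max(self.pending, default=self.next_seq)
        missing = [s for s in range(self.next_seq, highest)
                   if s not in self.pending]
        return delivered, missing


receiver = OrderedReceiver()
print(receiver.on_message(1, "a"))  # delivered immediately
print(receiver.on_message(3, "c"))  # held back, 2 is missing
print(receiver.on_message(4, "d"))  # still held back
print(receiver.on_message(2, "b"))  # gap filled, backlog released at once
```

Notice that when message 2 finally shows up, the application gets three messages dumped on it in one go. Scale that backlog up to thousands of messages and you can see how a healthy consumer turns into a slow one.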
Get the picture? When a multicast storm strikes, trading firms lose perfectly good money, and good architects and operations staff lose perfectly good jobs. The bad news is that as firms that use multicast begin the upgrade to 10 GigE, they could face more multicast storms because inevitably they won’t upgrade the entire network at once. This will result in many network speed mismatches, while inviting more bandwidth use near the information sources. At a 10 GigE to 1 GigE junction, you can lose a lot of packets in a very short time given the 10x speed mismatch.
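For a sense of just how short, here’s a rough calculation of how long a switch buffer survives a sustained burst at that junction. The 16 MB port buffer is a hypothetical figure and `seconds_until_overflow` is just an illustrative name; real switches vary widely.

```python
def seconds_until_overflow(buffer_bits, ingress_bps, egress_bps):
    """Time for a buffer to fill when traffic arrives faster than it
    can leave. Returns infinity if there is no speed mismatch."""
    fill_rate_bps = ingress_bps - egress_bps
    if fill_rate_bps <= 0:
        return float("inf")
    return buffer_bits / fill_rate_bps


# Assumed example: 16 MB buffer at a 10 GigE to 1 GigE junction.
t = seconds_until_overflow(
    buffer_bits=16 * 8 * 1_000_000,
    ingress_bps=10_000_000_000,
    egress_bps=1_000_000_000,
)
print(f"buffer full after {t * 1000:.1f} ms")
```

Under those assumptions the buffer is full after roughly 14 milliseconds of line-rate traffic, and everything after that is dropped until the burst ends.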
The good news is that hardware middleware gets rid of the software messaging bottlenecks that are keeping multicast alive. One of the big “a ha!” moments in every Solace messaging architecture review with customers or prospects is when they understand that today’s 1 GigE and 10 GigE networks provide more than enough bandwidth to unicast each subscriber a custom market data feed using TCP. That means they do not need to blow out the doors on a uniform layer 2 network, or build complex gateways between subnets with different speeds. This solves all the major problems with multicast.
That’s a much more sane environment, so why wouldn’t every product do it this way? To do this you need a messaging product that can handle many more messages per second with very low latency. Software has hit its limit in the neighborhood of (optimistically) a million messages per second. At rates higher than that, latency becomes wildly inconsistent and unpredictable. This is a fundamental limit caused by context switching between the operating system, the network stack, and the application code within a server.
A 100% hardware datapath, on the other hand, does not have any software or operating system, and as a result avoids all context switching. That’s how we get higher rates and lower latencies at real-world volumes than any software stack can achieve, and it’s why the move to hardware-based middleware is inevitable and is accelerating in the financial user community as well as in other high performance computing environments.