Deconstructing Kafka

A colleague of mine asked me to “compare and contrast” Solace with Kafka. Apache Kafka is an open source distributed messaging system, written by LinkedIn specifically for activity stream and log processing. As a detail-oriented techie, I hate it when someone asks me a one-sentence question that requires a two-page answer. Sometimes I wish I could feel good about myself while spouting weasel words like “robust”, “scalable”, and “carrier class” to explain why my employer’s products are different. I am just not that guy. Instead, I got out my green highlighter and my reading glasses and spent the afternoon dissecting the excellent whitepaper written by Kreps, Narkhede, and Rao.

First, it’s important to establish what constitutes large scale in the log processing space and why there is a need for solutions like Kafka. Here are some of the log processing metrics quoted in the paper:

  • LinkedIn: generates hundreds of GB of new data per day, in the form of 1 billion messages
  • China Mobile: generates 5-8 TB of Call Detail Records (CDRs) per day
  • Facebook: generates 6 TB of user activity data per day

These huge volumes of data are generated across thousands of source systems distributed in multiple data centers, so it’s no easy feat to reliably capture the torrent of data, consolidate it and route it to the various analytics engines and back-end applications that feed on log data as the raw material for search, fraud detection, targeted advertising, and social networking.

Sounds like a good job for a message broker, either an open source option like RabbitMQ or Apache ActiveMQ, or a commercial product like IBM WebSphere MQ or TIBCO EMS.

The four main reasons outlined in the paper for why traditional enterprise messaging offerings are not a fit for LinkedIn’s use case are:

  1. Feature bloat and the overkill of unnecessarily strong message delivery guarantees
  2. Not focused on throughput
  3. Weak clustering/partitioning support
  4. Poor message spool handling under heavy load

This is when I started to get really intrigued, because the items in the list above are the very same ones I use to qualify a fit for a Solace messaging appliance. Log processing, along with the similar workload in sensor networks, is becoming a common and repeatable use case.

Let’s look quickly at how the four needs are addressed in a hardware implementation:

  1. Bloat – Reliable message delivery with batching. Solace has two simple tiers of delivery quality (best effort and guaranteed) with configurable sliding windows, ACKs, and vectored send (i.e. batching) for high throughput in LAN or WAN environments (where the RTT will kill most JMS implementations); the batching idea is sketched in code just after this list. Dedicated PCI cards handle the different qualities of service so that traffic from one does not impact the performance of the other. If guaranteed delivery is not required, the extra card is not installed and there is no “feature bloat” or unnecessary code path in the broker.
  2. Throughput – Hardware offload is perfect for maximizing throughput. Relieving the Linux kernel of interrupts for disk, network, and memory I/O, and of the associated context switching, is how the Solace 3260 achieves 10/40 Gbps wire-speed throughput on a single box.
  3. Clustering – Interconnected nodes with no master hub, and dynamic yet efficient message routing and balancing protocols, are built in. FPGAs and network processors (like those found in F5 load balancers and Cisco routers) can be reprogrammed to do high-speed inter-node Layer 7 routing across a network of message brokers without impacting each node’s ability to serve client producers and consumers. By dedicating the balancing to bare-metal silicon, it is possible to do continuous balancing at higher speed than the periodic balancing done in software.
  4. Message spooling – Solace appliances accelerate the persistence of messages in a novel tiered non-volatile message spool, without suffering the hit of slow spinning hard disk I/O for every message. Flash and SSDs don’t help if you treat them like hard disks and write every message, because the constant writing quickly creates longevity problems. Solace uses RAM backed by super-capacitors, plus RDMA memory replication within clustered nodes, to allow huge queues to build up and then be efficiently dequeued later. This enables the “snake eating a pig” use case, where messages bulge temporarily in the message spool until downstream processing kicks in to digest the data. This capability makes for an excellent buffer or shock absorber for log data streams flowing into batch-oriented big data frameworks like Hadoop.
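
As a quick aside on point 1, the batching idea itself is easy to see in code. The sketch below uses the Apache Kafka Java producer client purely as a familiar illustration (Solace’s vectored send has its own API); the broker address, topic name, and message count are hypothetical, and the batch settings simply mirror the 50 × 200-byte batches used in the paper’s benchmark.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class BatchingPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // hypothetical broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        // Accumulate up to 10 KB per partition before sending (50 x 200-byte messages).
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 10_000);
        // Wait up to 5 ms for a batch to fill rather than sending each message on its own.
        props.put(ProducerConfig.LINGER_MS_CONFIG, 5);
        // acks=1 waits only for the leader, a weaker guarantee than once-and-only-once delivery.
        props.put(ProducerConfig.ACKS_CONFIG, "1");

        byte[] payload = new byte[200]; // 200-byte messages, as in the benchmark
        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 1_000_000; i++) {
                // send() is asynchronous; records are grouped into batches behind the scenes,
                // amortizing the per-message network and system-call overhead.
                producer.send(new ProducerRecord<>("activity-log", payload));
            }
        } // close() flushes any partially filled batches
    }
}
```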

So now let’s look at the actual performance. The benchmark testing in the Kafka paper matches the proof-of-concept testing I have seen with ActiveMQ and RabbitMQ: basically, anything with durable, persistent, clustered operation puts a heavy load on general-purpose Linux servers and yields no more than 10,000–20,000 msg/sec in or out. The Kafka numbers are impressive, and proof that the optimizations and tradeoffs made are effective.
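
For a sense of why those numbers come out so low, here is a minimal sketch of the kind of publisher such proof-of-concept tests typically exercise, written against the JMS API with ActiveMQ as the example provider; the broker URL and queue name are hypothetical. With persistent delivery and no transactions, each send blocks until the broker has stored and acknowledged the message, so per-message disk and network latency, rather than bandwidth, caps the rate.

```java
import javax.jms.BytesMessage;
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.DeliveryMode;
import javax.jms.MessageProducer;
import javax.jms.Session;
import org.apache.activemq.ActiveMQConnectionFactory;

public class PersistentPublisher {
    public static void main(String[] args) throws Exception {
        // Hypothetical broker URL and queue name, for illustration only.
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://broker:61616");
        Connection connection = factory.createConnection();
        try {
            connection.start();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer = session.createProducer(session.createQueue("activity-log"));
            // PERSISTENT delivery: the broker must store the message before acknowledging it.
            producer.setDeliveryMode(DeliveryMode.PERSISTENT);

            byte[] payload = new byte[200]; // 200-byte messages, as in the benchmarks
            for (int i = 0; i < 100_000; i++) {
                BytesMessage message = session.createBytesMessage();
                message.writeBytes(payload);
                // Each send waits for the broker to persist and acknowledge the message,
                // so throughput is bounded by disk sync plus a network round trip per message.
                producer.send(message);
            }
        } finally {
            connection.close();
        }
    }
}
```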

Kafka Perf test

  • 2 Linux servers with 8 cores, 1 GigE, 16 GB RAM
  • 200-byte message size
  • Tested with batches of 1 and 50 messages per batch
  • Pub rate of 50,000 msg/sec (batches of 1)
  • Pub rate of 400,000 msg/sec (batches of 50)
      • Said another way, this is 8,000 batches/sec @ 10 KB batch size (50 × 200 bytes)
      • Said another way, this is ~0.7 Gbps wire-speed throughput
  • Sub rate of 22,000 msg/sec

Solace Perf test

  • 1 Solace 3260 appliance with a Fedora Core Linux control plane (8 cores, 1 GigE, 16 GB RAM), fitted with:
      • Solace Topic Routing PCI Card
      • Solace Assured Delivery PCI Card
      • Solace 10 GigE Network Acceleration PCI Card
  • 200-byte message size
  • Pub rate of 206,400 msg/sec (batches of 1)
  • Pub rate of 2,670,000 msg/sec (batches of 50)
      • Said another way, this is 53,400 batches/sec @ 10 KB batch size (50 × 200 bytes)
      • Said another way, this is ~4.4 Gbps wire-speed throughput
  • Sub rate of 206,400 msg/sec

So by most metrics the Solace 3260 appliance delivers 6.7–9.4 times higher throughput than Kafka, even with the higher quality of service provided by the once-and-only-once, in-order “assured delivery”. Relaxing the guaranteed delivery requirement further would yield roughly a further 25-fold increase in throughput (to ~11 million msg/sec, in+out, batches of 1). While it is true that the Solace benchmark uses 10 Gbps Ethernet rather than the 1 Gbps Ethernet used in the Kafka test, network bandwidth was not the bottleneck in either test, so the performance difference can be attributed to the acceleration provided by the Solace appliance and its PCI cards offloading the Linux kernel.
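
For anyone who wants to check the arithmetic, the short sketch below reproduces the comparison from the two benchmark summaries above; the rates and the 200-byte message size are taken straight from those lists, and the payload-only bandwidth figures land just under the quoted ~0.7 and ~4.4 Gbps, which presumably include protocol framing.

```java
public class ThroughputMath {
    public static void main(String[] args) {
        final int msgBytes = 200;

        // Kafka, batches of 50: 400,000 msg/sec published.
        double kafkaGbps = 400_000L * msgBytes * 8 / 1e9;    // ~0.64 Gbps of message payload

        // Solace 3260, batches of 50: 2,670,000 msg/sec published.
        double solaceGbps = 2_670_000L * msgBytes * 8 / 1e9; // ~4.27 Gbps of message payload

        double pubRatio = 2_670_000.0 / 400_000.0;           // ~6.7x (batched publish)
        double subRatio = 206_400.0 / 22_000.0;              // ~9.4x (subscribe)

        System.out.printf("Kafka: %.2f Gbps, Solace: %.2f Gbps%n", kafkaGbps, solaceGbps);
        System.out.printf("Publish ratio: %.1fx, subscribe ratio: %.1fx%n", pubRatio, subRatio);
    }
}
```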

So to summarize, the architecture of Kafka does an excellent job of optimizing generic message-oriented middleware into a lean and powerful log processing framework. The lessons learned and approaches used are equally applicable in firmware rather than software, and can yield yet another leap in overall throughput. Features such as compression, replication, and stream processing can be efficiently added inline in firmware. The modular approach of a hardware-based implementation makes it possible to add extra features and functions back in, enabling a common shared infrastructure for generic messaging as well as log processing.