16 min read
by Balint Molnar
Published on 07/28/2019
Last updated on 02/05/2024
- Multi-cloud application management
- An Istio based automated service mesh for multi and hybrid cloud deployments
- Federated resource and application deployments built on Kubernetes federation v2
- We've added the support necessary to run Apache Kafka over Istio (using our Kafka and Istio operator, and orchestrated by Pipeline).
- Running Kafka over Istio does not add performance overhead (other than what is typical of mTLS, which is the same as running Kafka over SSL/TLS).
- With Pipeline, you can now create Kafka clusters across multi-cloud and hybrid-cloud environments.
- Read the Right-sizing Kafka clusters on Kubernetes post if you're interested in how to properly size your Kafka clusters
- Would you like to run Kafka over Istio the easy way? Try Supertubes.
Check out Supertubes in action on your own clusters by running a simple install command:

supertubes install -a --no-demo-cluster --kubeconfig <path-to-k8s-cluster-kubeconfig-file>

As you might know, Cisco has recently acquired Banzai Cloud. We are currently in a transitional period and are moving our infrastructure, so evaluation downloads are temporarily suspended; contact us so we can discuss your needs and requirements, and organize a live demo.

Metrics preview for a 3 broker, 3 partition and 3 replication factor scenario with producer ACK set to all; read the documentation for details. Take a look at some of the Kafka features that we've automated and simplified through Supertubes and the Koperator, which we've already blogged about:
- Oh no! Yet another Kafka operator for Kubernetes
- Monitor and operate Kafka based on Prometheus metrics
- Kafka rack awareness on Kubernetes
- Running Apache Kafka over Istio - benchmark
- User authenticated and access controlled clusters with Koperator
- Kafka rolling upgrade and dynamic configuration on Kubernetes
- Envoy protocol filter for Kafka, meshed
- Right-sizing Kafka clusters on Kubernetes
- Kafka disaster recovery on Kubernetes with CSI
- Kafka disaster recovery on Kubernetes using MirrorMaker2
- The benefits of integrating Apache Kafka with Istio
- Kafka ACLs on Kubernetes over Istio mTLS
- Declarative deployment of Apache Kafka on Kubernetes
- Bringing Kafka ACLs to Kubernetes the declarative way
- Kafka Schema Registry on Kubernetes the declarative way
- Announcing Supertubes 1.0, with Kafka Connect and dashboard
Single Cluster Results

| | Google GKE avg. disk IO / broker | Amazon EKS avg. disk IO / broker |
|---|---|---|
| Kafka with SSL/TLS | 274 MB/s | 306 MB/s |
| Kafka with Istio | 417 MB/s | 439 MB/s |
| Kafka with Istio and mTLS | 323 MB/s | 340 MB/s |
Multi Cluster Results

| Kafka clusters with Istio mTLS | Avg. disk IO / broker | Avg. latency between clusters |
|---|---|---|
| GKE eu-west1 <-> GKE eu-west4 | | |
| EKS eu-north1 <-> EKS eu-west1 | | |
| EKS eu-central1 <-> GKE eu-west3 | | |
Running Kafka over an Istio service mesh

There is considerable interest within the Kafka community in the possibility of leveraging more Istio features, such as out-of-the-box tracing and mTLS through protocol filters, though these features have different requirements, as reflected in the Envoy and Istio projects and on a variety of other GitHub repos and discussion boards. While we've already covered most of these features with Supertubes in the Pipeline platform - monitoring, dashboards, secure communication, centralized log collection, autoscaling, Prometheus based alerts, automatic failure recovery, etc. - there was one important set of features that we and our customers missed: handling network failures and support for multiple network topologies. We've previously handled these with Backyards (now Cisco Service Mesh Manager) and the Istio operator. Now the time has arrived to explore running Kafka over Istio, and to automate the creation of Kafka clusters across single-cloud multi-AZ, multi-cloud and especially hybrid-cloud environments.
Getting Kafka to run on Istio wasn't easy; it took time and required heavy expertise in both Kafka and Istio. With more than a little hard work and determination, we accomplished what we set out to do. Then, because that's how we roll, we automated the whole process to make it as smooth as possible on the Pipeline platform. For those of you who'd like to go through the work and learn the gotchas - the what's whats, the ins and outs - we'll be following up with a deep technical dive in another post soon. Meanwhile, feel free to check out the relevant GitHub repositories.
The cognitive bias

Cognitive bias is an umbrella term that refers to the systematic ways in which the context and framing of information influence individuals’ judgment and decision-making. There are many kinds of cognitive biases that influence individuals differently, but their common characteristic is that - in step with human individuality - they lead to judgment and decision-making that deviates from rational objectivity. Since releasing the Istio operator, we've found ourselves in the middle of a heated debate over Istio. We had already witnessed a similar course of events with Helm (and Helm 3), and we rapidly came to realize that many of the most passionate opinions on this subject were not based on first-hand experience. While we sympathize with some of the issues people have with Istio's complexity - this was exactly our rationale behind open sourcing our Istio operator and releasing our Backyards (now Cisco Service Mesh Manager) product - we don't really agree with most performance-related arguments. Yes, Istio has lots of convenient features you may or may not need, and some of these might come with added latency, but the question is, as always: is it worth it?
Note: yes, we've witnessed Mixer performance degradation and other issues while running a large Istio cluster with lots of microservices, policy enforcement and raw telemetry data processing, and we share concerns about these; the Istio community is working on a mixerless version, with features mostly pushed down to Envoy.
Be objective, measure first

Before we could reach a consensus about whether or not to release these features to our customers, we decided to conduct a performance test, using several test scenarios for running Kafka over an Istio-based service mesh. As you might be aware, Kafka is a data intensive application, so we wanted to test it with and without Istio in order to measure the added overhead. Additionally, we were interested in how Istio handles data intensive applications where there is a constant high I/O throughput and all of its components are maxed out.
We used a new version of our Koperator (>=0.5.0), which provides native support for Istio-based service meshes.
Benchmark Setup

To validate our multi-cloud setup we decided to benchmark Kafka first with various single Kubernetes cluster scenarios:
- Single cluster, 3 brokers, 3 topics with 3 partitions and replication-factor set to 3, TLS disabled
- Single cluster, 3 brokers, 3 topics with 3 partitions and replication-factor set to 3, TLS enabled
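The post doesn't show how the test topics were created; as a minimal sketch, creating a 3-partition, replication-factor-3 topic layout with the stock Kafka CLI would look roughly like this. The topic names are made up, and note that Kafka 2.1.x (the version benchmarked here) still used a --zookeeper flag where newer releases take --bootstrap-server as shown:

```shell
# Illustrative only: create the 3-partition, RF-3 test topics.
# Topic names are hypothetical; falls back to a message if no CLI/broker is available.
BROKER="${BROKER:-localhost:9092}"
for t in perf-topic-1 perf-topic-2 perf-topic-3; do
  kafka-topics.sh --bootstrap-server "$BROKER" \
    --create --topic "$t" --partitions 3 --replication-factor 3 \
    || echo "kafka-topics.sh unavailable or no broker at $BROKER; topic $t not created"
done
```

In practice the Koperator creates and manages these topics for you; the commands are shown only to make the topic layout concrete.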
In another post we'll be releasing benchmarks for hybrid-cloud Kafka clusters, wherein we use our own Kubernetes distribution, PKE. We wanted to simulate a use case we often see on our Pipeline platform, so we distributed nodes across availability zones, with Zookeeper and clients on different nodes as well. The following instance types were used:
Just FYI, Amazon throttles small instance types' disk IO after 30 minutes for the rest of the day; you can read more about that here. For storage we requested Amazon's provisioned IOPS SSD (io1), which on the instances listed above can reach 437MB/s throughput, continuously. On Google GKE we used pd-ssd, which can reach 400MB/s according to Google's documentation.
Kafka and the Load Tool

For Kafka, we used 3 topics, with partition count and replication factor set to 3. For the purpose of testing we used default config values, except min.insync.replicas, which we set to 3.
In the benchmark we used our custom-built Kafka Docker image, banzaicloud/kafka:2.12-2.1.1. It is based on the wurstmeister/kafka:2.12-2.1.1 image, but uses Java 11 instead of 8 (for its SSL library improvements) on Debian, and contains Kafka version 2.1.1. The Kafka containers were configured to use 4 CPU cores and 12GB RAM, with a Java heap size of 10GB. The load was generated using sangrenel, a small Go-based Kafka performance tool, configured as follows:
- message size of 512 bytes
- no compression
- required-acks set to all
- workers set to 20
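Put together, a sangrenel invocation with the settings above might look like the following; the flag names are taken from sangrenel's README and should be double-checked against the version you install, and the broker address and topic names are assumptions:

```shell
# Sketch of a sangrenel run matching the settings above (512-byte messages,
# no compression, acks=all, 20 workers); verify flags against your version.
BROKERS="${BROKERS:-localhost:9092}"
sangrenel -brokers "$BROKERS" -topics perf-topic-1,perf-topic-2,perf-topic-3 \
  -message-size 512 -compression none -required-acks all -workers 20 \
  || echo "sangrenel not installed or brokers unreachable; command shown for illustration"
```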
Creating the infrastructure for the benchmark is beyond the scope of this blog, but if you're interested in reproducing it, we suggest using Pipeline and visiting the Koperator GitHub repo for more details.
Benchmarking the environment

Before getting into Kafka's benchmark results, we also benchmarked our environments. As Kafka is an extremely data intensive application, we gave special focus to measuring disk speed and network performance; based on our experience, these are the metrics that most affect Kafka. For network performance, we used a tool called iperf. Two identical Ubuntu-based pods were created: one a server, the other a client.
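To make the iperf measurement reproducible, here is a minimal sketch of the server and client sides. In the actual benchmark these ran in two separate pods, so point the client at the server pod's IP instead of localhost; port and duration are assumptions:

```shell
# Server side: listen for test traffic on port 5001 (iperf's default).
iperf -s -p 5001 &
SERVER_PID=$!
sleep 1
# Client side: run a short throughput test against the server
# (localhost here only to keep the sketch self-contained).
iperf -c 127.0.0.1 -p 5001 -t 3 \
  || echo "iperf not installed; run the client against the server pod's IP"
kill "$SERVER_PID" 2>/dev/null || true
```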
- On Amazon EKS we measured 3.01 Gbits/sec of throughput.
- On Google GKE we measured 7.60 Gbits/sec of throughput.
For disk performance, we used dd on our Ubuntu-based containers.

- We measured 437MB/s on Amazon EKS (exactly in line with what Amazon offers for that instance and SSD type).
- We measured 400MB/s on Google GKE (also in line with what Google offers for its instance and SSD type).
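A dd invocation similar to the one we used looks like this; the exact flags we ran aren't listed in the post, and conv=fdatasync is assumed here so that dd reports sustained write throughput rather than page-cache speed:

```shell
# Sequential write check: write 256MB and let GNU dd report throughput.
# conv=fdatasync forces the data to actually hit the disk before dd exits.
dd if=/dev/zero of=/tmp/disk-bench bs=1M count=256 conv=fdatasync 2>&1 | tail -n 1
rm -f /tmp/disk-bench
```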
Google GKE Kafka on Kubernetes - without Istio

After the results we got on EKS, we were not surprised that Kafka maxed out disk throughput and hit 417MB/s on GKE. That performance was limited by the instance's disk IO.
Kafka on Kubernetes with TLS - still without Istio

Once we switched on SSL/TLS for Kafka, as expected and as has been benchmarked many times, a performance loss occurred. Java is well known for the poor performance of its (otherwise pluggable) SSL/TLS implementation, and for the performance issues it causes in Kafka. However, there have been improvements in recent implementations (9+); accordingly, we upgraded to Java 11. Still, the results were as follows:

- 274MB/s throughput, a ~30% throughput loss
- an increase of ~2x in the packet rate, compared to non-TLS
Kafka on Kubernetes - with Istio

We were eager to see whether there was any added overhead and performance loss when we deployed and used Kafka in Istio. The results were promising:

- no performance loss
- again, a slight increase on the CPU side
Kafka on Kubernetes - with Istio and mTLS enabled

Next we enabled mTLS on Istio and reused the same Kafka deployment. The results are better than they were for the Kafka on Kubernetes with SSL/TLS scenario:

- 323MB/s throughput, a ~20% throughput loss
- a ~2x packet rate increase, compared to non-TLS
Amazon EKS Kafka on Kubernetes - without Istio

With this setup we achieved a considerable write rate of 439MB/s, which, if messages are 512 bytes, is 892928 messages/second. In point of fact, we maxed out the disk throughput provided by AWS for the r5.4xlarge instance type.
Kafka on Kubernetes with TLS - still without Istio

Once we switched on SSL/TLS for Kafka, again, as was expected and has been benchmarked many times, a performance loss occurred. Java's SSL/TLS implementation performance issues are just as relevant on EKS as on GKE. However, like we said, there have been improvements in recent implementations; accordingly, we upgraded to Java 11, but the results were as follows:

- 306MB/s throughput, which is a ~30% throughput loss
- an increase of ~2x in the packet rate, compared to non-TLS scenarios
Kafka on Kubernetes - with Istio

Again, just as before, the results were promising:

- no performance loss occurred
- there was a slight increase on the CPU side
Kafka on Kubernetes - with Istio and mTLS enabled

Next we enabled mTLS on Istio and reused the same Kafka deployment. The results, again, are better than for Kafka on Kubernetes with SSL/TLS:

- 340MB/s throughput, which is a throughput loss of around 20%
- an increased packet rate, but lower than a factor of ~2x
Bonus track - Kafka on Linkerd (without mTLS)

We always test all our available options, so we wanted to give this a try with Linkerd. Why? Because we could. While we know that Linkerd can't meet our customers' expectations in terms of available features, we still wanted to give it a try. Our expectations were high, but the numbers produced gave us a hard lesson and a helpful reminder of what, exactly, cognitive bias is.
Single Cluster conclusion

Before we move on to our multi-cluster benchmark, let's evaluate the numbers we have already. We can tell that, in these environments and scenarios, using a service mesh without mTLS does not affect Kafka's performance: the throughput of the underlying disk limits performance before Kafka hits network, memory or CPU limits. Using TLS creates a ~20% throughput degradation in Kafka's performance, whether using Istio or Kafka's own SSL/TLS library. It slightly increases the CPU load and roughly doubles the number of packets transmitted over the network.

Note that just enabling mTLS on the network caused a ~20% degradation during the infrastructure test with iperf as well.
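As a sanity check, the loss percentages can be re-derived from the MB/s figures quoted in this post; the exact numbers land in the same ~20-35% ballpark as the rounded figures above:

```shell
# Percentage throughput loss implied by the MB/s figures quoted in this post.
loss() { awk -v b="$1" -v t="$2" 'BEGIN { printf "%.0f", (b - t) / b * 100 }'; }
echo "GKE Kafka TLS  : $(loss 417 274)% loss"   # 274 vs 417 MB/s baseline
echo "GKE Istio mTLS : $(loss 417 323)% loss"   # 323 vs 417 MB/s baseline
echo "EKS Kafka TLS  : $(loss 439 306)% loss"   # 306 vs 439 MB/s baseline
echo "EKS Istio mTLS : $(loss 439 340)% loss"   # 340 vs 439 MB/s baseline
```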
Multi-cluster scenario with topics replicated across "racks" (cloud regions)

In this setup we are emulating something closer to production, wherein, for the sake of reusing our environmental benchmarks, we stick with the same AWS and Google instance types, but set up multiple clusters in different regions (with topics replicated across cloud regions). Note that the process is the same whether we use these multiple clusters across a single cloud provider, multiple clouds or hybrid clouds; from the perspective of Backyards (now Cisco Service Mesh Manager) and the Istio operator there is no difference, and we support 3 different network topologies. One of the clusters is larger than the other: it consists of 2 brokers and 2 Zookeeper nodes, whereas the other has one of each. Note that in a single-mesh multi-cluster environment, enabling mTLS is an absolute must. Also, we set min.insync.replicas to 3 again and the producer ACK requirement to all for durability.
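The post doesn't show how min.insync.replicas was applied (the Koperator handles it declaratively), but for reference, doing the same by hand with the stock Kafka CLI would look roughly like this; the topic name is hypothetical and, as before, older Kafka CLIs took --zookeeper rather than --bootstrap-server:

```shell
# Hypothetical: raising min.insync.replicas on an existing topic by hand.
BROKER="${BROKER:-localhost:9092}"
kafka-configs.sh --bootstrap-server "$BROKER" \
  --entity-type topics --entity-name perf-topic-1 \
  --alter --add-config min.insync.replicas=3 \
  || echo "kafka-configs.sh unavailable or no broker at $BROKER; shown for illustration"
```

With min.insync.replicas=3 and producer acks=all, every write must be acknowledged by all three replicas before it succeeds, which is what makes the cross-region durability claim meaningful.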
The mesh is automated and provided by the Istio operator.
Google GKE <-> GKE

In this scenario we created a single mesh/single Kafka cluster that spanned two Google Cloud regions: eu-west1 and eu-west4.
Amazon EKS <-> EKS

In this scenario we created a single mesh/single Kafka cluster that spanned two AWS regions: eu-north1 and eu-west1.
Google GKE <-> EKS

In this scenario we created a single Istio mesh across multiple clusters that spanned multiple clouds, forming one single Kafka cluster (the Google Cloud region was europe-west3 and the AWS region was eu-central-1). As expected, the results were considerably poorer.
Multi Cluster conclusion

From our benchmarks, we can safely say that it's worth giving Kafka in a multi-cloud, single-mesh environment a shot. People have different reasons for choosing to run Kafka over Istio, but the ease of setup with Pipeline, the additional security benefits, scalability and durability, locality-based load balancing and lots more make it a great choice. As already mentioned, one of the next posts in this series will be about benchmarking/operating an autoscaling hybrid-cloud Kafka cluster, wherein alerts and scaling events are based on Prometheus metrics (we do something similar for autoscaling based on Istio metrics for multiple applications, which we deploy and observe through the mesh; read this older post for details: Horizontal Pod Autoscaling based on custom Istio metrics).
About Backyards

Banzai Cloud’s Backyards (now Cisco Service Mesh Manager) is a multi and hybrid-cloud enabled service mesh platform for constructing modern applications. Built on Kubernetes and our Istio operator, it gives you flexibility, portability, and consistency across on-premise datacenters and cloud environments. Use our simple, yet extremely powerful UI and CLI, and experience automated canary releases, traffic shifting, routing, secure service communication, in-depth observability and more, for yourself.
About Banzai Cloud Pipeline

Banzai Cloud’s Pipeline provides a platform for enterprises to develop, deploy, and scale container-based applications. It leverages best-of-breed cloud components, such as Kubernetes, to create a highly productive, yet flexible environment for developers and operations teams alike. Strong security measures — multiple authentication backends, fine-grained authorization, dynamic secret management, automated secure communications between components using TLS, vulnerability scans, static code analysis, CI/CD, and so on — are default features of the Pipeline platform.
About Banzai Cloud

Banzai Cloud is changing how private clouds are built: simplifying the development, deployment, and scaling of complex applications, and putting the power of Kubernetes and Cloud Native technologies in the hands of developers and enterprises, everywhere. #multicloud #hybridcloud #BanzaiCloud