Apache Kafka was designed with a heavy emphasis on fault tolerance and high availability, and thus provides several methods of ensuring enterprise-grade resiliency, such as topic replication across brokers, in-sync replica (ISR) tracking, and configurable producer acknowledgments (acks).
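For instance (a sketch using the KafkaTopic custom resource format shown later in this post; the topic name and settings are illustrative), a topic that tolerates the loss of a single broker without losing acknowledged messages could be declared like this:

apiVersion: kafka.banzaicloud.io/v1alpha1
kind: KafkaTopic
metadata:
  name: resilient-topic
  namespace: kafka
spec:
  name: resilient-topic
  partitions: 3
  replicationFactor: 3 # each partition is replicated to three brokers
  config:
    "min.insync.replicas": "2" # with producers using acks=all, a write is acknowledged only once two in-sync replicas have it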
These features alone allow us to set up Kafka clusters that can tolerate most of the failure events that might occur in a distributed system. Unfortunately, they do not safeguard against catastrophic events, like losing all the brokers that host the replicas of a topic at once, which results in data loss - this scenario is covered in our Kafka disaster recovery on Kubernetes using MirrorMaker2 blog post. Our Supertubes product solves this problem for our customers, since it provides disaster recovery for the Kafka clusters it provisions on Kubernetes. Currently, it includes two different disaster recovery methods which can be used, either separately or combined, to achieve the desired level of fault tolerance. One is based on backing up the volumes used by brokers and restoring brokers from those backups. The other relies on cross-cluster replication between remote Kafka clusters using MirrorMaker2. In this post we'll cover disaster recovery based on volume backup and restoration; we explore the method based on MirrorMaker2 in a separate post.
Check out Supertubes in action on your own clusters: register for an evaluation version and run a simple install command! As you might know, Cisco has recently acquired Banzai Cloud. Currently we are in a transitional period and are moving our infrastructure, so evaluation downloads are temporarily suspended. Contact us so we can discuss your needs and requirements, and organize a live demo.
supertubes install -a --no-demo-cluster --kubeconfig <path-to-k8s-cluster-kubeconfig-file>
or read the documentation for details. Take a look at some of the Kafka features that we've automated and simplified through Supertubes and the Koperator, which we've already blogged about:
- Oh no! Yet another Kafka operator for Kubernetes
- Monitor and operate Kafka based on Prometheus metrics
- Kafka rack awareness on Kubernetes
- Running Apache Kafka over Istio - benchmark
- User authenticated and access controlled clusters with Koperator
- Kafka rolling upgrade and dynamic configuration on Kubernetes
- Envoy protocol filter for Kafka, meshed
- Right-sizing Kafka clusters on Kubernetes
- Kafka disaster recovery on Kubernetes with CSI
- Kafka disaster recovery on Kubernetes using MirrorMaker2
- The benefits of integrating Apache Kafka with Istio
- Kafka ACLs on Kubernetes over Istio mTLS
- Declarative deployment of Apache Kafka on Kubernetes
- Bringing Kafka ACLs to Kubernetes the declarative way
- Kafka Schema Registry on Kubernetes the declarative way
- Announcing Supertubes 1.0, with Kafka Connect and dashboard
If all the replicas of a topic are lost due to the simultaneous failure of all the brokers those replicas are distributed over, the messages stored under that topic are gone and can't be recovered. The failure-handling mechanisms provided by Kafka cannot handle this situation, so we need a solution that gives us the ability to periodically back up the volumes that store broker data. The Container Storage Interface (CSI) has already been introduced to Kubernetes to add support for pluggable volumes. What makes CSI interesting for us? The Volume Snapshot and Restore Volume from Snapshot features have been added to Kubernetes, but are only supported with CSI volume plugins.
Note: in order to use volume snapshots, besides a CSI driver running on the cluster, the following prerequisites must be met (see the snapshot sketch after this list):
- enable flag --allow-privileged=true for kubelet and kube-apiserver
- enable kube-apiserver feature gates --feature-gates=CSINodeInfo=true,CSIDriverRegistry=true,CSIBlockVolume=true,VolumeSnapshotDataSource=true
- enable kubelet feature gates --feature-gates=CSINodeInfo=true,CSIDriverRegistry=true,CSIBlockVolume=true
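With the prerequisites in place, a snapshot can be requested declaratively. As a minimal sketch (the PVC name is hypothetical, and the API version is the v1beta1 shipped with Kubernetes 1.17), a manual snapshot of a broker volume would look like this:

apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshot
metadata:
  name: kafka-2-manual-snapshot
  namespace: kafka
spec:
  volumeSnapshotClassName: csi-aws-vsc # a VolumeSnapshotClass backed by the CSI driver
  source:
    persistentVolumeClaimName: kafka-2-storage # hypothetical PVC used by broker 2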
Kubernetes is moving away from in-tree storage plugins to the CSI driver model, as adding support for new volume plugins to Kubernetes was challenging with the former model. In Kubernetes v1.17, CSI migration (introduced in Kubernetes v1.14 as an alpha feature) was promoted to beta to help the in-tree to CSI migration effort. When CSI migration is enabled, existing stateful deployments and workloads continue to function as they always have; however, behind the scenes Kubernetes hands control of all storage management operations (previously targeting in-tree drivers) over to CSI drivers. Since in-tree volume plugins will be removed completely in future Kubernetes versions, new volume-related features will most likely be implemented only in CSI drivers. Just think of the already mentioned volume snapshots and volume restore from snapshot. But there are others, like volume cloning and volume expansion; Kafka can benefit from the latter, as it makes it possible to resize the volume under a broker when needed, and doesn't involve a broker configuration change, in contrast to attaching additional volumes to the broker. Supertubes leverages volume snapshots to back up and restore the persistent volumes used by Kafka brokers.
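To illustrate volume expansion (a sketch, assuming the PVC's StorageClass sets allowVolumeExpansion: true; the PVC name is hypothetical), resizing a broker volume is a single PVC update:

# grow the broker volume in place; the CSI driver and filesystem handle the rest
kubectl patch pvc kafka-2-storage -n kafka -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'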
The user enables periodic backups for a Kafka cluster by creating a KafkaBackup custom resource in the namespace where the Kafka cluster is running. Example KafkaBackup custom resource:
apiVersion: kafka.banzaicloud.io/v1beta1
kind: KafkaBackup
metadata:
  name: kafka-hourly
spec:
  clusterRef: kafka # name of the Kafka cluster this backup is for
  schedule: "@every 1h" # run backup every hour
  retentionTime: "12h" # delete backups older than 12 hours
  startingDeadlineSeconds: 300 # do not run a backup that missed its scheduled time by more than 300 seconds
  volumeSnapshotClassName: "csi-aws-vsc" # name of a VolumeSnapshotClass available on the Kubernetes cluster
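The resource is applied like any other Kubernetes manifest, for example (assuming it is saved as kafka-backup.yaml):

kubectl apply -n kafka -f kafka-backup.yaml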
Supertubes creates volume snapshots from all the persistent volumes used by Kafka brokers to store data, according to the schedule specified in the KafkaBackup custom resource. In the event of losing a persistent volume that stores Kafka broker data, Supertubes provisions a new persistent volume in place of the lost one and restores data from the most recent snapshot taken from the lost persistent volume, if any are available. A volume snapshot represents a point-in-time copy of a volume, so the restored persistent volume will not contain messages that were ingested after snapshot creation was initiated. Thus, our restored broker may be missing some of the most recent messages. The missing messages can be synced either from other brokers, provided these have a replica of the missing messages, or from a remote Kafka cluster, which we'll cover how to do in a separate post - the same one in which we'll present disaster recovery using MirrorMaker2.
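Under the hood, restoring from a snapshot boils down to creating a PersistentVolumeClaim whose dataSource references a VolumeSnapshot. A sketch of what Supertubes does on the user's behalf (the PVC name, storage class, and size are hypothetical; the snapshot name matches the format shown later in this post):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: kafka-2-storage-restored
  namespace: kafka
spec:
  storageClassName: csi-aws-sc # hypothetical CSI-backed StorageClass
  dataSource:
    name: kafka-2-storagetngcc-20200221-192033 # the most recent VolumeSnapshot of the lost volume
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi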
It's time to explore this method of disaster recovery through a simple example. Since the aforementioned feature gates required by the CSI driver are enabled by default starting from Kubernetes version 1.17, we are going to provision a cluster using Kubernetes version 1.17 on AWS, using our hybrid cloud container management platform, Banzai Cloud Pipeline, and PKE, our CNCF-certified Kubernetes distribution.
Note that with Banzai Cloud Pipeline, you can provision Kubernetes clusters across five clouds and on-prem (VMware, bare metal).
Register for an evaluation version and run the following command to install the CLI tool. As you might know, Cisco has recently acquired Banzai Cloud. Currently we are in a transitional period and are moving our infrastructure, so evaluation downloads are temporarily suspended. Contact us to discuss your needs and requirements, and organize a live demo.
Deploy the CSI driver to our PKE cluster
Following the Amazon Elastic Block Store (EBS) CSI driver documentation, set up the CSI driver, and create a StorageClass as well as a VolumeSnapshotClass that use the CSI driver.
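As a minimal sketch (the StorageClass name is an assumption, while csi-aws-vsc matches the snapshot class referenced throughout this post), the two objects could look like this:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-aws-sc
provisioner: ebs.csi.aws.com # the EBS CSI driver
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshotClass
metadata:
  name: csi-aws-vsc
driver: ebs.csi.aws.com
deletionPolicy: Delete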
Deploy Supertubes to our PKE cluster
supertubes install -a --no-democluster -c <path-to-k8s-cluster-kubeconfig-file>
Provision a Kafka cluster
supertubes cluster create -n kafka -c <path-to-k8s-cluster-kubeconfig-file> -f https://raw.githubusercontent.com/banzaicloud/koperator/master/config/samples/simplekafkacluster_ebs_csi.yaml
Verify that the cluster is up and running:
supertubes cluster get -n kafka --kafka-cluster kafka -c <path-to-k8s-cluster-kubeconfig-file>
Now we'll create a topic with one partition and a replication factor of one.
supertubes cluster topic create -n kafka --kafka-cluster kafka -c <path-to-k8s-cluster-kubeconfig-file> -f- <<EOF
apiVersion: kafka.banzaicloud.io/v1alpha1
kind: KafkaTopic
metadata:
  name: my-topic
  namespace: kafka
spec:
  name: my-topic
  partitions: 1
  replicationFactor: 1
  config:
    "retention.ms": "28800000"
    "cleanup.policy": "delete"
EOF
Next, deploy a client pod with kafkacat installed, which we'll use to produce and consume messages:
kubectl apply -n kafka -f- <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: kafkacat
spec:
  containers:
    - name: kafka-test
      image: "solsson/kafkacat:alpine"
      # Just spin & wait forever
      command: [ "/bin/bash", "-c", "--" ]
      args: [ "while true; do sleep 3000; done;" ]
EOF
Exec into the pod and produce a couple of test messages to my-topic:
kubectl exec -n kafka -it kafkacat bash
kafkacat -b kafka-all-broker:29092 -P -t my-topic
hello
there
Verify that we can read back the messages, and check which broker is the partition leader for my-topic.
kafkacat -b kafka-all-broker:29092 -C -t my-topic -c2
hello
there
kafkacat -b kafka-all-broker:29092 -L -t my-topic
Metadata for my-topic (from broker -1: kafka-all-broker:29092/bootstrap):
 3 brokers:
  broker 0 at kafka-0.kafka.svc.cluster.local:29092 (controller)
  broker 2 at kafka-2.kafka.svc.cluster.local:29092
  broker 1 at kafka-1.kafka.svc.cluster.local:29092
 1 topics:
  topic "my-topic" with 1 partitions:
    partition 0, leader 2, replicas: 2, isrs: 2
Broker 2 is the partition leader. Now, set up periodic backups for the Kafka cluster, with snapshots taken every fifteen minutes:
supertubes cluster backup create -n kafka --kafka-cluster kafka --snapshot-class csi-aws-vsc --backup-name test-backup --schedule "@every 15m" --retention-time "12h" --starting-deadline-seconds 300 -c <path-to-k8s-cluster-kubeconfig-file>
Wait fifteen minutes until the first backup is executed, then verify that the volume snapshots were created, particularly those of broker 2, which is the partition leader for our single-partition topic my-topic.
kubectl get volumesnapshot -n kafka -l brokerId=2
NAME AGE
kafka-2-storagetngcc-20200221-190533 26m
kafka-2-storagetngcc-20200221-192033 11m
To simulate the complete loss of a broker, we'll first stop the operator so that we can delete the broker 2 pod and its attached volumes.
kubectl scale deployment -n kafka kafka-operator-operator --replicas=0
kubectl delete pod -n kafka -lkafka_cr=kafka,brokerId=2
kubectl delete pvc -n kafka -lkafka_cr=kafka,brokerId=2
Now let's check the status of the my-topic topic in Kafka.
kafkacat -b kafka-all-broker:29092 -L -t my-topic
Metadata for my-topic (from broker -1: kafka-all-broker:29092/bootstrap):
 2 brokers:
  broker 0 at kafka-0.kafka.svc.cluster.local:29092 (controller)
  broker 1 at kafka-1.kafka.svc.cluster.local:29092
 1 topics:
  topic "my-topic" with 1 partitions:
    partition 0, leader -1, replicas: 2, isrs: 2, Broker: Leader not available
As we can see, there is no partition leader for my-topic, and kafkacat returns no messages from this topic.
Start the operator, which will notice that broker 2 is missing and will therefore create a new broker pod. Because Supertubes is aware of the backups created for broker 2, it seamlessly restores the volume for the new broker 2 instance from the volume snapshot - no user intervention required.
kubectl scale deployment -n kafka kafka-operator-operator --replicas=1
Wait for the new broker 2 pod to come up and check my-topic again.
kafkacat -b kafka-all-broker:29092 -C -t my-topic -c2
hello
there
Banzai Cloud Supertubes (Supertubes) is the automation tool for setting up and operating production-ready Kafka clusters on Kubernetes, leveraging a Cloud-Native technology stack. Supertubes includes Zookeeper, the Banzai Cloud Kafka operator, Envoy, Istio and many other components that are installed, configured, and managed to operate a production-ready Kafka cluster on Kubernetes. Some of the key features are fine-grained broker configuration, scaling with rebalancing, graceful rolling upgrades, alert-based graceful scaling, monitoring, out-of-the-box mTLS with automatic certificate renewal, Kubernetes RBAC integration with Kafka ACLs, and multiple options for disaster recovery.