Published on 00/00/0000
Last updated on 00/00/0000
If you are looking for an automated way to provision and manage Kafka on Kubernetes, see our post Kafka on Kubernetes the easy way.
At Banzai Cloud we use Kafka internally a lot. We have some internal systems and customer reporting deployments where we rely heavily on Kafka deployed to Kubernetes. We practice what we preach and all these deployments (not just the external ones) are done using our application platform, Pipeline. There is one difference between regular Kafka deployments and ours (though it is not relevant to this post): we have removed Zookeeper and use etcd instead.
Some of our older posts about Apache Kafka on Kubernetes: Kafka on Kubernetes - using etcd; Monitoring Apache Kafka with Prometheus; Kafka on Kubernetes with Local Persistent Volumes; Kafka on Kubernetes the easy way.
We would like to contribute our changes back to the official Kafka codebase, so we have opened a ticket (KAFKA-6598) and a KIP (KIP-273). If you want to chip in, follow along on the official Kafka mailing list.
For a quick overview of Kafka on Kubernetes using etcd, see the diagram below:

In today's post we're not going to go into detail about how we added etcd support, because we already have a post about that. This post covers how to deploy Kafka on top of Kubernetes with Pipeline using Local Persistent Volumes, a beta feature in Kubernetes 1.10.
If you are interested in our previous posts on the subject, read Kafka on Kubernetes - using etcd and Monitoring Apache Kafka with Prometheus.
Later in this post we're going to talk about Kubernetes Persistent Volume related resources, so if you're not familiar with them, please read our earlier blog post on the topic.
Local Persistent Volume is a beta feature in Kubernetes 1.10. It was created to leverage local disks, and it enables their use with Persistent Volume Claims (PVCs). This type of storage is suitable for applications that handle data replication themselves. Kafka replicates its own data, so Local Persistent Volume is a natural fit. Moreover, because the SSDs are locally attached, we can expect better performance than from other kinds of managed storage.
Note: some of the other systems we use or deploy for our customers, like HDFS and Cassandra, also handle replication themselves, so Local Persistent Volume is a good fit for those as well.
To use Local Persistent Volumes, some manual steps are required first; we go through them later in this post. In later Kubernetes releases, these steps will no longer be needed.
Because Kubernetes already has a feature called hostPath, which allows us to use a local disk as storage for Pods, you may ask, "Why should I use Local Persistent Volume instead?" When using hostPath, the storage path has to be set inside the Pod descriptor. When Local Persistent Volume is used, on the other hand, storage is requested through a Persistent Volume Claim, so the storage path is not encoded directly in the Pod spec.
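To make the difference concrete, here is a minimal sketch that writes one Pod manifest for each approach. Everything in it is an illustrative assumption rather than something taken from our charts: the Pod names, the busybox image, the /mnt/disks/ssd0 path, the claim name demo-claim and the scratch directory.

```bash
# Scratch directory for the two illustrative manifests (assumed location).
workdir="${TMPDIR:-/tmp}/local-pv-demo"
mkdir -p "$workdir"

# Variant 1: hostPath. The node-local path is hard-coded in the Pod spec.
cat > "$workdir/pod-hostpath.yaml" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: hostpath-demo
spec:
  containers:
    - name: app
      image: busybox
      command: ["sleep", "3600"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      hostPath:
        path: /mnt/disks/ssd0
EOF

# Variant 2: the Pod only names a claim; the disk path lives in the
# Persistent Volume that the claim is eventually bound to.
cat > "$workdir/pod-pvc.yaml" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: pvc-demo
spec:
  containers:
    - name: app
      image: busybox
      command: ["sleep", "3600"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: demo-claim
EOF

echo "Manifests written to $workdir"
```

The second manifest contains no node-specific path at all, which is what lets the scheduler and the storage layer decide where the data lives.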
If you've chosen to use Pipeline to create a Kubernetes cluster on any of its supported cloud providers, you need a control plane. For a control plane you'll need a GitHub account and an OAuth App; the latter must be set up in advance, since OAuth is used for authentication. Once the prerequisites are satisfied, follow the instructions described in this blog to spin up a Kubernetes cluster.
Please note, Local Persistent Volume only works with Kubernetes version 1.10+.
When using Pipeline there are two ways of provisioning a Kafka cluster.
Below, we're going to use Pipeline to deploy Kafka, in this case with Helm over a RESTful API, using our open source Kafka charts. Since Pipeline only accepts requests from authenticated users, users must authenticate first, then acquire a token that will be used in subsequent curl requests.
If you want formatted results, install and use jq.
First, we need to determine the id of the cluster we intend to deploy Kafka to, from the list of clusters managed by Pipeline. The API call to list the managed Kubernetes clusters is:
curl -g --request GET \
--url 'http://{{url}}/api/v1/orgs/{{orgId}}/clusters' \
--header 'Authorization: Bearer {{token}}' | jq
[
  {
    "status": "RUNNING",
    "status_message": "Cluster is running",
    "name": "gkecluster-baluchicken-792",
    "location": "europe-west1-b",
    "cloud": "google",
    "id": 1,
    "nodePools": {
      "pool1": {
        "count": 3,
        "instance_type": "n1-standard-2"
      }
    }
  }
]
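Rather than reading the id by eye, it can be picked out with jq. The sketch below runs against a copy of the sample response above, so it needs no running Pipeline instance; the shell variable names are our own choice, and the block skips itself if jq is not installed.

```bash
# Sample response copied from the listing above (stand-in for the live API call).
response='[{"status":"RUNNING","status_message":"Cluster is running","name":"gkecluster-baluchicken-792","location":"europe-west1-b","cloud":"google","id":1,"nodePools":{"pool1":{"count":3,"instance_type":"n1-standard-2"}}}]'

# Name of the cluster we want to deploy to.
cluster_name="gkecluster-baluchicken-792"

if command -v jq >/dev/null 2>&1; then
  # Select the array element whose name matches and print its id.
  cluster_id="$(printf '%s' "$response" \
    | jq -r --arg name "$cluster_name" '.[] | select(.name == $name) | .id')"
  echo "cluster_id=$cluster_id"
else
  echo "jq is not installed; skipping id lookup" >&2
fi
```

The resulting value is what goes into the {{cluster_id}} placeholder of the calls that follow.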
To interact with the Kubernetes cluster using the standard kubectl, let's download the Kubernetes config.
curl -g --request GET \
  --url 'http://{{url}}/api/v1/orgs/{{orgId}}/clusters/{{cluster_id}}/config' \
  --header 'Authorization: Bearer {{token}}'
Save the Kube config to a file on your computer, then point the KUBECONFIG environment variable to that file.
export KUBECONFIG=/Users/baluchicken/Downloads/kubi.conf
As mentioned above, we need to do some manual configuration to get Local Persistent Volume working. We're going to use the local volume static provisioner, which will create the Persistent Volumes. In a three node cluster, where each node has one SSD attached, this DaemonSet will create three individual Persistent Volumes. The code listed below also contains a ConfigMap, which tells the provisioner where to look for mounted disks. When using Google, all SSDs are mounted under /mnt/disks/ssdx.
kubectl create -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: local-provisioner-config
  namespace: default
data:
  # Maps the storage class name to the directory where disks are mounted
  storageClassMap: |
    local-scsi:
      hostDir: /mnt/disks
      mountDir: /mnt/disks
---
# Note: extensions/v1beta1 DaemonSets were removed in Kubernetes 1.16;
# on newer clusters use apps/v1 instead.
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: local-volume-provisioner
  namespace: default
  labels:
    app: local-volume-provisioner
spec:
  selector:
    matchLabels:
      app: local-volume-provisioner
  template:
    metadata:
      labels:
        app: local-volume-provisioner
    spec:
      containers:
        - image: "quay.io/external_storage/local-volume-provisioner:v2.1.0"
          imagePullPolicy: "Always"
          name: provisioner
          securityContext:
            privileged: true
          env:
            - name: MY_NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          volumeMounts:
            - mountPath: /etc/provisioner/config
              name: provisioner-config
              readOnly: true
            - mountPath: /mnt/disks
              name: local-scsi
      volumes:
        - name: provisioner-config
          configMap:
            name: local-provisioner-config
        - name: local-scsi
          hostPath:
            path: /mnt/disks
EOF
If everything's gone according to plan, there should be as many Persistent Volumes as nodes in your cluster.
kubectl get pv
NAME                CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS   REASON   AGE
local-pv-4ba80b32   368Gi      RWO            Delete           Available           local-scsi              19s
local-pv-7cca15bd   368Gi      RWO            Delete           Available           local-scsi              19s
local-pv-dbce8b03   368Gi      RWO            Delete           Available           local-scsi              19s
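The "as many Persistent Volumes as nodes" check can also be scripted. The following is a rough sketch, not part of Pipeline: the helper name count_available_pvs is hypothetical, and it relies on the column order of the output above, where STATUS is the fifth column. It is exercised here on the sample rows so that it runs without a cluster.

```bash
# Count Persistent Volumes in the "Available" state.
# Expects `kubectl get pv --no-headers` style rows on stdin.
count_available_pvs() {
  awk '$5 == "Available" { n++ } END { print n + 0 }'
}

# Sample rows copied from the output above. On a live cluster use instead:
#   kubectl get pv --no-headers | count_available_pvs
sample_pvs='local-pv-4ba80b32   368Gi   RWO   Delete   Available   local-scsi   19s
local-pv-7cca15bd   368Gi   RWO   Delete   Available   local-scsi   19s
local-pv-dbce8b03   368Gi   RWO   Delete   Available   local-scsi   19s'

expected_nodes=3   # node count of the example cluster
available="$(printf '%s\n' "$sample_pvs" | count_available_pvs)"

if [ "$available" -eq "$expected_nodes" ]; then
  echo "OK: $available Persistent Volumes for $expected_nodes nodes"
else
  echo "Mismatch: $available Persistent Volumes for $expected_nodes nodes" >&2
fi
```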
In our last step we will create a Storage Class, which will handle all requests coming from Persistent Volume Claims. This class is special in that it delays PVC binding until the Pod is scheduled, so the appropriate local storage can be chosen. To achieve this, the StorageClass must contain this field: volumeBindingMode: "WaitForFirstConsumer"
kubectl create -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: "local-scsi"
provisioner: "kubernetes.io/no-provisioner"
volumeBindingMode: "WaitForFirstConsumer"
EOF
Now Local Persistent Volume is all set up.
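To see the delayed binding for yourself, you could create a claim against the new class. This is only a minimal sketch: the claim name demo-claim, the 100Gi request and the file location are assumptions made for illustration.

```bash
# Write an illustrative PersistentVolumeClaim that targets the local-scsi class.
manifest="${TMPDIR:-/tmp}/local-scsi-claim.yaml"

cat > "$manifest" <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-claim
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: local-scsi
  resources:
    requests:
      storage: 100Gi
EOF

echo "Wrote $manifest"
# On a cluster:
#   kubectl apply -f "$manifest"
#   kubectl get pvc demo-claim
```

Because of volumeBindingMode: "WaitForFirstConsumer", such a claim is expected to stay in the Pending state until a Pod that uses it is scheduled.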
Since we created a new Kafka Helm chart just for this post, we need to add a new Helm repository to Pipeline.
curl -g --request PUT \
  --url 'http://{{url}}/api/v1/orgs/{{orgId}}/clusters/{{cluster_id}}/helm/repos/banzaicloud-stable' \
  --header 'Authorization: Bearer {{token}}' \
  --header 'Content-Type: application/json' \
  -d '{
    "name": "banzaicloud-stable",
    "cache": "statestore/<your-cluster>/helm/repository/cache/banzaicloud-stable-index.yaml",
    "url": "kubernetes-charts-incubator.banzaicloud.com/branch/kafka-local-volume",
    "username": "",
    "password": "",
    "certFile": "",
    "keyFile": "",
    "caFile": ""
  }' | jq
{
  "status": 200,
  "message": "resource modified successfully",
  "name": "banzaicloud-stable"
}
curl -g --request PUT \
  --url 'http://{{url}}/api/v1/orgs/{{orgId}}/clusters/{{cluster_id}}/helm/repos/banzaicloud-stable/update' \
  --header 'Authorization: Bearer {{token}}' \
  --header 'Content-Type: application/json'
{
  "status": 200,
  "message": "repository updated successfully",
  "name": "banzaicloud-stable"
}
Next we can submit the Kafka deployment to the cluster:
curl -g --request POST \
  --url 'http://{{url}}/api/v1/orgs/{{orgId}}/clusters/{{cluster_id}}/deployments' \
  --header 'Authorization: Bearer {{token}}' \
  --header 'Content-Type: application/json' \
  -d '{"name": "banzaicloud-stable/kafka"}' | jq
{
  "release_name": "kindly-macaw",
  "notes": ""
}
Behind the scenes we're using CoreOS's etcd-Operator to install the etcd cluster Kafka requires. If everything went well, you will see that Kafka with etcd is deployed to your Kubernetes cluster.
kubectl get pods
NAME                                          READY   STATUS    RESTARTS   AGE
etcd-cluster-0000                             1/1     Running   0          6m
etcd-cluster-0001                             1/1     Running   0          6m
etcd-cluster-0002                             1/1     Running   0          6m
kafka-0                                       1/1     Running   0          7m
kafka-1                                       1/1     Running   0          6m
kafka-2                                       1/1     Running   0          5m
kindly-macaw-etcd-operator-65bdcfc5d7-mqvfm   1/1     Running   0          7m
pipeline-traefik-7c47dc7bd7-hntqp             1/1     Running   0          8m
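If you would rather check the rollout from a script than by reading the table, something along these lines may help. It is a sketch under assumptions: the helper name all_pods_running is hypothetical, and it treats the third column of the output above as the Pod status. The sample rows are copied from that output so the block runs without a cluster.

```bash
# Succeed only if at least one row is present and every row has STATUS "Running".
# Expects `kubectl get pods --no-headers` style rows on stdin.
all_pods_running() {
  awk '{ total++ } $3 != "Running" { bad++ } END { exit !(total > 0 && bad + 0 == 0) }'
}

# Sample rows copied from the output above. On a live cluster use instead:
#   kubectl get pods --no-headers | all_pods_running
sample_pods='etcd-cluster-0000 1/1 Running 0 6m
etcd-cluster-0001 1/1 Running 0 6m
etcd-cluster-0002 1/1 Running 0 6m
kafka-0 1/1 Running 0 7m
kafka-1 1/1 Running 0 6m
kafka-2 1/1 Running 0 5m
kindly-macaw-etcd-operator-65bdcfc5d7-mqvfm 1/1 Running 0 7m
pipeline-traefik-7c47dc7bd7-hntqp 1/1 Running 0 8m'

if printf '%s\n' "$sample_pods" | all_pods_running; then
  echo "All pods are running"
else
  echo "Some pods are not running yet" >&2
fi
```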
We already have a blog post about how to run a simple Kafka app inside the cluster; please refer to that post for details.
Meanwhile, we've released a more recent version of etcd backed Kafka. Its image name is:
banzaicloud/kafka:2.12-2.0.0-etcd-0.0.3
That's it: you can now deploy Kafka on Kubernetes in a few simple steps, using the new Local Persistent Volume Kubernetes feature. If you're interested in more Kafka on Kubernetes with etcd, stay tuned. We'll be making our operator open source once we have fully added our observability features (we've already centralized log collection and federated monitoring for all our deployments).