Published on 00/00/0000
Last updated on 00/00/0000
Published on 00/00/0000
Last updated on 00/00/0000
Share
Share
INSIGHTS
5 min read
Share
Apache Spark on Kubernetes series: Introduction to Spark on Kubernetes Scaling Spark made simple on Kubernetes The anatomy of Spark applications on Kubernetes Monitoring Apache Spark with Prometheus Spark History Server on Kubernetes Spark scheduling on Kubernetes demystified Spark Streaming Checkpointing on Kubernetes Deep dive into monitoring Spark and Zeppelin with Prometheus Apache Spark application resilience on Kubernetes
Apache Zeppelin on Kubernetes series: Running Zeppelin Spark notebooks on Kubernetes Running Zeppelin Spark notebooks on Kubernetes - deep dive
Apache Kafka on Kubernetes series: Kafka on Kubernetes - using etcd
This is the second blog post in the Spark on Kubernetes series, so I hope you'll bear with me as I recap a few items of interest from our only previous one. Containerization and cluster management technologies are constantly evolving within the cluster computing community. Apache Spark currently supports Apache Hadoop YARN and Apache Mesos, in addition to offering its own standalone cluster manager. In 2014, Google announced the development of Kubernetes which has its own feature set and differentiates itself from YARN and Mesos. In a nutshell, Kubernetes is an open source container orchestration framework that can run on local machines, on-premise or in the cloud. It supports multiple container frameworks (docker, rkt and clear containers), which allows users to select whichever they prefer. Although support for standalone Spark clusters on k8s currently exists, there are still major advantages to, and significant interest in, native execution support due to the limitations of standalone mode and the advantages of Kubernetes. This is what inspired the spark-on-k8s project, which we at Banzai Cloud are also contributing to, while simultaneously building our PaaS, Pipeline. Let’s take a look at a simple example of how to upscale Spark automatically
while it executes a basic SparkPI job that's running on EC2 instances.
NAME STATUS ROLES AGE VERSION
ip-10-0-100-40.eu-west-1.compute.internal Ready <none> 15m v1.8.3
No pods are running except a shuffle service.
$ kubectl get po -o wide
NAME READY STATUS RESTARTS AGE IP NODE
shuffle-lckjg 1/1 Running 0 19s 10.46.0.1 ip-10-0-100-40.eu-west-1.compute.internal
Let’s submit SparkPi with dynamic allocation enabled, and pass 50000 as an input parameter in order to see what’s going on in the k8s cluster.
bin/spark-submit \
--deploy-mode cluster \
--class org.apache.spark.examples.SparkPi \
--master k8s://https://34.242.4.60:443 \
--kubernetes-namespace default \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.driver.cores="400m" \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.shuffle.service.enabled=true \
--conf spark.kubernetes.shuffle.namespace=default \
--conf spark.kubernetes.shuffle.labels="app=spark-shuffle-service,spark-version=2.2.0" \
--conf spark.local.dir=/tmp/spark-local \
--conf spark.app.name=spark-pi \
--conf spark.kubernetes.driver.docker.image=banzaicloud/spark-driver-py:v2.2.0-k8s-1.0.179 \
--conf spark.kubernetes.executor.docker.image=banzaicloud/spark-executor-py:v2.2.0-k8s-1.0.179 \
--conf spark.kubernetes.authenticate.submission.caCertFile=my-k8s-aws-ca.crt \
--conf spark.kubernetes.authenticate.submission.clientKeyFile=my-k8s-aws-client.key \
--conf spark.kubernetes.authenticate.submission.clientCertFile=my-k8s-aws-client.crt \
local:///opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar 50000
$ kubectl get po
NAME READY STATUS RESTARTS AGE
spark-pi-1511535479452-driver 1/1 Running 0 2m
spark-pi-1511535479452-exec-1 0/1 Pending 0 38s
The driver is running inside the k8s cluster; so far so good. But, as we can see, only one executor is scheduled and it's not going anywhere. Let’s take a look at it:
$ kubectl describe po spark-pi-1511535479452-exec-1
...
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 11s (x8 over 1m) default-scheduler No nodes are available that match all of the predicates: Insufficient cpu (1), PodToleratesNodeTaints (1).
This warning tells us that there is no node in the k8s cluster with enough resources to run spark executors. After adding a new EC2 node and making more resources available in the k8s cluster executors
start running, since now, with the addition of the new node, there are enough resources.
NAME READY STATUS RESTARTS AGE
shuffle-lckjg 1/1 Running 0 10m
shuffle-ncprp 1/1 Running 0 47s
spark-pi-1511535479452-driver 1/1 Running 0 7m
spark-pi-1511535479452-exec-1 0/1 ContainerCreating 0 4m
NAME READY STATUS RESTARTS AGE
shuffle-lckjg 1/1 Running 0 11m
shuffle-ncprp 1/1 Running 0 56s
spark-pi-1511535479452-driver 1/1 Running 0 7m
spark-pi-1511535479452-exec-1 1/1 Running 0 5m
All up-scaling is handled transparently, as k8s began to automatically
execute pending executors as soon as the required resource became available. We didn’t have to do anything in Spark, no reconfiguration and no installation of components whatsoever. The above process still involves manual steps if you want to operate the cluster in a manual
way - however, our end goal and the aim of our Apache Spark Spotguide is to automate the whole process. Pipeline uses Prometheus to extrernalize monitoring information from cluster infrastructure, the cloud and the running Spark application, itself, then wires that information back to Kubernetes to anticipate and make upscales/downscales as practical as possible while maintaining predefined SLA rules. As usual, we've open sourced all the Spark images and charts necessary to run Spark natively on Kubernetes, and made them available in our Banzai Cloud GitHub repository. In our next post in this series we'll discuss the internals of Spark on Kubernetes. For a brief preview, check out this sequence diagram that highlights the internal flow of events and illustrates how a Spark cluster works with/inside Kubernetes. Sure, it might not be completely straightforward now, but rest assured that after finishing this series, the details will have been made crystal clear.
Get emerging insights on innovative technology straight to your inbox.
Discover why security teams rely on Panoptica's graph-based technology to navigate and prioritize risks across multi-cloud landscapes, enhancing accuracy and resilience in safeguarding diverse ecosystems.
The Shift is Outshift’s exclusive newsletter.
The latest news and updates on cloud native modern applications, application security, generative AI, quantum computing, and other groundbreaking innovations shaping the future of technology.