IN-DEPTH TECH

6 min read

by

Sandor Magyari

Published on 12/04/2017

Last updated on 02/12/2025

Running Zeppelin Spark notebooks on Kubernetes

Apache Spark on Kubernetes series:
Introduction to Spark on Kubernetes
Scaling Spark made simple on Kubernetes
The anatomy of Spark applications on Kubernetes
Monitoring Apache Spark with Prometheus
Spark History Server on Kubernetes
Spark scheduling on Kubernetes demystified
Spark Streaming Checkpointing on Kubernetes
Deep dive into monitoring Spark and Zeppelin with Prometheus
Apache Spark application resilience on Kubernetes

Apache Zeppelin on Kubernetes series:
Running Zeppelin Spark notebooks on Kubernetes
Running Zeppelin Spark notebooks on Kubernetes - deep dive

Apache Kafka on Kubernetes series:
Kafka on Kubernetes - using etcd

In the last post in this series we introduced Spark on Kubernetes and demonstrated a simple approach to scaling a Spark cluster on Kubernetes, the cloud native way. We're building Pipeline, a platform as a service optimized to run big data workloads (and a multitude of microservices), and we're adding native support for running Apache Zeppelin notebooks on Kubernetes.

Apache Zeppelin is a web-based notebook for interactive, data-driven analytics. It provides built-in integration for Apache Spark and has about five different interpreters at its disposal to execute Scala, Python, R and SQL code on Spark. These interpreters run remotely and are created and managed by RemoteInterpreterServer. To run RemoteInterpreterServer, Zeppelin uses the well-known Spark tool spark-submit. By default, spark-submit is started on the machine where Zeppelin runs, which starts Spark in embedded mode. You can also run Spark on a YARN cluster in both client and cluster mode.

Since Kubernetes keeps growing in popularity and is a de facto standard for enterprise-level orchestration, it makes more and more sense to start your Spark notebooks on Kubernetes the native way. There's already a working version of Spark on Kubernetes, in which Spark jobs are started with spark-submit. Because Zeppelin uses that same tool by default, all we should have to do is pass the right parameters. Easy peasy. Unfortunately, as always, it's not that easy. First, let's go over the main options for a Zeppelin deployment on Kubernetes:

A Zeppelin server running outside or inside a k8s cluster
spark-submit started with deployMode = client/cluster
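
The two choices above surface as arguments to spark-submit. As a minimal sketch, the helper below assembles such a command line. The function name, the API server address, and the namespace are hypothetical and only serve as illustration; the `--master k8s://...` and `--deploy-mode` arguments and the `spark.kubernetes.namespace` and `spark.executor.instances` settings follow the upstream Spark documentation and may differ in the Spark on Kubernetes build used for this post. The application jar and main class are left out on purpose.

```python
from typing import List


def build_spark_submit_args(api_server: str,
                            deploy_mode: str,
                            namespace: str = "default",
                            executors: int = 2) -> List[str]:
    """Assemble a (partial) spark-submit command line targeting a Kubernetes master."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError(f"unsupported deploy mode: {deploy_mode}")
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",      # Kubernetes API server as the Spark master
        "--deploy-mode", deploy_mode,           # decides where the driver process runs
        "--conf", f"spark.kubernetes.namespace={namespace}",
        "--conf", f"spark.executor.instances={executors}",
    ]


if __name__ == "__main__":
    # Hypothetical API server address, used only for illustration.
    print(" ".join(build_spark_submit_args("https://10.0.0.1:6443", "cluster")))
```

Returning the arguments as a list rather than a single string keeps them safe to pass to a process launcher without shell quoting issues.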

The spark-submit deployMode setting determines where the driver process runs. In cluster mode, the framework launches the driver inside the cluster. In client mode, the submitter launches the driver outside of the cluster. True client mode is so far unsupported. PR-456 (in-cluster client mode) lets you specify your deployMode as client. However, this is not truly client mode but the misleadingly titled in-cluster client mode, because it only works if you're running spark-submit inside a pod. The other mode is cluster mode, in which the Spark Driver always runs inside the k8s cluster while the Zeppelin Server can run either inside or outside the cluster. In that case, the flow of actions is identical to the one presented in the Spark Flow blog post, except that the actor initiating spark-submit is the Zeppelin Server itself. Let's start a notebook in both modes to see a list of running pods. Below is a list of pods for in-cluster client mode:

<pre class="language-unknown"><code class="language-unknown">NAME                                                 READY     STATUS    RESTARTS   AGE
spark-resource-staging-server-55d5d8744b-7r89s       1/1       Running   2          15d
zeppelin-exec-1                                      1/1       Running   0          2d
zeppelin-exec-2                                      1/1       Running   0          2d
zeppelin-server                                      1/1       Running   0          2d</code></pre>

Here's the list of pods for cluster mode:

<pre class="language-unknown"><code class="language-unknown">NAME                                                 READY     STATUS    RESTARTS   AGE
spark-resource-staging-server-55d5d8744b-7r89s       1/1       Running   2          15d
zeppelin-2cxx5tw2h--2a94m5j1z-1511971765863-driver   1/1       Running   0          2d
zeppelin-2cxx5tw2h--2a94m5j1z-1511971765863-exec-1   1/1       Running   0          2d
zeppelin-2cxx5tw2h--2a94m5j1z-1511971765863-exec-2   1/1       Running   0          2d
zeppelin-server                                      1/1       Running   0          2d</code></pre>

As you can see, in cluster mode, we have a separate pod for the Spark driver, whereas in-cluster client mode runs that driver inside the zeppelin-server pod. After experimenting with these modes, we have found that cluster mode has a couple of advantages over in-cluster client mode. Let's take a look:

running Zeppelin Server and each RemoteInterpreterServer process (Spark Driver) in separate pods suits Kubernetes best practices/patterns better than having one monolithic RemoteInterpreterServer. In cluster mode, the latest Spark Driver also creates a separate k8s Service to handle Executor --> Driver connections, which, again, fits better with Kubernetes best practices/patterns
cluster mode works even if Zeppelin Server is not part of the k8s cluster
in-cluster client mode appears to have some problems at the moment, such as executor names that don't contain any prefix, which causes problems when running multiple interpreters on the same cluster
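
The executor naming problem from the last point can be illustrated with a small sketch. It assumes names shaped like the ones in the pod listings above (`zeppelin-exec-N` without a prefix, `<prefix>-exec-N` with one); both helper functions and the two prefixes are hypothetical and not part of Spark or Zeppelin.

```python
from typing import List


def executor_pod_names(count: int, prefix: str = "") -> List[str]:
    """Generate executor pod names, optionally scoped by a per-application prefix."""
    base = f"{prefix}-exec" if prefix else "zeppelin-exec"
    return [f"{base}-{i}" for i in range(1, count + 1)]


def colliding_names(*name_lists: List[str]) -> List[str]:
    """Return, sorted, every pod name that occurs more than once across the lists."""
    seen, clashes = set(), set()
    for names in name_lists:
        for name in names:
            if name in seen:
                clashes.add(name)
            seen.add(name)
    return sorted(clashes)


if __name__ == "__main__":
    # Two interpreters without a prefix ask for the same pod names.
    print(colliding_names(executor_pod_names(2), executor_pod_names(2)))
    # With a distinct (hypothetical) prefix per interpreter the names stay unique.
    print(colliding_names(executor_pod_names(2, "zeppelin-aaa-111"),
                          executor_pod_names(2, "zeppelin-bbb-222")))
```

Since pod names must be unique within a namespace, the first case is the one that breaks when two interpreters share a cluster.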

All in all, choosing cluster mode seems to be a cleaner approach that fits better with the k8s ecosystem and, at the same time, has no side effects for those who don't want to use Kubernetes. We did run into a couple of problems:

communication between Zeppelin Server and RemoteInterpreterServer happens via a callback mechanism: Zeppelin Server starts a CallbackServer and passes its address to RemoteInterpreterServer. Once started by Spark, RemoteInterpreterServer connects to this CallbackServer and sends its own connection details. Connecting pods directly to each other may cause problems in a Kubernetes cluster; the preferred method is to interconnect different processes through services
the current mechanism that sets up dependencies for the Spark interpreter in the interpreter settings only works if Spark is running in embedded mode
there is no separate log4j config like the one available for YARN
if Zeppelin Server is unable to connect to RemoteInterpreterServer, or loses its connection due to a k8s-specific problem, the Driver pod remains there forever
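
The callback mechanism from the first point can be sketched as a small handshake over TCP. This is an illustration only, not Zeppelin's implementation: the function names, the JSON payload, and the addresses are assumptions, and plain JSON over a socket stands in for whatever wire protocol Zeppelin actually uses. In a Kubernetes setup, the callback host would ideally be the name of a Service rather than a pod IP.

```python
import json
import socket
import threading
from typing import Dict, Tuple


def start_callback_server(result: Dict[str, object]) -> Tuple[socket.socket, threading.Thread]:
    """Listen on an ephemeral local port and record the first registration received."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)

    def accept_once() -> None:
        conn, _ = server.accept()
        with conn, conn.makefile("r") as reader:
            # One JSON document per line carries the interpreter's address.
            result.update(json.loads(reader.readline()))

    worker = threading.Thread(target=accept_once, daemon=True)
    worker.start()
    return server, worker


def register_interpreter(callback_host: str, callback_port: int,
                         interpreter_host: str, interpreter_port: int) -> None:
    """Connect back to the callback server and report where the interpreter listens."""
    payload = json.dumps({"host": interpreter_host, "port": interpreter_port})
    with socket.create_connection((callback_host, callback_port), timeout=5) as conn:
        conn.sendall((payload + "\n").encode("utf-8"))


if __name__ == "__main__":
    registered: Dict[str, object] = {}
    server, worker = start_callback_server(registered)
    host, port = server.getsockname()
    # Hypothetical interpreter address, used only for illustration.
    register_interpreter(host, port, "10.1.2.3", 30000)
    worker.join(timeout=5)
    server.close()
    print(registered)
```

The point of the sketch is the direction of the connection: the interpreter has to reach back to the server that launched it, which is why the address handed to it must be routable from inside the cluster.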

We are working on the issues above and contributing pull requests to address them, and our PaaS, Pipeline, already uses a patched version of Zeppelin that supports these fixes. If you're interested in more technical details, check back for the next post in this series, which will be a more in-depth walkthrough of the solution we intend to use and will elaborate on how to start a notebook on Kubernetes using our open source, prebuilt images. Please find a typical flow for Zeppelin notebooks running on a Kubernetes cluster below.
