Published on 04/15/2018
Last updated on 02/03/2025
Collecting Spark History Server event logs in the cloud
Apache Spark on Kubernetes series:
- Introduction to Spark on Kubernetes
- Scaling Spark made simple on Kubernetes
- The anatomy of Spark applications on Kubernetes
- Monitoring Apache Spark with Prometheus
- Spark History Server on Kubernetes
- Spark scheduling on Kubernetes demystified
- Spark Streaming Checkpointing on Kubernetes
- Deep dive into monitoring Spark and Zeppelin with Prometheus
- Apache Spark application resilience on Kubernetes

Apache Zeppelin on Kubernetes series:
- Running Zeppelin Spark notebooks on Kubernetes
- Running Zeppelin Spark notebooks on Kubernetes - deep dive

Apache Kafka on Kubernetes series:
- Kafka on Kubernetes - using etcd
In our last blog post we described how to configure spark-submit and the Spark History Server to gather event logs to Amazon S3. Since then, we've added more supported providers to Pipeline and broadened the available options to easily capture Spark event logs to Amazon S3, Microsoft Azure WASB and Google Cloud Storage. Let's see how this works. You can use our Helm deployment charts directly; we have the following umbrella charts:
- spark deploys all the background infrastructure you need to run a spark-submit job: the Spark Resource Staging Server, Shuffle Service and Spark History Server should be enabled (by default they're not).
- zeppelin-spark deploys all of the above, plus a Zeppelin Server deployment and an externally accessible service.
If you want to experiment, you can find a few deployment examples here.
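To try these charts outside of Pipeline, a plain Helm install of the umbrella charts is enough. The commands below are a minimal sketch: the chart repository URL and the release names are assumptions, so adjust them to match your environment.

```bash
# Add the Banzai Cloud chart repository (URL assumed) and refresh the index
helm repo add banzaicloud-stable https://kubernetes-charts.banzaicloud.com
helm repo update

# Spark infrastructure only: Resource Staging Server, Shuffle Service, History Server
helm install banzaicloud-stable/spark --name my-spark

# ...or all of the above plus a Zeppelin Server (release name assumed)
helm install banzaicloud-stable/zeppelin-spark --name my-zeppelin-spark
```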
Behind the scenes
Note: the following steps are automated by Pipeline, but are listed in order to aid in understanding what goes on behind the scenes, and to serve as a comprehensive guide, in case you'd like to reproduce them in your own environment without using Pipeline.
Let's see what's happening behind the scenes. If you prefer to do things manually, you'll need to complete the following steps:
1. Configuration
Enable event logging in the Spark driver and configure the event log directories for both spark-submit and the History Server. We've already covered this topic thoroughly in our previous blog post.
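As a quick recap, event logging comes down to a handful of Spark properties pointing at the same storage location. The snippet below is a minimal sketch for the S3 case; the bucket name and path are assumptions, and for Azure or Google Cloud Storage the same properties simply take a wasb:// or gs:// URI instead.

```bash
# Driver side: enable event logging and point it at a bucket (bucket/path assumed)
spark-submit \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=s3a://my-spark-logs/eventlog \
  ...

# History Server side: read the logs back from the same location
export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=s3a://my-spark-logs/eventlog"
./sbin/start-history-server.sh
```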
2. Building images
You need an image that includes the Hadoop FileSystem drivers for each cloud storage option (a build sketch follows this list):
- the AWS libraries are included by default in Spark's distribution
- the Azure SDK can be included using the hadoop-2.7 profile
- the Google Connector has to be included as a dependency of the hadoop-cloud module
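For reference, a Spark build that pulls in these drivers looks roughly like the command below. This is a sketch only: the exact profile set and image tooling differ between the k8s branch and master, so treat the flags as assumptions to adapt to the branch you build from.

```bash
# Build Spark with Kubernetes support plus the cloud storage profiles (sketch)
./build/mvn -Pkubernetes -Phadoop-2.7 -Phadoop-cloud -DskipTests clean package
```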
Currently, we build our Spark images from Spark's k8s branch, since not all of its features have been ported to the master branch yet. You'll need a few patches to include the Google Connector; let's see what these are:
- SPARK-7481 introduces the spark-hadoop-cloud module, which is not present in Spark k8s and has to be cherry-picked from the master branch.
- Add hadoop-cloud profile dependency. This is a necessary fix, since, by default, the spark-hadoop-cloud module is not included in the docker bundle.
- Add gcs connector dependency to hadoop-cloud module. This includes the Google Connector dependency in the spark-hadoop-cloud module (a dependency sketch follows below). It also updates Guava to a newer version in the docker bundle, as the current one is quite old.
We'll provide a patch to include an optional Google Connector for the master branch, as soon as these features are ported there, so we can use it as a basis for our Spark images.
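To illustrate the gcs connector patch above, the dependency added to the hadoop-cloud module's pom.xml looks roughly like this; the version is an assumption, so pick the gcs-connector release that matches your Hadoop version.

```xml
<!-- Sketch: GCS connector added to the spark-hadoop-cloud module (version assumed) -->
<dependency>
  <groupId>com.google.cloud.bigdataoss</groupId>
  <artifactId>gcs-connector</artifactId>
  <version>hadoop2-1.9.17</version>
</dependency>
```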
3. Access to different Cloud Storages
Access is granted either by providing access keys - this works on all cloud providers - or on the basis of policies/rules. Let's see what you need to set up on each cloud provider:
On Amazon it's possible to gain access to S3 storage using policies. For example, you can add the following policy to your instance profile:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        ...
        "s3:ListBucket",
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListObjects",
        "s3:DeleteObject"
      ],
      "Resource": "*"
    }
  ]
}
```
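Attaching such a policy to the role behind your instance profile can be done with the AWS CLI; the role and policy names below are assumptions.

```bash
# Attach the inline policy above to the instance profile's role (names assumed)
aws iam put-role-policy \
  --role-name my-spark-node-role \
  --policy-name spark-eventlog-s3 \
  --policy-document file://s3-eventlog-policy.json
```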
On Google Cloud, if you create your bucket and cluster with the same Service Account, the only thing you have to add is the following scopes to your node config:

```go
Config: &gke.NodeConfig{
    MachineType:    nodePoolModel.NodeInstanceType,
    ServiceAccount: nodePoolModel.ServiceAccount,
    OauthScopes: []string{
        ...
        "https://www.googleapis.com/auth/devstorage.read_write",
    },
},
```
On Azure there's no role-based access via the Hadoop FS connector so far, so you have to provide your Storage Account credentials, azureStorageAccountName and azureStorageAccessKey, to the History Server and spark-submit as options:

```
-Dspark.hadoop.fs.azure.account.key.{{ azureStorageAccountName }}.blob.core.windows.net={{ azureStorageAccessKey }}
```
Keep in mind that these storage buckets have to be created beforehand. Pipeline automates those steps as well, utilizing a special Kubernetes operator that automatically creates buckets on any cloud provider.
