Apache Spark on Kubernetes series:
- Introduction to Spark on Kubernetes
- Scaling Spark made simple on Kubernetes
- The anatomy of Spark applications on Kubernetes
- Monitoring Apache Spark with Prometheus
- Spark History Server on Kubernetes
- Spark scheduling on Kubernetes demystified
- Spark Streaming Checkpointing on Kubernetes
- Deep dive into monitoring Spark and Zeppelin with Prometheus
- Apache Spark application resilience on Kubernetes

Apache Zeppelin on Kubernetes series:
- Running Zeppelin Spark notebooks on Kubernetes
- Running Zeppelin Spark notebooks on Kubernetes - deep dive

Apache Kafka on Kubernetes series:
- Kafka on Kubernetes - using etcd
Whether you deploy a Spark application on Kubernetes with or without Pipeline, you may want to keep the application's logs after it has finished. The Spark driver keeps event logs while running, but once a Spark application finishes the driver exits, so those logs are lost unless you enable event logging and set a directory where the logs are written. One option is to start the Spark History Server and point it at that same log directory, so you can reach your application logs after execution. As a reminder, the Spark History Server is the component/web UI that tracks completed and running Spark applications; it is an extension of Spark's web UI. The most straightforward way to accomplish this is to set up a distributed shared folder as the log directory (for example EFS), or to use distributed object storage such as S3 (if you're on Amazon) or Azure Blob Storage (if you're on Azure). For this example, let's use Amazon's S3 and follow up on EFS in the next post in this series.
Spark History Server in the cloud (AWS or AKS) using S3, EFS or Azure Blob storage

In short, all you need is:

helm install --set app.logDirectory=s3a://yourBucketName/eventLogFolder .
If you are interested in the details of how to set up the Spark History Server on Kubernetes and store the logs on S3, read on.
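First, make sure the S3 bucket (and the folder/prefix for event logs) exists and is writable with the access key you plan to use. A minimal sketch with the AWS CLI; the bucket name, region, and folder are placeholders:

aws s3 mb s3://yourBucketName --region eu-west-1
# after a Spark run you should see event log files appear under the prefix
aws s3 ls s3://yourBucketName/eventLogFolder/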
You will need the following config params for spark-submit:
--conf spark.eventLog.dir=s3a://yourBucketName/eventLogFolder
--conf spark.hadoop.fs.s3a.access.key=XXXXXXXXXXXXXX
--conf spark.hadoop.fs.s3a.secret.key=XXXXXXXXXXXXXXXXXXXXXXXXX
--conf spark.eventLog.enabled=true
--packages org.apache.hadoop:hadoop-aws:2.6.5
--exclude-packages org.apache.hadoop:hadoop-common,com.fasterxml.jackson.core:jackson-databind,com.fasterxml.jackson.core:jackson-annotations,com.fasterxml.jackson.core:jackson-core,org.apache.httpcomponents:httpclient,org.apache.httpcomponents:httpcore
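Put together, a submission could look something like the sketch below. This is an assumption-laden example: the Kubernetes master URL, container image, and example jar path are placeholders, and it presumes a Spark build with native Kubernetes support; only the event log and S3 settings come from the flags above.

# note: the master URL, container image and example jar path below are placeholders
spark-submit \
  --master k8s://https://<your-kubernetes-api-server>:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.kubernetes.container.image=<your-spark-image> \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=s3a://yourBucketName/eventLogFolder \
  --conf spark.hadoop.fs.s3a.access.key=XXXXXXXXXXXXXX \
  --conf spark.hadoop.fs.s3a.secret.key=XXXXXXXXXXXXXXXXXXXXXXXXX \
  --packages org.apache.hadoop:hadoop-aws:2.6.5 \
  --exclude-packages org.apache.hadoop:hadoop-common,com.fasterxml.jackson.core:jackson-databind,com.fasterxml.jackson.core:jackson-annotations,com.fasterxml.jackson.core:jackson-core,org.apache.httpcomponents:httpclient,org.apache.httpcomponents:httpcore \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar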
Note: --packages pulls in the hadoop-aws module that provides the s3a:// filesystem, while --exclude-packages keeps its transitive dependencies from clashing with the versions already bundled with Spark.
Config params to pass to the Spark History Server:
-Dspark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
-Dspark.history.fs.logDirectory=s3a://yourBucketName/eventLogFolder
-Dspark.hadoop.fs.s3a.access.key=XXXXXXXXXXXXXX
-Dspark.hadoop.fs.s3a.secret.key=XXXXXXXXXXXXXXXXXXXXXXXXX
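If you start the History Server by hand rather than with the Helm chart described below, these flags are typically passed through the SPARK_HISTORY_OPTS environment variable; a minimal sketch (keys and folder are placeholders, and the hadoop-aws and AWS SDK jars must be on the server's classpath):

# pass the -D flags to the History Server, then start it from the Spark distribution
export SPARK_HISTORY_OPTS="-Dspark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  -Dspark.history.fs.logDirectory=s3a://yourBucketName/eventLogFolder \
  -Dspark.hadoop.fs.s3a.access.key=XXXXXXXXXXXXXX \
  -Dspark.hadoop.fs.s3a.secret.key=XXXXXXXXXXXXXXXXXXXXXXXXX"
$SPARK_HOME/sbin/start-history-server.sh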
To start the Spark History Server on Kubernetes, use our open source Helm chart, to which you can pass the app.logDirectory value as a parameter: helm install --set app.logDirectory=s3a://yourBucketName/eventLogFolder spark-hs. Note that Pipeline has added and open sourced a feature for Helm that deploys applications not just via gRPC code or the out-of-the-box Helm CLI tool, but through a RESTful API as well.
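Once the chart is up, you can sanity-check the deployment and reach the UI with kubectl; a rough sketch, assuming the default History Server port of 18080 (the exact pod and service names depend on your Helm release, so look them up first):

kubectl get pods,svc | grep spark-hs
# forward the UI locally, substituting the service name from the previous command
kubectl port-forward svc/<spark-hs-service> 18080:18080
# then browse http://localhost:18080 to see the completed applications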