Apache Spark on Kubernetes series:
- Introduction to Spark on Kubernetes
- Scaling Spark made simple on Kubernetes
- The anatomy of Spark applications on Kubernetes
- Monitoring Apache Spark with Prometheus
- Spark History Server on Kubernetes
- Spark scheduling on Kubernetes demystified
- Spark Streaming Checkpointing on Kubernetes
- Deep dive into monitoring Spark and Zeppelin with Prometheus
- Apache Spark application resilience on Kubernetes

Apache Zeppelin on Kubernetes series:
- Running Zeppelin Spark notebooks on Kubernetes
- Running Zeppelin Spark notebooks on Kubernetes - deep dive

Apache Kafka on Kubernetes series:
- Kafka on Kubernetes - using etcd
Whether you deploy a Spark application on Kubernetes with or without Pipeline, you may want to keep the application's logs after it has finished. The Spark driver keeps event logs while running, but once a Spark application finishes the driver exits, so those logs are lost unless you enable event logging and set a directory where the logs are written. One option is to start the Spark History Server and point it at that same log directory, so you can reach your application logs after execution. As a reminder, the Spark History Server is the component/web UI that tracks completed and running Spark applications; it is an extension of Spark's web UI. The most straightforward way to accomplish this is to set up a distributed shared folder as the log directory (for example EFS), or to use distributed object storage such as S3 (if you're on Amazon) or Azure Blob Storage (if you're on Azure). For this example, let's use Amazon's S3 and follow up on EFS in the next post in this series.
Spark History Server in the cloud (AWS or AKS) using S3, EFS or Azure Blob storage

In short, all you need is:

helm install --set app.logDirectory=s3a://yourBucketName/eventLogFolder .
If you are interested in the details of how to set up the Spark History Server on Kubernetes and store the logs on S3, read on.
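First, make sure the S3 bucket (and the folder/prefix for event logs) exists and is writable with the access key you plan to use. A minimal sketch with the AWS CLI; the bucket name, region, and folder are placeholders:

aws s3 mb s3://yourBucketName --region eu-west-1
# after a Spark run you should see event log files appear under the prefix
aws s3 ls s3://yourBucketName/eventLogFolder/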
You will need the following config params for spark-submit:
--conf spark.eventLog.dir=s3a://yourBucketName/eventLogFolder
--conf spark.hadoop.fs.s3a.access.key=XXXXXXXXXXXXXX
--conf spark.hadoop.fs.s3a.secret.key=XXXXXXXXXXXXXXXXXXXXXXXXX
--conf spark.eventLog.enabled=true
--packages org.apache.hadoop:hadoop-aws:2.6.5
--exclude-packages org.apache.hadoop:hadoop-common,com.fasterxml.jackson.core:jackson-databind,com.fasterxml.jackson.core:jackson-annotations,com.fasterxml.jackson.core:jackson-core,org.apache.httpcomponents:httpclient,org.apache.httpcomponents:httpcore
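Put together, a submission could look something like the sketch below. This is an assumption-laden example: the Kubernetes master URL, container image, and example jar path are placeholders, and it presumes a Spark build with native Kubernetes support; only the event log and S3 settings come from the flags above.

# note: the master URL, container image and example jar path below are placeholders
spark-submit \
  --master k8s://https://<your-kubernetes-api-server>:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.kubernetes.container.image=<your-spark-image> \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=s3a://yourBucketName/eventLogFolder \
  --conf spark.hadoop.fs.s3a.access.key=XXXXXXXXXXXXXX \
  --conf spark.hadoop.fs.s3a.secret.key=XXXXXXXXXXXXXXXXXXXXXXXXX \
  --packages org.apache.hadoop:hadoop-aws:2.6.5 \
  --exclude-packages org.apache.hadoop:hadoop-common,com.fasterxml.jackson.core:jackson-databind,com.fasterxml.jackson.core:jackson-annotations,com.fasterxml.jackson.core:jackson-core,org.apache.httpcomponents:httpclient,org.apache.httpcomponents:httpcore \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar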
Note: --packages pulls in the hadoop-aws module that provides the s3a:// filesystem, while --exclude-packages keeps its transitive dependencies from clashing with the versions already bundled with Spark.
Config params to pass to the Spark History Server:
-Dspark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
-Dspark.history.fs.logDirectory=s3a://yourBucketName/eventLogFolder
-Dspark.hadoop.fs.s3a.access.key=XXXXXXXXXXXXXX
-Dspark.hadoop.fs.s3a.secret.key=XXXXXXXXXXXXXXXXXXXXXXXXX
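If you start the History Server by hand rather than with the Helm chart described below, these flags are typically passed through the SPARK_HISTORY_OPTS environment variable; a minimal sketch (keys and folder are placeholders, and the hadoop-aws and AWS SDK jars must be on the server's classpath):

# pass the -D flags to the History Server, then start it from the Spark distribution
export SPARK_HISTORY_OPTS="-Dspark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  -Dspark.history.fs.logDirectory=s3a://yourBucketName/eventLogFolder \
  -Dspark.hadoop.fs.s3a.access.key=XXXXXXXXXXXXXX \
  -Dspark.hadoop.fs.s3a.secret.key=XXXXXXXXXXXXXXXXXXXXXXXXX"
$SPARK_HOME/sbin/start-history-server.sh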
To start the Spark History Server on Kubernetes, use our open source Helm chart, to which you can pass the app.logDirectory value as a parameter: helm install --set app.logDirectory=s3a://yourBucketName/eventLogFolder spark-hs. Note that Pipeline has added and open sourced a feature for Helm that deploys applications not just via gRPC code or the out-of-the-box Helm CLI tool, but through a RESTful API as well.
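Once the chart is up, you can sanity-check the deployment and reach the UI with kubectl; a rough sketch, assuming the default History Server port of 18080 (the exact pod and service names depend on your Helm release, so look them up first):

kubectl get pods,svc | grep spark-hs
# forward the UI locally, substituting the service name from the previous command
kubectl port-forward svc/<spark-hs-service> 18080:18080
# then browse http://localhost:18080 to see the completed applications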