Monitoring series:
Monitoring Apache Spark with Prometheus
Monitoring multiple federated clusters with Prometheus - the secure way
Application monitoring with Prometheus and Pipeline
Building a cloud cost management system on top of Prometheus
Monitoring Spark with Prometheus, reloaded
At Banzai Cloud we deploy large distributed applications to Kubernetes clusters that we also operate. We don't enjoy waking up to PagerDuty notifications in the middle of the night, so we try to get ahead of problems by operating these clusters as efficiently as possible. We provide out-of-the-box centralized log collection, end-to-end tracing, monitoring and alerts for all the spotguides we support, covering applications, Kubernetes clusters and all the infrastructure we deploy. For monitoring, we chose Prometheus - the de facto monitoring tool for cloud-native applications - and we've already explored a number of advanced use cases in the series of posts above.
However, when monitoring with Prometheus there's a catch: Prometheus uses a pull model over http(s) to scrape data from applications. For batch jobs it also supports a push model, but enabling this feature requires a special component called pushgateway.
There's no out-of-the-box solution for monitoring Spark with Prometheus: Spark supports a few sinks out-of-the-box (Graphite, CSV, Ganglia), but Prometheus is not one of them, so we wrote a new Prometheus sink (here's the PR and the related Apache ticket, SPARK-22343). Long story short, we've blogged about the original proposal before, but the community proved unreceptive to the idea of adding a new industry-standard monitoring solution to the codebase. We closed our PR and are now making the Apache Spark Prometheus Sink available the alternative way. Ideally this would be part of Spark, like the other sinks, but it works just as well from the classpath: we have externalized the sink into a separate library, which you can use either by building it yourself or by taking it from our Maven repository.
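If you'd rather pull the library into your own build, a minimal sbt sketch might look like the following; the resolver name is arbitrary, and the coordinates are the ones used by spark-submit below (Scala 2.11 is assumed):
// add the GitHub-hosted Maven repository as a resolver
resolvers += "banzaicloud" at "https://raw.github.com/banzaicloud/spark-metrics/master/maven-repo/releases"
// the library containing PrometheusSink
libraryDependencies += "com.banzaicloud" % "spark-metrics_2.11" % "2.2.1-1.0.0"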
PrometheusSink is a Spark metrics sink that publishes Spark metrics to Prometheus.
As previously mentioned, Prometheus uses a pull model over http(s) to scrape data from applications. For batch jobs it also supports a push model. We need to use this model, since Spark pushes metrics to sinks. In order to enable this feature for Prometheus, a special component called pushgateway must be running.
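If you don't have a pushgateway running yet, the official prom/pushgateway Docker image is the quickest way to start one locally for testing - just a sketch; in a real cluster you'd typically deploy it with a Kubernetes manifest or Helm chart instead:
# start a pushgateway listening on its default port, 9091
docker run -d -p 9091:9091 prom/pushgateway
# verify it's up - the pushgateway exposes its own metrics here
curl http://127.0.0.1:9091/metrics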
Spark publishes metrics to sinks listed in the metrics configuration file. The location of the metrics configuration file can be specified for spark-submit as follows:
--conf spark.metrics.conf=<path_to_the_file>/metrics.properties
Add the following lines to the metrics configuration file:
# Enable Prometheus for all instances by class name
*.sink.prometheus.class=com.banzaicloud.spark.metrics.sink.PrometheusSink
# Prometheus pushgateway address
*.sink.prometheus.pushgateway-address-protocol=<prometheus pushgateway protocol> - defaults to http
*.sink.prometheus.pushgateway-address=<prometheus pushgateway address> - defaults to 127.0.0.1:9091
*.sink.prometheus.period=<period> - defaults to 10
*.sink.prometheus.unit=<unit> - defaults to seconds (TimeUnit.SECONDS)
*.sink.prometheus.pushgateway-enable-timestamp=<enable/disable metrics timestamp> - defaults to false
pushgateway-address-protocol - the scheme of the URL where the pushgateway service is available
pushgateway-address - the host and port URL where the pushgateway service is available
period - controls the periodicity of metrics sent to pushgateway
unit - the time unit of that periodicity
pushgateway-enable-timestamp - enables sending timestamps with the metrics pushed to the pushgateway service. This is disabled by default, as not all versions of the pushgateway service support timestamps for metrics.
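For example, a complete metrics.properties that pushes metrics every 5 seconds might look like this - the pushgateway address is a hypothetical one, so adjust it to your environment:
# publish metrics from all instances through PrometheusSink
*.sink.prometheus.class=com.banzaicloud.spark.metrics.sink.PrometheusSink
# hypothetical pushgateway endpoint
*.sink.prometheus.pushgateway-address-protocol=http
*.sink.prometheus.pushgateway-address=pushgateway.example.com:9091
# push metrics every 5 seconds
*.sink.prometheus.period=5
*.sink.prometheus.unit=seconds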
spark-submit needs to know the repository from which it downloads the jar containing PrometheusSink:
--repositories https://raw.github.com/banzaicloud/spark-metrics/master/maven-repo/releases
Note: this is a Maven repo hosted on GitHub
We also have to specify the spark-metrics package, which includes PrometheusSink and its dependencies, for spark-submit:
--packages com.banzaicloud:spark-metrics_2.11:2.2.1-1.0.0,io.prometheus:simpleclient:0.0.23,io.prometheus:simpleclient_dropwizard:0.0.23,io.prometheus:simpleclient_pushgateway:0.0.23,io.dropwizard.metrics:metrics-core:3.1.2
The version number of the package is formatted as: com.banzaicloud:spark-metrics_<scala version>:<spark version>-<version>
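Putting it all together, a full spark-submit invocation might look like the sketch below; the master URL, main class and application jar are placeholders, and the metrics.properties path is an assumption:
spark-submit \
  --master <master-url> \
  --conf spark.metrics.conf=/etc/spark/metrics.properties \
  --repositories https://raw.github.com/banzaicloud/spark-metrics/master/maven-repo/releases \
  --packages com.banzaicloud:spark-metrics_2.11:2.2.1-1.0.0,io.prometheus:simpleclient:0.0.23,io.prometheus:simpleclient_dropwizard:0.0.23,io.prometheus:simpleclient_pushgateway:0.0.23,io.dropwizard.metrics:metrics-core:3.1.2 \
  --class <main-class> \
  <application-jar>
Once the job is running, fetching http://<pushgateway address>/metrics should list the Spark metrics pushed by the sink, ready for Prometheus to scrape.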
This is not the ideal scenario, but it does the job, and it's independent of the Spark codebase. At Banzai Cloud we're still hoping to contribute this sink once the community decides it actually needs it. Meanwhile, you can open new feature requests, use the existing PR, or ask for native Prometheus support through one of our social media channels. As usual, we're happy to help - all the software we use and create is open source, and we're always eager to help make open source projects better.
If you'd like to contribute to making this available as an official Spark package, let us know.