Published on 00/00/0000
Last updated on 00/00/0000
Cloud cost management series: Overspending in the cloud; Managing spot instance clusters on Kubernetes with Hollowtrees; Monitor AWS spot instance terminations; Diversifying AWS auto-scaling groups; Draining Kubernetes nodes; Cluster recommender; Cloud instance type and price information as a service
Last week we open sourced the Hollowtrees project, a framework that manages AWS spot instance clusters, batteries included.
In this post we’d like to take a deep dive into one of these core components - the workflow's trigger - the Prometheus Spot termination exporter. The exporter (as is the case with the other “batteries”) is a component that can be used independently of Pipeline and Hollowtrees.
EC2 provides a termination notice for instances that will be taken away. It's available on the internal instance metadata endpoint, two minutes before the instance shuts down. That's particularly useful if you want to execute a local cleanup script on the instance, or if you'd like to save state to an external storage solution like S3. In our case, however, we want to make the notice available outside the instance, in order to react to it in a way that takes the whole (Kubernetes) cluster state into account. Our open source spot instance termination exporter makes this information (and a few other AWS metrics) available outside of the EC2 instance.
In order to constantly provide the elasticity of on-demand instances, AWS must maintain a huge infrastructure with a lot of unused capacity. This unused capacity is basically the available spot instance pool; AWS lets users bid for these unused resources, usually at a significantly lower price than the on-demand price. It's not the goal of this post to go into a lot of detail about how spot instances and pricing work. If you're not familiar with these topics, read Amazon's primer. However, let's quickly recap how to request one (or more) spot instances and what their lifecycle looks like.
aws ec2 request-spot-instances \
--instance-count 1 \
--spot-price "0.3" \
--launch-specification "file://launch-spec.json"
aws ec2 run-instances \
--image-id ami-d834aba1 \
--count 1 \
--instance-type "m5.xlarge" \
--instance-market-options '{"MarketType":"Spot"}'
With the request-spot-instances CLI command, you can specify block-duration-minutes. The Spot Block model was introduced two years ago, and allows you to request instances for a finite duration: up to six hours. AWS guarantees that these instances won't be interrupted before their duration is up, but after this period is finished they will be marked for termination and receive a two minute termination notice, just like a standard spot instance.
aws ec2 request-spot-instances \
--instance-count 1 \
--spot-price "0.3" \
--launch-specification "file://launch-spec.json" \
--block-duration-minutes 360
aws ec2 request-spot-fleet \
--spot-fleet-request-config "file://request-config.json"
A spot instance's Termination Notice is accessible to code running on the instance via the metadata at http://169.254.169.254/latest/meta-data/spot/termination-time. This field becomes available as soon as the instance is marked for termination, and contains the time at which a shutdown signal will be sent to the instance's operating system. At that time, the Spot Instance Request's bid status will be set to marked-for-termination. The bid status is accessible via the DescribeSpotInstanceRequests API, for use by programs that manage Spot bids and instances. The open source spot instance termination exporter makes the spot instance termination notice (and a few other AWS metrics) available outside of the instance.
This project uses the promu Prometheus utility tool. To build the exporter, promu needs to be installed. To install promu and build the exporter:
go get github.com/prometheus/promu
promu build
The following options can be configured when starting the exporter:
./spot_expiry_exporter --help
Usage of ./spot_expiry_exporter:
-bind-addr string
bind address for the metrics server (default ":9189")
-log-level string
log level (default "info")
-metadata-endpoint string
metadata endpoint to query (default "http://169.254.169.254/latest/meta-data/")
-metrics-path string
path to metrics endpoint (default "/metrics")
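For readers who want to see how options like these are typically wired up, the following is a rough sketch using Go's standard flag package. The flag names, defaults and help strings are copied from the help output above; the config struct and the parseFlags function are hypothetical and not taken from the exporter's source.

```go
package main

import (
	"flag"
	"fmt"
)

// config mirrors the four options listed in the help output.
type config struct {
	bindAddr         string
	logLevel         string
	metadataEndpoint string
	metricsPath      string
}

// parseFlags parses args (without the program name) into a config,
// falling back to the documented defaults for anything not set.
func parseFlags(args []string) (config, error) {
	var c config
	fs := flag.NewFlagSet("spot_expiry_exporter", flag.ContinueOnError)
	fs.StringVar(&c.bindAddr, "bind-addr", ":9189", "bind address for the metrics server")
	fs.StringVar(&c.logLevel, "log-level", "info", "log level")
	fs.StringVar(&c.metadataEndpoint, "metadata-endpoint",
		"http://169.254.169.254/latest/meta-data/", "metadata endpoint to query")
	fs.StringVar(&c.metricsPath, "metrics-path", "/metrics", "path to metrics endpoint")
	err := fs.Parse(args)
	return c, err
}

func main() {
	// Override one option and leave the rest at their defaults.
	c, err := parseFlags([]string{"--log-level", "debug"})
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("%+v\n", c)
}
```

The flag package accepts both the single-dash and double-dash forms, so -log-level and --log-level are interchangeable here.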
The AWS instance metadata is available at http://169.254.169.254/latest/meta-data/. By default this is the endpoint that is queried by the exporter, but it's very difficult to reproduce a termination notice on an AWS instance for testing, so the metadata endpoint can be changed via the configuration. There is a test server in the util directory that can be used to mock the behavior of the metadata endpoint. It listens on port 9092 and provides dummy responses for /instance-id and /spot/instance-action. You can start it with:
go run util/test_server.go
To query this endpoint locally, start the exporter with this configuration:
./spot_expiry_exporter --metadata-endpoint http://localhost:9092/latest/meta-data/ --log-level debug
# HELP aws_instance_metadata_service_available Metadata service available
# TYPE aws_instance_metadata_service_available gauge
aws_instance_metadata_service_available{instance_id="i-0d2aab13057917887"} 1
# HELP aws_instance_termination_imminent Instance is about to be terminated
# TYPE aws_instance_termination_imminent gauge
aws_instance_termination_imminent{instance_action="stop",instance_id="i-0d2aab13057917887"} 1
# HELP aws_instance_termination_in Instance will be terminated in
# TYPE aws_instance_termination_in gauge
aws_instance_termination_in{instance_id="i-0d2aab13057917887"} 119.888545
At Banzai Cloud all our deployments run in Kubernetes. We use the standard Helm package manager but all our deployments use Pipeline. We've made Helm deployments available over a RESTful API, as well. The charts are available at our GitHub charts repository. To install the exporter's chart with release name spot-exporter:
$ helm install --name spot-exporter banzaicloud-incubator/spot-termination-exporter
Note that this exporter is like the standard Prometheus node exporter in that it is designed to run on every (spot) node; it monitors the internal metadata endpoint that is available from the node itself, so this chart deploys a Kubernetes DaemonSet that automatically runs a copy of the pod on every node.
While receiving a spot termination notice via Prometheus is better than nothing, the two minute timeframe is still a very short window in which to properly handle everything when draining and removing a node from a cluster. In an ideal world we would know beforehand that a spot instance is being terminated - not with two minutes to spare, but two hours. As mentioned by AWS a few months ago, Amazon is moving to a new pricing model that delivers low, predictable prices that adjust gradually, based on long-term trends in supply and demand. That means that, hopefully, there will be fewer sudden surges in spot prices - therefore, fewer interruptions by EC2 - and it will consequently be a bit easier to predict prices.
So, in the future, we intend to move to a different approach that swaps spot instances in clusters with different instance types or in different availability zones, even before the arrival of spot termination notices. Of course there will always be interruptions, because it's impossible to predict what will happen one hundred percent of the time, and when a cluster fails it will still be necessary to know how to react, so this project will still be part of all our spot instance-based deployments.