Hollowtrees is a wave of the highest pedigree, the pin-up centerfold of the Mentawai Islands' surf breaks, one that brings new, machine-like connotations to the word perfection. Watch out for the aptly named 'Surgeon's Table', a brutal reef famous for taking bits and pieces of Hollowtrees' surfers as trophies. Our Hollowtrees is a ruleset-based watchguard that keeps spot instance-based clusters safe and allows them to be used in production. It handles spot price surges within a given region or availability zone and reschedules applications before instances are taken down. Hollowtrees follows a "batteries included but removable" principle and has plugins for different runtimes and frameworks. At its most basic level it manages spot-based clusters of virtual machines, but it contains plugins for Kubernetes, Prometheus and Pipeline as well.
Cloud cost management series:
- Overspending in the cloud
- Managing spot instance clusters on Kubernetes with Hollowtrees
- Monitor AWS spot instance terminations
- Diversifying AWS auto-scaling groups
- Draining Kubernetes nodes
- Cluster recommender
- Cloud instance type and price information as a service
A few weeks ago we promised we'd introduce a new open source project called Hollowtrees, which we designed to solve problems related to spending too much on cloud infrastructure. Today we're open sourcing it, and giving a quick overview of what's to come and what the project's architecture looks like. The project is under heavy development and will go through architectural improvements in the coming weeks. We've been using it internally for some time (all our clusters on EC2 are spot price based) as a core building block of the Pipeline PaaS, and have deployed it to a few early adopters. We're all very excited about the possibilities Hollowtrees is sure to bring.
If you are interested in the project's architecture and components, please read on.
We run Kubernetes clusters in the cloud and use spot instances in order to be cost effective. EC2 spot instances are available at a large discount compared to regular instances, but they can be interrupted and terminated whenever AWS needs their capacity back. Actual spot prices fluctuate constantly and are determined by EC2's own algorithm, which is based on supply and demand. Kubernetes has built-in resiliency and fault tolerance, which is a good fit for these instances: if a node is taken away, the Kubernetes cluster survives, and the pods that were running on that node are rescheduled to other nodes so life can go on. In practice, however, it's not that straightforward, mainly due to two complications:

1. Spot instances of the same type in the same availability zone are usually reclaimed together when the spot price surges, so the instance types used in a cluster should be diversified.
2. Nodes should be drained gracefully before they disappear. The kubectl drain command is a good example of how to properly drain a node before it is taken away from the Kubernetes cluster, whether temporarily or permanently.

So we wanted a project that was able to keep track of different spot markets and keep our Kubernetes spot instance clusters running properly, even when spot prices were surging or when instances were taken away by AWS. That meant solving the two complications above: diversification of instance types, and handling of node terminations, on both the AWS and the Kubernetes side. We wanted to build a plugin-based project that could intervene in a cluster's lifecycle whenever specific events were triggered, and that understood cloud provider-, Kubernetes- and, if needed, application-level behavior. We weren't interested in reinventing the wheel, and tried to reuse as much of the CNCF landscape as possible, because it had almost all the necessary building blocks. Now we'll go through a spot instance termination scenario in order to introduce the Hollowtrees architecture and to provide an example of how the project can be used to intervene in a cluster's behavior on multiple levels simultaneously.
EC2 provides a termination notice for instances that will be taken away. It's available on an internal instance metadata endpoint two minutes before the instance is shut down. This is handy if you want to execute some local cleanup scripts on the instance, or if you'd like to save state to an external storage solution, like S3, but in our case we want to make the notice available outside the instance, so we can react to it in a way that takes the state of the entire (Kubernetes) cluster into account. The Prometheus monitoring tool seemed like a good solution to this problem. Prometheus was the second project, after Kubernetes, to be accepted as a hosted project by the CNCF, and it's becoming the de facto monitoring solution for cloud native projects. Consequently, we created a Prometheus exporter that is deployed much like the standard node exporter: it runs on every spot instance we start and queries the internal metadata endpoint on every collect request made by the Prometheus server. If the endpoint becomes available, it reports new metrics on the /metrics HTTP endpoint, which is scraped by Prometheus. Here are some example metrics:
# HELP aws_instance_termination_imminent Instance is about to be terminated
# TYPE aws_instance_termination_imminent gauge
aws_instance_termination_imminent{instance_action="stop",instance_id="i-0d2aab13057917887"} 1
# HELP aws_instance_termination_in Instance will be terminated in
# TYPE aws_instance_termination_in gauge
aws_instance_termination_in{instance_id="i-0d2aab13057917887"} 119.888545
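To give a feel for how such an exporter can be built, here's a minimal sketch using the Go Prometheus client. The metadata URL is the real EC2 spot instance-action endpoint, but the port, the hardcoded instance ID and the collector structure are simplified stand-ins rather than the actual exporter's code; a real exporter would also read the instance ID from the metadata service.

package main

import (
	"encoding/json"
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// metadataURL is the EC2 spot instance-action endpoint. It returns 404
// until a termination is scheduled, then a JSON document such as
// {"action": "stop", "time": "2018-01-10T08:22:00Z"}.
const metadataURL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

// spotCollector queries the metadata endpoint on every collect request,
// as described above, instead of caching values in the background.
type spotCollector struct {
	instanceID string
	imminent   *prometheus.Desc
	in         *prometheus.Desc
}

func newSpotCollector(instanceID string) *spotCollector {
	return &spotCollector{
		instanceID: instanceID,
		imminent: prometheus.NewDesc("aws_instance_termination_imminent",
			"Instance is about to be terminated",
			[]string{"instance_action", "instance_id"}, nil),
		in: prometheus.NewDesc("aws_instance_termination_in",
			"Instance will be terminated in",
			[]string{"instance_id"}, nil),
	}
}

func (c *spotCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.imminent
	ch <- c.in
}

func (c *spotCollector) Collect(ch chan<- prometheus.Metric) {
	resp, err := http.Get(metadataURL)
	if err != nil {
		return // metadata service unreachable
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return // no termination notice yet: expose no metrics
	}
	var notice struct {
		Action string    `json:"action"`
		Time   time.Time `json:"time"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&notice); err != nil {
		return
	}
	ch <- prometheus.MustNewConstMetric(c.imminent, prometheus.GaugeValue,
		1, notice.Action, c.instanceID)
	ch <- prometheus.MustNewConstMetric(c.in, prometheus.GaugeValue,
		time.Until(notice.Time).Seconds(), c.instanceID)
}

func main() {
	// In a real exporter the instance ID would be read from the
	// instance-id metadata endpoint; it's hardcoded here for brevity.
	prometheus.MustRegister(newSpotCollector("i-0d2aab13057917887"))
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9189", nil)) // placeholder port
}

Querying the endpoint inside Collect is what makes the exporter re-check the termination notice on every scrape.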
In Prometheus, we can create a very simple alert, like the following, which notifies us when a spot instance is about to be taken away.
- alert: SpotTerminationNotice
expr: aws_instance_termination_imminent > 0
for: 1s
Okay, now that we have an alert firing in Prometheus, what do we do? We could use the Alertmanager to trigger a custom webhook endpoint, but that's not a very flexible solution; we want plugins with well-defined APIs, so we can define execution orders and pass different configuration parameters for different events. Alertmanager also groups alerts in a way that may not be particularly suited to our needs. So we created the Hollowtrees engine, a service that accepts alert messages from Prometheus, processes them based on a ruleset configured in a YAML config file, and instructs action plugins to handle events as part of an action flow.
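Since Prometheus normally ships alerts to an Alertmanager, one plausible way to wire the engine in is to register Hollowtrees in its place as an alerting target. The snippet below is a hypothetical sketch: the address, port and rule file name are placeholders, and the assumption that Hollowtrees accepts alerts on an Alertmanager-compatible endpoint is ours.

# prometheus.yml (hypothetical wiring): the Hollowtrees address and the
# rule file name are placeholders; we assume the engine accepts alerts
# on an Alertmanager-compatible endpoint.
rule_files:
  - "spot_alerts.yml"   # contains the SpotTerminationNotice rule above
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9092"]   # the Hollowtrees engine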
First, let's see what the action plugins must do. On the Kubernetes side, the plugin has to do what the kubectl drain command does: cordon the Kubernetes node by making it unschedulable (node.Spec.Unschedulable = true) and evict or delete the pods on the node that will be terminated using the Eviction API. This process terminates the pods gracefully, as opposed to terminating them violently with a KILL signal.
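As a rough illustration of what that drain amounts to, here's a minimal client-go sketch. It's a hypothetical helper, not the plugin's actual code, and it skips the filtering of DaemonSet and mirror pods, retries, and PodDisruptionBudget backoff that a production drain needs.

package action

import (
	"context"
	"log"

	policyv1beta1 "k8s.io/api/policy/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// drainNode cordons a node and evicts its pods through the Eviction
// API, which is roughly what kubectl drain does.
func drainNode(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	// Cordon: mark the node unschedulable so no new pods land on it.
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	node.Spec.Unschedulable = true
	if _, err := client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{}); err != nil {
		return err
	}

	// Evict every pod on the node through the Eviction subresource,
	// which respects PodDisruptionBudgets and grace periods, instead
	// of killing the pods outright.
	pods, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		eviction := &policyv1beta1.Eviction{
			ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
		}
		if err := client.PolicyV1beta1().Evictions(pod.Namespace).Evict(ctx, eviction); err != nil {
			log.Printf("failed to evict %s/%s: %v", pod.Namespace, pod.Name, err)
		}
	}
	return nil
}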
On the AWS side, the plugin has to replace the instance that's being taken away with a new one, preferably of a different type. For that, we work with a prediction model (a project called Telescopes, based on TensorFlow) to change spot instance types in the cluster before the two minute notice arrives, pre-size the cluster prior to job submission, and recommend instance types for whatever workload is running. That way we have more than two minutes to prepare for the shutdown, which can lead to better resiliency.

These are two completely different requirements on two different levels of the stack, so it makes sense to keep their logic completely separate: we've implemented the two plugins as two different microservices. But we must also notify them somehow, so that they execute their logic as part of the same flow alongside Hollowtrees. To accomplish this, we've defined a common gRPC interface that accepts events complying with the current CloudEvents specification (a sketch of such a plugin follows the rule example below). Hollowtrees is capable of sending events like this, so the only thing we're missing is a way to bind the disparate pieces of this process together; enter the ruleset mentioned above, which can be found in the Hollowtrees configuration file. Here's an example of what a rule looks like in YAML:
action_plugins:
- name: "ht-k8s-action-plugin"
address: "localhost:8887"
- name: "ht-aws-asg-action-plugin"
address: "localhost:8888"
rules:
- name: "drain_k8s_spot_instance"
description: "drain k8s node and replace AWS instance with a new one"
event_type: "prometheus.server.alert.SpotTerminationNotice"
action_plugins:
- "ht-k8s-action-plugin"
- "ht-aws-asg-action-plugin"
match:
- cluster_name: "test-cluster"
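To sketch the other side of that contract, here's roughly what an action plugin's server could look like in Go. The Event struct, the HandleEvent signature and the commented-out registration call are hypothetical stand-ins: the real types come from the gRPC interface definition shared by the Hollowtrees plugins.

package main

import (
	"context"
	"log"
	"net"

	"google.golang.org/grpc"
)

// Event carries the CloudEvents-style attributes the engine sends.
// This struct and the HandleEvent signature below are hypothetical
// stand-ins for the types generated from the shared proto definition.
type Event struct {
	EventType string            // e.g. "prometheus.server.alert.SpotTerminationNotice"
	EventID   string            // unique identifier of the event
	Source    string            // component that emitted the event
	Data      map[string]string // payload, e.g. alert labels like instance_id
}

// actionServer is what a plugin like ht-k8s-action-plugin implements:
// receive an event from the engine and run the matching action.
type actionServer struct{}

func (s *actionServer) HandleEvent(ctx context.Context, e *Event) error {
	switch e.EventType {
	case "prometheus.server.alert.SpotTerminationNotice":
		log.Printf("draining node of instance %s", e.Data["instance_id"])
		// ...cordon the node and evict its pods, as sketched earlier...
	default:
		log.Printf("ignoring event type %s", e.EventType)
	}
	return nil
}

func main() {
	// Listen on the address the engine's config points at
	// ("localhost:8887" in the rules above).
	lis, err := net.Listen("tcp", ":8887")
	if err != nil {
		log.Fatal(err)
	}
	srv := grpc.NewServer()
	// Registration would use the RegisterXxxServer function generated
	// from the real proto file, roughly:
	//   pb.RegisterActionServer(srv, &actionServer{})
	log.Fatal(srv.Serve(lis))
}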
The Hollowtrees project provides a pluggable mechanism to dynamically react to monitoring alerts on multiple levels of the stack at once. As seen in the example above, it has three important building blocks: the Prometheus exporters and alerts that surface events, the Hollowtrees engine that matches incoming events against its configured ruleset, and the action plugins that carry out the actual interventions.
Below is an overview of the Hollowtrees architecture and its communication flow.
In the coming weeks we'd like to extend the functionality of this project by adding more plugins and by adding support for Google's preemptible instances as well. Some of these already work in a way similar to the diversification of EC2 spot instances we mentioned at the beginning of this post (another post will follow on that), but we have a few other ideas as well, like managing EFS burst credits. Obviously, connecting this project with Pipeline is a top priority, and work is already in progress at Banzai Cloud. We're here to listen: if you have plugin ideas or requirements, or you'd like to contribute to Hollowtrees, please let us know through our GitHub page.