A few weeks ago we discussed the way that we integrated Kubernetes federation v2 into Pipeline, and took a deep dive into how it works. This is the next post in our federation multi cloud/cluster series, in which we'll dig into some real world use cases involving one of Kubefed's most interesting features: Replica Scheduling Preference.

Note that every multi-cloud or hybrid cloud use case requires a different architectural approach. Built on our cluster group feature, the Pipeline platform supports multiple scenarios while maintaining the same clean and consistent UX:
- Multi-cloud application management
- An Istio based automated service mesh for multi- and hybrid-cloud deployments
- Federated resource and application deployments built on Kubernetes federation v2
- The Pipeline hybrid cloud controller manager - a Kubernetes native hybrid cloud approach, which you can expect from our R&D lab soon
When you create a FederatedDeployment, by default the number of replicas will be the same across all member clusters. If you don't want to distribute replicas equally across clusters/clouds, you can override the replica count per cluster. An alternative way of specifying the number of replicas in each cluster is to create a ReplicaSchedulingPreference: instead of per-cluster counts, you specify a total replica count and a weight for each cluster, and the scheduler distributes the replicas accordingly. This is extremely useful when you want to scale your deployments and you have more than a few clusters to replicate across.

Even more interesting is the rebalance feature of ReplicaSchedulingPreference. When enabled - and by default it is not - this feature monitors the replica pods of a target workload in each federated cluster. If it finds that some clusters are not able to schedule those pods, it moves - rebalances - replicas to clusters where all the pods are running and healthy. In other words, it moves replica workloads away from clusters which are running out of room and towards clusters which have adequate capacity. Below, after a quick sketch of the manual override alternative, you will find an example of 12 replicas being distributed, 66% of them on the banzaionprem cluster.
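To see what the manual alternative looks like, here is a rough sketch of per-cluster replica overrides set directly on a FederatedDeployment (the cluster names and counts are illustrative, not taken from this demo):

spec:
  # template and placement omitted for brevity
  overrides:
  - clusterName: cluster-a
    clusterOverrides:
    - path: /spec/replicas
      value: 8
  - clusterName: cluster-b
    clusterOverrides:
    - path: /spec/replicas
      value: 4

And this is the ReplicaSchedulingPreference that distributes the 12 replicas mentioned above: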
apiVersion: scheduling.kubefed.io/v1alpha1
kind: ReplicaSchedulingPreference
metadata:
  name: test-deployment
  namespace: test
spec:
  targetKind: FederatedDeployment
  clusters:
    banzaionprem:
      weight: 3
    banzaispot:
      weight: 1
  rebalance: true
  totalReplicas: 12

What is important here is that the ReplicaSchedulingPreference must have the same name as the target FederatedDeployment. The ReplicaScheduler will modify the FederatedDeployment resource, adding replica count overrides to the clusterOverrides section, and will similarly modify the placement section, which means that if you've used a clusterSelector to select target clusters, that selection will be overridden. You can check out how the FederatedDeployment resource is updated by the ReplicaScheduler later in this example. Also note that, once you delete the ReplicaSchedulingPreference, the deployment will remain scaled as it is (its prior state is not known), and that the ReplicaScheduler is able to handle Deployments and ReplicaSets as well.

SchedulerManager is responsible for starting up a controller for each scheduling preference, which in the case of ReplicaSchedulingPreference is the ReplicaSchedulingPreferenceController. As you will see, the scheduling feature in Kubefed is implemented in a way that's generic and extendable, so you'll be able to write your own scheduler if you need to. At this point, only ReplicaSchedulingPreference is available, but hopefully there's more to come, like JobSchedulerPreference and HPASchedulerPreference (we're even working on some of our own). SchedulerManager starts a plugin for each target kind handled by the ReplicaScheduler - FederatedDeployment and FederatedReplicaSet - and these plugins are actually responsible for updating the target resources. Besides the Scheduler itself, you have to implement a SchedulingPreferenceController and one or more plugins. ReplicaSchedulingPreferenceController starts the ReplicaScheduler and also watches for ReplicaSchedulingPreference resource changes. The complete flow can be seen in the diagram below:
The ReplicaScheduler's schedule cycle is triggered by the ReplicaSchedulingPreferenceController's reconcile loop, which in turn reacts to events related to deployments on a member cluster, or to ReplicaSchedulingPreference resource changes. In other words, it runs whenever the replica counts of a given deployment on a member cluster change. For this to work, the ReplicaScheduler fetches pod statuses from each member cluster, counting running and unschedulable pods alike. The distribution of replica counts is computed in the Planner, while the actual update is done by the plugin that corresponds to the federated target type.
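As mentioned above, the scheduler handles ReplicaSets as well as Deployments; targeting a FederatedReplicaSet only requires changing targetKind and naming the preference after the target resource. A minimal sketch (the name, namespace, weights and counts are illustrative):

apiVersion: scheduling.kubefed.io/v1alpha1
kind: ReplicaSchedulingPreference
metadata:
  name: test-replicaset        # must match the FederatedReplicaSet name
  namespace: test
spec:
  targetKind: FederatedReplicaSet
  totalReplicas: 6
  rebalance: true
  clusters:
    banzaionprem:
      weight: 1
    banzaispot:
      weight: 1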
While Kubefed is still in beta (but stable enough for us to start using it), we have customers who have already started their proofs of concept using the Pipeline platform. While these vary based on whether they are on-prem, 100% cloud or a mix of multi- and hybrid-cloud deployments, we have collected some of the more interesting use cases we've seen or have been working on together with our enterprise customers:
- With a ReplicaSchedulingPreference, you can take down one or more clusters for upgrade or maintenance and, given the capacity is there (which is often the case on-premise, or is easily increased in the cloud), the desired number of replicas will automatically be scheduled on the other member clusters.
- You can set a ReplicaSchedulingPreference so that you deploy 100% to the cloud; if you temporarily run out of resources, your deployment will be rebalanced to on-premise until the cloud clusters are scaled out. Note that Pipeline can provide predictive scaling based on metrics as well.
- Customers running a large CI system on-premise, with attached priorities, can scale it out into the cloud when the on-premise cluster runs out of capacity.

To demonstrate how ReplicaSchedulingPreference works in practice, we chose the last use case from the list above. We will be using the same Satellite application we did in the previous post, and will create and federate clusters in much the same way.
We will create two Kubernetes clusters on AWS, using our own lightweight CNCF certified Kubernetes distribution, PKE: one fixed size cluster and one spot cluster with autoscaling enabled. banzaionprem is intended to play the role of an on-premise cluster, so, for demo purposes only, it will contain a fixed size on-demand nodepool with no scaling enabled and a single c4.xlarge on-demand instance. banzaispot will have one nodepool of c4.xlarge spot instances, starting with one node but able to scale to three. The Pipeline platform automates all of this for you and supports five clouds and six different Kubernetes distributions; as a matter of fact, it's possible to import any Kubernetes distribution into Pipeline.
Should you inspect the API calls behind the UI wizard, you'll find that there are two of them: one to create a cluster group and a second to enable its federation feature. From here on, we'll use kubectl with the host cluster's kubeconfig to create the federated resources, and the kubeconfig of the member cluster to watch replicas being moved over there. Let's deploy our Satellite application as a federated deployment, with preferences set so that it runs 66% of its replicas on the banzaionprem cluster.
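The demo manifests are linked below; to give a sense of their shape, here is a rough sketch of a FederatedNamespace and a FederatedDeployment (illustrative only: the apiVersion, labels and container image are assumptions and may differ from the actual demo files and your Kubefed release):

apiVersion: types.kubefed.io/v1beta1
kind: FederatedNamespace
metadata:
  name: test
  namespace: test
spec:
  placement:
    clusters:
    - name: banzaionprem
    - name: banzaispot
---
apiVersion: types.kubefed.io/v1beta1
kind: FederatedDeployment
metadata:
  name: test-deployment        # must match the ReplicaSchedulingPreference name
  namespace: test
spec:
  template:
    metadata:
      labels:
        app: satellite
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: satellite
      template:
        metadata:
          labels:
            app: satellite
        spec:
          containers:
          - name: satellite
            image: banzaicloud/satellite   # illustrative image reference
  placement:
    clusters:
    - name: banzaionprem
    - name: banzaispot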
kubectl create ns test
kubectl create -f https://raw.githubusercontent.com/banzaicloud/kubefed/demo-examples/example/demo-rsp/federatednamespace.yaml
kubectl create -f https://raw.githubusercontent.com/banzaicloud/kubefed/demo-examples/example/demo-rsp/federateddeployment.yaml
kubectl create -f https://raw.githubusercontent.com/banzaicloud/kubefed/demo-examples/example/demo-rsp/deployment_sched_pref.yaml

Check the replicas on both clusters:
banzaionprem -> kubectl get deployments -n test
NAME              READY   UP-TO-DATE   AVAILABLE   AGE
test-deployment   9/9     9            9           26m

banzaispot -> kubectl get deployments -n test
NAME              READY   UP-TO-DATE   AVAILABLE   AGE
test-deployment   3/3     3            3           26m

You can also take a look at the FederatedDeployment resource's spec.overrides section to see the overrides made by the ReplicaScheduler:
kubectl get federateddeployments.types.kubefed.io test-deployment -n test -o yaml
...
spec:
  overrides:
  - clusterName: banzaionprem
    clusterOverrides:
    - path: /spec/replicas
      value: 9
  - clusterName: banzaispot
    clusterOverrides:
    - path: /spec/replicas
      value: 3
...

Now let's deploy some high priority workloads on our banzaionprem cluster. Actually, we will place the same test application, but with a much higher pod priority. To set a priority for a pod, you first have to create a PriorityClass resource; pods not associated with a PriorityClass have priority 0 and will be preempted on the banzaionprem cluster. Our expectation is that the ReplicaScheduler will then rebalance the Pending pods to the banzaispot cluster.
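The linked high_prio_deployment.yaml does this for our demo; conceptually, such a workload boils down to a PriorityClass plus a Deployment that references it. A hedged sketch, not the verbatim demo manifest (on clusters older than 1.14 the apiVersion would be scheduling.k8s.io/v1beta1, and the image is illustrative):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000                         # higher value wins during preemption
globalDefault: false
description: "High priority class for the demo workload"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: highprio
  namespace: test
spec:
  replicas: 8
  selector:
    matchLabels:
      app: highprio
  template:
    metadata:
      labels:
        app: highprio
    spec:
      priorityClassName: high-priority   # these pods can preempt the 0-priority test-deployment pods
      containers:
      - name: app
        image: banzaicloud/satellite     # illustrative image reference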
kubectl create -f https://raw.githubusercontent.com/banzaicloud/kubefed/demo-examples/example/demo-rsp/high_prio_deployment.yaml

After a few minutes you should see the following deployment replica counts:
banzaionprem -> kubectl get deployments -n test
NAME              READY   UP-TO-DATE   AVAILABLE   AGE
highprio          8/8     8            8           4m38s
test-deployment   3/9     9            3           61m

banzaispot -> kubectl get deployments -n test
NAME              READY   UP-TO-DATE   AVAILABLE   AGE
test-deployment   9/9     9            9           61m

Note how on banzaionprem there are 8 replicas of the highprio deployment running, while only 3 of the 9 test-deployment replicas are running; on the banzaispot cluster, however, there are 9 replicas running. Also note that the Pending pods of test-deployment didn't disappear from the banzaionprem cluster, even after the ReplicaScheduler pushed replicas out to the spot cluster.
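If you'd like to see those Pending pods yourself, you can list them on the on-premise cluster with a field selector:

banzaionprem -> kubectl get pods -n test --field-selector=status.phase=Pending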
Now scale down the highprio app, to see if replicas of test-deployment will again be rebalanced to the banzaionprem cluster.
kubectl patch deployment highprio --patch '{"spec":{"replicas":1}}' -n test

Give the ReplicaScheduler a little time, then check the deployments on both clusters:
banzaionprem -> kubectl get deployments -n test
NAME              READY   UP-TO-DATE   AVAILABLE   AGE
highprio          1/1     1            1           39m
test-deployment   9/9     9            9           96m

banzaispot -> kubectl get deployments -n test
NAME              READY   UP-TO-DATE   AVAILABLE   AGE
test-deployment   3/3     3            3           97m

As you can see, test-deployment is back to its original state, right where it was before deploying high priority pods.
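Once you're done experimenting, one way to clean up is to delete the demo resources, using the same kubectl contexts you used to create them:

kubectl delete -f https://raw.githubusercontent.com/banzaicloud/kubefed/demo-examples/example/demo-rsp/high_prio_deployment.yaml
kubectl delete -f https://raw.githubusercontent.com/banzaicloud/kubefed/demo-examples/example/demo-rsp/deployment_sched_pref.yaml
kubectl delete -f https://raw.githubusercontent.com/banzaicloud/kubefed/demo-examples/example/demo-rsp/federateddeployment.yaml
kubectl delete -f https://raw.githubusercontent.com/banzaicloud/kubefed/demo-examples/example/demo-rsp/federatednamespace.yaml
kubectl delete ns test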
I hope that this raw technical content and demonstration were useful in helping you better understand Kubernetes federation v2. As usual, we are hard at work making the Pipeline platform the most complete and feature rich multi-/hybrid-cloud platform, and we're always looking to add options that allow us to experiment with the latest technology available. If you have any questions or suggestions, don't hesitate to contact us on GitHub, LinkedIn, Twitter or Slack. We're happy to help.
