Autoscaling an SKS Nodepool
One of the prime features of running your applications on Kubernetes is the ability to scale your cluster automatically, based on the current workload, without manual intervention.
In Kubernetes, one can scale:
- the Pods themselves vertically by raising and lowering resource requests using the Vertical Pod Autoscaler
- the Pods horizontally by raising and lowering the number of Pods in a deployment using the Horizontal Pod Autoscaler (see the test below for an example)
- the number of nodes by resizing the nodepool based on either node utilization or Pod deployment requirements using the Cluster Autoscaler
Here we will describe the Cluster Autoscaler option.
Prerequisites
As a prerequisite for the following documentation, you need:
- An Exoscale SKS (Pro) Cluster
- Access to your Cluster via kubectl
- Basic Linux knowledge
If you don’t have access to an SKS cluster yet, follow the Quick Start Guide.
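If you already have a cluster but no kubeconfig for it yet, one way to obtain one is via the Exoscale CLI; the cluster name, user and zone below are placeholders, and the exact flags may differ between CLI versions:

# Generate a kubeconfig for user "kube-admin" and point kubectl at it
exo compute sks kubeconfig my-sks-cluster kube-admin --zone ch-gva-2 > kubeconfig
export KUBECONFIG=$PWD/kubeconfig
kubectl get nodes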
Deploying the Cluster Autoscaler
You can find the Kubernetes Autoscaler here and the Exoscale provider in the ./cluster-autoscaler/cloudprovider/exoscale folder.
As described there in the README, you first need to create a secret containing an appropriate API key and secret as well as the zone of your cluster:
export EXOSCALE_API_KEY="EXOxxxxxxxxxxxxxxxxxxxxxxxx"
export EXOSCALE_API_SECRET="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
export EXOSCALE_ZONE="ch-gva-2"
kubectl -n kube-system create secret generic exoscale-api-credentials \
    --from-literal=api-key="$EXOSCALE_API_KEY" \
    --from-literal=api-secret="$EXOSCALE_API_SECRET" \
    --from-literal=api-zone="$EXOSCALE_ZONE"
Note
This API key must be authorized to perform the following API operations (a sketch of a matching IAM role policy follows the list):
evict-sks-nodepool-members
get-instance
get-instance-pool
get-operation
get-quota
list-sks-clusters
scale-sks-nodepool
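As a rough sketch only, a restricted IAM role policy allowing just these operations might look like the following. This assumes the Exoscale IAM v3 policy format; the service class and expression syntax here are assumptions, so check the current IAM documentation before using it:

{
  "default-service-strategy": "deny",
  "services": {
    "compute": {
      "type": "rules",
      "rules": [
        {
          "action": "allow",
          "expression": "operation in ['evict-sks-nodepool-members', 'get-instance', 'get-instance-pool', 'get-operation', 'get-quota', 'list-sks-clusters', 'scale-sks-nodepool']"
        }
      ]
    }
  }
}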
Afterwards you can grab the deployment manifests and deploy the Autoscaler:
kubectl apply -f cluster-autoscaler.yaml
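You can check that the Autoscaler came up correctly, for example:

# Wait for the rollout to finish, then list the Autoscaler Pod
kubectl -n kube-system rollout status deployment/cluster-autoscaler
kubectl -n kube-system get pods | grep cluster-autoscaler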
Note
When testing, you should adjust the commented-out timeouts towards the end of the file, so that scale-down happens within a minute or so instead of the slower behaviour you want in a production deployment.
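As an illustration, faster scale-down for testing could be achieved with flags like the following in the Autoscaler container's argument list (these are standard Cluster Autoscaler flags; the values are just examples and not suited for production):

      # example flags for quick scale-down when testing
      - --scan-interval=10s
      - --scale-down-delay-after-add=30s
      - --scale-down-unneeded-time=30s
      - --scale-down-delay-after-delete=10s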
Multiple nodepools
In case your cluster has multiple nodepools, you might want to tell the Cluster Autoscaler which nodepool should be scaled up and down. Otherwise a random nodepool will be scaled.
To do this, you need to create a ConfigMap with your nodepool's Instance Pool ID (NOT the nodepool ID):
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    50:
      - 00719c6e-1d06-4053-afea-8926c3431ef7
You also need to add an argument to the autoscaler deployment:
- --expander=priority
The Cluster Autoscaler will then target this prioritized nodepool instead.
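One way to look up the Instance Pool ID behind a nodepool is via the Exoscale CLI; the cluster and nodepool names below are placeholders, and the output format may differ between CLI versions:

# Show nodepool details, including the ID of its underlying Instance Pool
exo compute sks nodepool show my-sks-cluster my-nodepool --zone ch-gva-2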
Putting it to the Test
To test that everything is working as it should, we create a Deployment that is very busy but still fits on our current two nodes:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stress
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      run: stress
  template:
    metadata:
      labels:
        run: stress
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: run
                    operator: In
                    values:
                      - stress
              topologyKey: kubernetes.io/hostname
      containers:
        - image: nixery.dev/stress:latest
          name: stress
          command:
            - stress
            - --cpu
            - "1"
          resources:
            limits:
              cpu: 300m
              memory: 30Mi
            requests:
              cpu: 150m
              memory: 15Mi
Note how we use a podAntiAffinity to ensure these Pods can’t share a node. This reflects real-world requirements, but a node simply filling up with Pods would result in similar behaviour.
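Assuming the manifest above is saved as stress-deployment.yaml (a hypothetical filename), deploy it and watch the Pods come up:

kubectl apply -f stress-deployment.yaml
kubectl get pods -l run=stress -w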
After this is deployed, the CPU limit will be consumed in each Pod, and the Kubernetes metrics show this:
$ kubectl top pods
NAME                     CPU(cores)   MEMORY(bytes)
stress-d77bdc8db-22dwb   292m         0Mi
stress-d77bdc8db-pd4xn   291m         0Mi
We now use the Horizontal Pod Autoscaler to react to these internal metrics and increase this deployment's replicas up to 11:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: stress
  namespace: default
spec:
  minReplicas: 1
  maxReplicas: 11
  metrics:
    - resource:
        name: cpu
        target:
          averageUtilization: 50
          type: Utilization
      type: Resource
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: stress
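Assuming the HPA manifest is saved as stress-hpa.yaml (a hypothetical filename), apply it and watch both the replica count and the nodes:

kubectl apply -f stress-hpa.yaml
kubectl get hpa stress -w
kubectl get nodes -w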
We can see new Pods being spawned and new nodes being added after a short while as the Pods can’t be scheduled onto an existing node:
NAME               STATUS     ROLES    AGE    VERSION
pool-0f4cd-fquea   Ready      <none>   3d6h   v1.23.3
pool-0f4cd-llrsh   NotReady   <none>   16s    v1.23.3
pool-0f4cd-vhend   NotReady   <none>   15s    v1.23.3
pool-0f4cd-ygwhj   Ready      <none>   3d6h   v1.23.3
pool-0f4cd-zknko   NotReady   <none>   15s    v1.23.3
[...]
You can look into the Cluster Autoscaler logs to see it taking action:
$ kubectl -n kube-system logs deployment/cluster-autoscaler
[...]
I0324 19:50:12.521062 1 scale_up.go:675] Scale-up: setting group 0f4cd2ad-2825-4ae7-aaa2-1fd0f8e0af19 size to 5
I0324 19:50:12.530709 1 log.go:32] exoscale-provider: scaling SKS Nodepool afe7aa21-b4f6-409b-a70b-6fc31d6fada1 to size 5
[...]
Once the new nodes are ready, the Pods will start up on these. When you remove the test deployment, superfluous nodes will be removed after a grace period.
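For example, cleaning up the test resources and watching the cluster shrink again could look like this:

kubectl delete hpa stress
kubectl delete deployment stress
# after the scale-down grace period, the extra nodes disappear again
kubectl get nodes -w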
Tips and Tricks
- It can be useful to add --v=4 to the list of arguments to get more information in the logs about why the Cluster Autoscaler makes a certain decision.
- Certain Pods can prevent the Autoscaler from removing a node. See the CA FAQ for more on that.
- You can also annotate certain nodes not to be touched by the CA (see the example after this list).
- Longhorn users: note that the Cluster Autoscaler is not fully supported (see the Longhorn documentation).
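For example, to protect a single node from being scaled down, you can set the Cluster Autoscaler's scale-down-disabled annotation (replace the node name with one of yours):

kubectl annotate node pool-0f4cd-fquea cluster-autoscaler.kubernetes.io/scale-down-disabled=true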