Reliability & Resilience

Every system fails eventually. Whether that becomes a minor disruption or a serious incident depends on decisions made before the failure — not during it.

What is the actual cost of downtime for your business?

This is the first question to answer, because it determines everything else. An internal tool going offline for an hour is a different problem than a customer-facing service doing the same.

Two concepts help frame the conversation:

RPO (Recovery Point Objective) — how much data loss is acceptable, expressed as time. This drives how frequently backups need to run.

RTO (Recovery Time Objective) — how long the service can be down before it causes serious business impact. This drives how ready your recovery path needs to be.

There is no universally correct answer. What matters is that your RPO and RTO targets are deliberate, documented, and reflected in your architecture.
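
To make the RPO side concrete, here is a minimal sketch of how backup frequency maps to worst-case data loss. It assumes a backup captures the state at the moment it starts and only becomes usable once it finishes; the function names and the example figures are illustrative, not part of any Exoscale API.

```python
from datetime import timedelta


def worst_case_data_loss(
    backup_interval: timedelta,
    backup_duration: timedelta = timedelta(0),
) -> timedelta:
    """Worst-case data loss under the stated assumptions.

    If a failure hits just before the next backup completes, the newest
    usable backup reflects the state from one full interval earlier,
    plus however long a backup takes to finish.
    """
    return backup_interval + backup_duration


def meets_rpo(
    backup_interval: timedelta,
    rpo: timedelta,
    backup_duration: timedelta = timedelta(0),
) -> bool:
    """Return True if the backup schedule keeps worst-case loss within the RPO."""
    return worst_case_data_loss(backup_interval, backup_duration) <= rpo
```

For example, hourly backups that each take 10 minutes give a worst case of 70 minutes, which misses a one-hour RPO. Under the same assumptions, a 15-minute interval with a near-instant backup meets it comfortably.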

What happens when a single component fails?

Exoscale zones are independent — a failure in one has no effect on others. Within a zone, a single instance is a single point of failure.

Instance Pools with the Network Load Balancer let you run multiple instances behind a shared endpoint. If one fails, the others continue serving traffic. Anti-Affinity Groups spread those instances across different physical hosts, so a hardware failure does not take down more than one instance at a time.
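
The benefit of redundancy can be estimated with simple arithmetic. The sketch below assumes instance failures are independent, which Anti-Affinity Groups approximate for host-level hardware faults but which does not hold for zone-wide outages or a bad deployment. The availability figures are illustrative inputs, not Exoscale SLA values.

```python
def pool_availability(instance_availability: float, instances: int) -> float:
    """Probability that at least one instance is up, assuming independent failures.

    instance_availability: fraction of time a single instance is up (0.0 to 1.0).
    instances: number of instances behind the shared endpoint.
    """
    if not 0.0 <= instance_availability <= 1.0:
        raise ValueError("instance_availability must be between 0.0 and 1.0")
    if instances < 1:
        raise ValueError("instances must be at least 1")
    # The pool is down only when every instance is down at the same time.
    return 1.0 - (1.0 - instance_availability) ** instances
```

With a single instance at 99% availability, adding a second independent instance raises the estimate to roughly 99.99%. The model says nothing about whether the remaining instances can absorb the full load, so capacity still needs to be sized for the failure case.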

For containerised workloads, SKS (Scalable Kubernetes Service) provides a managed Kubernetes control plane with node pool autoscaling.

For databases, Exoscale DBaaS handles failover, patching, and backups. It covers PostgreSQL, MySQL, Valkey, Kafka, OpenSearch, and Grafana — each with a published SLA and documented architecture.

Have you tested your recovery?

A backup that has never been restored is an assumption. A failover process that has never been practiced is a plan, not a capability.

Instance snapshots provide point-in-time recovery for compute. DBaaS backup and restore cover the database side. Both should be tested, not just configured.
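
One way to turn a restore into a repeatable test is to time it and compare the result against your RTO. The sketch below is a generic harness, not an Exoscale tool: `restore` and `verify` are hypothetical callables you would supply yourself, for example a script that restores a snapshot into a scratch environment and a check that the restored service answers correctly.

```python
import time
from datetime import timedelta
from typing import Callable


def run_restore_drill(
    restore: Callable[[], None],
    verify: Callable[[], bool],
    rto: timedelta,
) -> dict:
    """Time a restore plus verification and compare the total against the RTO.

    restore: hypothetical callable that performs the restore and blocks until done.
    verify: hypothetical callable that returns True if the restored system is usable.
    """
    started = time.monotonic()
    restore()
    verified = bool(verify())
    elapsed = timedelta(seconds=time.monotonic() - started)
    return {
        "elapsed": elapsed,
        "verified": verified,
        # A fast restore that produces unusable data does not count as a pass.
        "within_rto": verified and elapsed <= rto,
    }
```

Running a drill like this on a schedule and recording the elapsed time gives you a measured recovery time to compare against the documented target, rather than an estimate.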

Going deeper

Instance Pools — groups of identical instances that scale and self-heal
Anti-Affinity Groups — spreading instances across physical hosts
Network Load Balancer — distributing traffic and removing unhealthy instances automatically
NLB health checks — how the NLB decides when an instance is ready to receive traffic
Snapshots — point-in-time recovery for compute instances
DBaaS — managed databases with built-in high availability and backups
SKS autoscaling — node pool scaling for containerised workloads
Data Center Zones — zone locations and independence model

Last updated on April 3, 2026