Reliability & Resilience

Every system fails eventually. Whether that becomes a minor disruption or a serious incident depends on decisions made before the failure — not during it.

What is the actual cost of downtime for your business?

This is the first question to answer, because it determines everything else. An internal tool going offline for an hour is a different problem than a customer-facing service doing the same.

Two concepts help frame the conversation:

RPO (Recovery Point Objective) — how much data loss is acceptable, expressed as time. This drives how frequently backups need to run.

RTO (Recovery Time Objective) — how long the service can be down before it causes serious business impact. This drives how ready your recovery path needs to be.

There is no universally correct answer. What matters is that your answer is deliberate, documented, and reflected in your architecture.
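
To make the two objectives concrete, the arithmetic can be sketched as below. This is an illustrative model only, not an Exoscale tool: the names `RecoveryTargets`, `worst_case_data_loss`, `meets_rpo`, and `meets_rto` are hypothetical, and the sketch assumes each backup captures the state at the moment it starts and only becomes usable once it finishes.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryTargets:
    """Hypothetical container for the two documented objectives."""
    rpo: timedelta  # maximum acceptable data loss, expressed as time
    rto: timedelta  # maximum acceptable downtime


def worst_case_data_loss(backup_interval: timedelta,
                         backup_duration: timedelta) -> timedelta:
    # Assumption: a backup reflects the state at its start time and is only
    # usable once complete. A failure just before the next backup completes
    # therefore loses one full interval plus the backup run time.
    return backup_interval + backup_duration


def meets_rpo(targets: RecoveryTargets,
              backup_interval: timedelta,
              backup_duration: timedelta) -> bool:
    return worst_case_data_loss(backup_interval, backup_duration) <= targets.rpo


def meets_rto(targets: RecoveryTargets,
              detection: timedelta,
              restore: timedelta,
              validation: timedelta) -> bool:
    # Downtime is modelled as the time to notice the failure, restore the
    # service, and validate it before reopening traffic.
    return detection + restore + validation <= targets.rto


targets = RecoveryTargets(rpo=timedelta(hours=1), rto=timedelta(hours=4))

# Hourly backups that take 10 minutes can lose up to 70 minutes of data.
print(meets_rpo(targets, timedelta(hours=1), timedelta(minutes=10)))     # False
print(meets_rpo(targets, timedelta(minutes=45), timedelta(minutes=10)))  # True
print(meets_rto(targets, timedelta(minutes=15), timedelta(hours=2),
                timedelta(minutes=30)))                                  # True
```

The first check shows why a backup interval equal to the RPO is usually not enough once the backup run time is taken into account.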

What happens when a single component fails?

Exoscale zones are independent — a failure in one has no effect on others. Within a zone, a single instance is a single point of failure.

Instance Pools with the Network Load Balancer let you run multiple instances behind a shared endpoint. If one fails, the others continue serving traffic. Anti-Affinity Groups spread those instances across different physical hosts, so a hardware failure does not take down more than one instance at a time.
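
The effect of spreading instances across hosts can be illustrated with a small placement model. This is a toy simulation, not the Exoscale API: the instance names, host names, and the helper functions `worst_case_loss` and `survivors` are hypothetical, and placements are written by hand rather than produced by a real scheduler.

```python
from collections import Counter
from typing import Dict, List


def worst_case_loss(placement: Dict[str, str]) -> int:
    """Largest number of instances that one host failure can take down."""
    return max(Counter(placement.values()).values(), default=0)


def survivors(placement: Dict[str, str], failed_host: str) -> List[str]:
    """Instances still running after the given host fails."""
    return sorted(name for name, host in placement.items()
                  if host != failed_host)


# Without any placement constraint, two instances may share a host.
packed = {"web-1": "host-a", "web-2": "host-a", "web-3": "host-b"}

# With an anti-affinity constraint, every instance sits on its own host.
spread = {"web-1": "host-a", "web-2": "host-b", "web-3": "host-c"}

print(worst_case_loss(packed))      # 2
print(worst_case_loss(spread))      # 1
print(survivors(packed, "host-a"))  # ['web-3']
print(survivors(spread, "host-a"))  # ['web-2', 'web-3']
```

In the packed layout a single host failure removes two of the three instances behind the shared endpoint. In the spread layout the same failure removes only one.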

For containerised workloads, SKS provides a managed Kubernetes control plane with node pool autoscaling.

For databases, Exoscale DBaaS handles failover, patching, and backups. It covers PostgreSQL, MySQL, Valkey, Kafka, OpenSearch, and Grafana — each with a published SLA and documented architecture.

Have you tested your recovery?

A backup that has never been restored is an assumption. A failover process that has never been practiced is a plan, not a capability.

Instance snapshots provide point-in-time recovery for compute. DBaaS backup and restore covers the database side. Both should be tested, not just configured.
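
A recovery test can be as simple as a scripted drill that restores a backup, checks its integrity, and times the whole run against the RTO. The sketch below is a local stand-in under stated assumptions: a file copy takes the place of the real snapshot or database restore, and `restore_drill`, `sha256_of`, the file names, and the 60-second target are all hypothetical.

```python
import hashlib
import shutil
import tempfile
import time
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Checksum used to confirm the restored data matches the original."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def restore_drill(backup: Path, restore_dir: Path,
                  expected_sha256: str, rto_seconds: float) -> dict:
    started = time.monotonic()
    restored = restore_dir / backup.name
    # Stand-in for the real restore step (snapshot revert, database restore).
    shutil.copy2(backup, restored)
    # Validation counts toward the measured recovery time.
    intact = sha256_of(restored) == expected_sha256
    elapsed = time.monotonic() - started
    return {
        "restored": restored.exists(),
        "intact": intact,
        "elapsed_seconds": elapsed,
        "within_rto": elapsed <= rto_seconds,
    }


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    source = root / "data.db"
    source.write_bytes(b"example records")
    checksum = sha256_of(source)      # recorded when the backup is taken
    backup = root / "data.db.bak"
    shutil.copy2(source, backup)      # the "backup" in this toy setup
    restore_dir = root / "restore"
    restore_dir.mkdir()
    report = restore_drill(backup, restore_dir, checksum, rto_seconds=60.0)
    print(report)
```

Running a drill like this on a schedule, and recording the report each time, turns the backup from an assumption into a measured capability.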
