Cloud fault tolerance is the ability of a system to continue operating when part of it fails. It assumes failure will happen and designs around it.

How fault tolerance is built

Common methods include redundancy, automatic failover, retries, replication, multiple zones, health checks, queues, and graceful degradation.

Fault tolerance and high availability

High availability is the goal of staying reachable. Fault tolerance is one design approach that helps achieve that goal.

Fault tolerance is a mindset: plan for failure before it becomes an incident.