Fault Tolerance and Scaling

Maintaining a quorum

To ensure that both service-level and cluster-level operations run smoothly, a quorum of cluster nodes must be running at all times. A quorum is a strict majority: more than half of the nodes must be running and communicating with each other at any given moment.

Your cluster should always be designed and built to contain an odd number of nodes, which helps to maintain a quorum in both normal and adverse networking conditions. Adding one node to an odd-sized cluster raises the quorum size without increasing the number of node failures the cluster can tolerate. Keep this in mind when planning your deployment and any later changes to your cluster.

# of nodes in cluster	# of nodes required for a quorum
3	2
5	3
n	floor(n / 2) + 1
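
The quorum arithmetic in the table can be sketched in a few lines of Python. This is only an illustration of the majority rule described above; the function names quorum_size and tolerated_failures are made up for this example and are not part of the product.

```python
def quorum_size(node_count: int) -> int:
    """Smallest number of nodes that is more than half of node_count."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    # Integer division rounds down, so this is floor(n / 2) + 1.
    return node_count // 2 + 1


def tolerated_failures(node_count: int) -> int:
    """Number of nodes that can be lost while the rest still form a quorum."""
    return node_count - quorum_size(node_count)


if __name__ == "__main__":
    for n in (3, 4, 5, 6, 7):
        print(n, quorum_size(n), tolerated_failures(n))
    # Prints:
    # 3 2 1
    # 4 3 1
    # 5 3 2
    # 6 4 2
    # 7 4 3
```

Note that 4 nodes tolerate no more failures than 3, and 6 no more than 5, which is one way to see why an odd node count is recommended.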

Failure handling

Services. The health of all services in the system is monitored.

If a service is found to be unhealthy, the system will automatically attempt to self-heal, generally by restarting the process.
Service interruptions may occur depending on the type of failure.
Events regarding detected failures can be viewed in the Cluster Management dashboard.

Nodes. When a cluster node becomes unavailable for any reason, whether planned or unplanned:

The cluster will generally move the services that had been running on that node onto other nodes.
It may take five minutes or more for a node to be recognized as unavailable. This delay is designed to prevent unwarranted service disruptions that could be triggered by temporary conditions, such as intermittent network issues.
Instructions for gracefully shutting down or rebooting a node are provided in the Cluster Management - Nodes help. Follow them every time you shut down or reboot a node.
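
The reason for the detection delay can be illustrated with a small sketch. It assumes, purely for illustration, that availability is judged from the time of a node's last heartbeat; how the cluster actually detects an unavailable node is not described here, and NodeStatus, is_unavailable, and the 300-second value are hypothetical names and numbers chosen to match the "five minutes" mentioned above.

```python
from dataclasses import dataclass

# Assumption: a 5-minute grace period, taken from the text above.
GRACE_PERIOD_SECONDS = 300.0


@dataclass
class NodeStatus:
    name: str
    last_heartbeat: float  # seconds, measured on any monotonic clock


def is_unavailable(node: NodeStatus, now: float,
                   grace: float = GRACE_PERIOD_SECONDS) -> bool:
    """Treat a node as unavailable only after it has been silent for the
    whole grace period, so that short network blips do not trigger a move
    of its services."""
    return now - node.last_heartbeat >= grace


if __name__ == "__main__":
    node = NodeStatus(name="node-1", last_heartbeat=0.0)
    print(is_unavailable(node, now=30.0))   # False: a brief interruption
    print(is_unavailable(node, now=300.0))  # True: silent for 5 minutes
```

A shorter grace period would react faster to real outages but would also move services more often in response to temporary conditions.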

Scaling

If you need additional capacity beyond what the standard deployment provides, choose from these two options:

Vertical scaling - Add more memory and CPUs to your nodes. This is the suggested method for scaling as it does not require the additional complexity of managing more nodes.
Horizontal scaling - Add more nodes to your cluster. While this allows you to scale as much as needed, it involves managing additional nodes.

Important

You must always have an odd number of nodes in your cluster.

Headroom

When building a fault-tolerant cluster, each node must reserve a minimum level of free compute resources so that it can take on additional load when needed.

When scaling vertically, we recommend provisioning double the stated system requirements.
When scaling horizontally, no extra allowance is needed because headroom has already been factored into the system requirements.
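
As a rough way to reason about headroom, the sketch below computes the highest average utilization each node can run at if the surviving nodes are to absorb the load of failed ones. It assumes that load is spread evenly across nodes and redistributed evenly after a failure, which is a simplification and not a statement about how the cluster actually schedules services. The function names are hypothetical.

```python
def max_safe_utilization(node_count: int, failed_nodes: int = 1) -> float:
    """Highest per-node utilization (0.0 to 1.0) at which the surviving
    nodes can still carry the full load, assuming even redistribution."""
    survivors = node_count - failed_nodes
    if node_count < 1 or failed_nodes < 0 or survivors < 1:
        raise ValueError("need at least one surviving node")
    # Total load is node_count * u; it must fit on the survivors,
    # so u <= survivors / node_count.
    return survivors / node_count


def has_headroom(node_count: int, utilization: float,
                 failed_nodes: int = 1) -> bool:
    """True if the cluster can lose failed_nodes nodes without overload."""
    return utilization <= max_safe_utilization(node_count, failed_nodes)


if __name__ == "__main__":
    print(has_headroom(3, 0.60))  # True: 3 nodes at 60% can lose one node
    print(has_headroom(3, 0.80))  # False: 2 survivors would need 120%
```

Under this simplified model, a 3-node cluster should stay at or below about two thirds utilization per node, and a 5-node cluster at or below 80%, to survive the loss of one node.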