Failover in cloud computing is a process that switches between the primary and the secondary component of the cloud platform. The failover process is triggered by unexpected downtime or scheduled maintenance. It may require human intervention, such as an administrator's confirmation, but it is usually performed automatically. The primary goal of failover is to keep cloud services running and thus to provide high availability of the cloud platform.
Basic failover principles
Once the primary instance becomes unavailable, the failover mechanism should transfer the traffic to a redundant instance that handles client requests in the same way as the primary one. The goal is to perform the whole process fast enough that clients do not notice any downtime at all.
When designing a highly available system, it is important to avoid any single point of failure (SPOF). A SPOF is a part of the system with no redundant counterpart; if it fails, the whole system is paralyzed. The most common SPOFs are routers, VMs and local storage. To avoid SPOFs, ensure that in case of failure or maintenance of the original node there is always an alternative infrastructure that keeps the system in operation. The most common way is to create clusters with multiple resources of the same type.
To perform failover properly, data loss must be prevented as well. The best practice is to use shared storage instead of local storage: it ensures that the data is reachable from any node, so a failure of one node does not affect the availability of shared resources.
A stateless service does not need to store any information about its current state, so it treats each request as an independent one. In other words, the service does not keep track of the requests it served earlier. To make a stateless service highly available, it is enough to create redundant instances and load balance them.
In a stateful service, the result of the first request affects the execution of subsequent requests: the service stores information about each request it has served. Redundancy and load balancing alone therefore do not provide high availability; one also needs to keep the state of the secondary instance in sync with the primary one.
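To make the distinction concrete, here is a minimal Python sketch (the handlers and names are invented for illustration, not taken from any framework): a stateless handler can fail over to any fresh instance, while a stateful one loses its session unless the state is synced.

```python
def stateless_handler(request: dict) -> dict:
    # The response depends only on the request itself; any redundant
    # instance produces the same answer.
    return {"result": request["a"] + request["b"]}


class StatefulHandler:
    """Keeps per-client state that a failover target must replicate."""

    def __init__(self) -> None:
        self.sessions: dict[str, int] = {}

    def handle(self, client_id: str, request: dict) -> dict:
        # Earlier requests affect later ones: a running total per client.
        total = self.sessions.get(client_id, 0) + request["amount"]
        self.sessions[client_id] = total
        return {"total": total}


# Failing a stateless service over to a fresh instance is transparent;
# failing a stateful one over without syncing state loses the session:
primary = StatefulHandler()
primary.handle("alice", {"amount": 10})
standby = StatefulHandler()  # a standby that was never synced
print(primary.handle("alice", {"amount": 5}))  # {'total': 15}
print(standby.handle("alice", {"amount": 5}))  # {'total': 5} - state lost
```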
Active/active vs. active/passive
There are two failover configurations that can be applied to provide high availability of stateful services – active/active and active/passive.
In the active/passive configuration, a system runs one primary (active) instance and a secondary (passive) one that can be brought online. Once an outage of the primary instance is detected, the secondary one takes over the workload. In this configuration, the system needs some time before the secondary instance is ready for service. The length of this period depends on the complexity of the system being switched over and on the type of stand-by mode the passive instance is kept in.
In the active/active configuration, both instances are run by the system concurrently, so the state of the secondary instance is always in sync with the primary one. If the primary instance is brought down, the secondary is immediately ready to take over. Since the instances are identical and both are operating, an active/active installation is often used as a load-balanced one. Strictly speaking, there is no primary and secondary instance, because the instances can exchange their roles.
In general, the active/active configuration is preferable because it guarantees the shortest possible downtime, or no downtime at all. On the other hand, it requires keeping all redundant instances in the same state (for example, ensuring data consistency), and implementing it for stateful services is a challenging task. The active/passive installation is used where an active/active installation is impossible, or where designers want to save computing resources and thus money.
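The promotion step of an active/passive setup can be sketched in a few lines of Python. This is a toy model with invented names, not a real HA tool; it only shows the decision the failover mechanism makes after each health check.

```python
class Instance:
    """A service instance as seen by a toy failover controller."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.healthy = True
        self.active = False


def failover_step(active: Instance, standby: Instance) -> Instance:
    """Return the instance that should serve traffic after one check."""
    if active.healthy:
        active.active = True
        return active
    # Promote the standby. For a stateful service, its state must
    # already be in sync with the failed primary at this point.
    active.active = False
    standby.active = True
    return standby


primary, secondary = Instance("primary"), Instance("secondary")
print(failover_step(primary, secondary).name)  # primary
primary.healthy = False                        # outage detected
print(failover_step(primary, secondary).name)  # secondary
```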
A health check is a mechanism that determines the status of a given service. The goal is to recognize a failure soon enough that the system can perform failover without exposing the issue to the end user. An instance is considered healthy if it is capable of responding to requests correctly and communicating over the network.
Once the health check detects an unhealthy service, communication is redirected to other nodes. To bring the broken service back online, the health-check mechanism can try to restart the service instance or even the whole node on which the instance is running. The process may require an administrator's confirmation, but it is usually triggered automatically. As soon as the instance is back online, it rejoins the cluster and can handle requests again.
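A basic health check can be as simple as a TCP connect probe with a retry threshold. The following Python sketch is illustrative (not taken from any particular tool); it marks an instance healthy if any of a few probes succeeds, so a single dropped packet does not cause flapping.

```python
import socket


def probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """One probe: can we open a TCP connection to the service?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def is_healthy(host: str, port: int, retries: int = 3) -> bool:
    """Healthy if at least one of `retries` probes succeeds."""
    return any(probe(host, port) for _ in range(retries))
```

A real health check would usually also verify that the service answers correctly at the application level (e.g. an HTTP status endpoint), not only that the port accepts connections.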
Now, let’s take a brief look at some failover strategies and technologies:
Pacemaker and HAProxy
One way to provide high availability of a system is to combine Pacemaker with HAProxy. The role of Pacemaker is to create a service cluster and to restart HAProxy if it is not running. HAProxy is a proxy service that receives requests and selects the instance to which each request is passed. Once a service has processed the request, the result is directed back to the original source. Passing requests to different nodes provides load balancing.
HAProxy itself is a SPOF, so it is necessary to create an HAProxy cluster to provide failover. Pacemaker organizes each service in an HA cluster.
Pacemaker is a high-availability cluster resource manager that supports both active/active and active/passive configurations. It includes health-check monitoring and is often used with Corosync as its communication layer.
Data integrity inside a Pacemaker cluster is ensured by fencing agents. STONITH fencing monitors the consistency of the nodes and, if necessary, shuts down inconsistent ones. Fencing isolates a corrupted service so that it cannot cause any damage to the rest of the system.
HAProxy is a load balancer and proxy service for HTTP and TCP, capable of handling and distributing both kinds of requests.
HAProxy offers various load-balancing algorithms, which lets a load-balancing cluster scale to the different needs of its applications. If a client needs to communicate with one specific node, we can either use a suitable balancing algorithm (e.g. source) or a stick table.
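As an illustration, a minimal haproxy.cfg fragment might look like the following (the backend name and server addresses are made up). `balance source` hashes the client IP so that a client keeps reaching the same node; the commented-out stick table is the alternative persistence mechanism.

```
frontend www
    bind *:80
    default_backend app_nodes

backend app_nodes
    balance source
    # Alternative: sticky sessions via a stick table keyed on source IP
    # stick-table type ip size 200k expire 30m
    # stick on src
    server node1 10.0.0.11:8080 check
    server node2 10.0.0.12:8080 check
```

The `check` keyword enables HAProxy's built-in health checks, so requests stop being forwarded to a node that fails its probes.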
Docker Swarm, Kubernetes and Apache Mesos
There are various container-based solutions that can be turned into clusters to provide failover. The existing approaches differ in many aspects, but all of them treat containers as the key building block. To explain the advantages of failover with container clustering, we first describe the difference between a container and a VM.
Containers vs. VMs
Using containers makes the infrastructure much simpler and lighter. With VMs, the physical machine runs a host OS with a VM hypervisor on top, which manages the VMs running on the node. Each VM has its own operating system with applications running on it.
Alternatively, we can let the host OS run a container engine that manages a pool of containers on top of it. These containers share the OS kernel of the machine they are running on. Although the applications on one machine run on the same kernel, the containers isolate them from one another, so an application running in one container cannot affect an application running in another container.
HA in container clusters
Similar to Pacemaker, one can create a cluster of containers as well. A container cluster is a pool of application hosts controlled by a container cluster manager, which is responsible for allocating shared resources to the applications that currently require them. You can visualize a container cluster as a typical Linux OS at a different level of abstraction: the cluster manager distributes the cluster's resources to the application instances that need them. Unlike clustering with Pacemaker shown above, container clustering allows you to incorporate various redundant resources within one cluster; you do not have to dedicate a separate cluster to every kind of application. Having diverse applications in one cluster gives the resource manager a larger pool of resources and thus lets it distribute them more effectively.
High availability of running services is ensured by creating the container cluster. Once there are redundant resources within the cluster, failover to these resources is possible in case of a failure of the primary system. Load balancing inside the cluster is handled by the container cluster management system, which allocates resources to the applications. Once you have a cluster with redundant services on different nodes, you have launched a highly available service.
Apache Mesos is a distributed systems kernel. The infrastructure is managed by the Mesos master daemon: this resource manager implements the load-balancing and task-distribution strategy, enables fine-grained sharing of resources, and is responsible for resource management across frameworks. Every node runs a Mesos slave daemon that allocates resources to the applications that require them.
Apache Mesos has an integrated high-availability mode to provide failover automatically. With this mode switched off, there is only one master daemon, which becomes a SPOF. The high-availability mode creates a cluster of redundant master daemons to which the primary master can fail over. Apache ZooKeeper is used to elect a leading master and to inform all slave daemons which master is currently active. If the master daemon fails, ZooKeeper automatically elects a new master and informs all slaves.
To provide a failover solution with Docker containers, you can use Docker Swarm, a lightweight clustering system that is relatively simple to use. The cluster is managed by a swarm manager, which is itself a SPOF. You can avoid this SPOF by creating redundant managers that operate in stand-by mode until the primary manager is brought down; Swarm then picks a new leader to manage the cluster resources.
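As a sketch, setting up such a Swarm might look like the following commands (the addresses are placeholders and the token is elided; a running Docker daemon on each node is required, so this is illustrative rather than a copy-paste recipe):

```
# Initialize the cluster on the first manager node.
docker swarm init --advertise-addr 10.0.0.1

# Print the join token for additional managers.
docker swarm join-token manager

# On further nodes: join as redundant managers,
# so the swarm manager is no longer a SPOF.
docker swarm join --token <manager-token> 10.0.0.1:2377

# Run a service with redundant replicas; Swarm reschedules
# them onto healthy nodes if a node fails.
docker service create --name web --replicas 3 nginx
```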
Kubernetes is a Linux cluster manager designed for running distributed applications. It groups related containers into a Pod; this group of containers works together to provide the desired service. The containers within a Pod share the same IP address, and Pods can communicate with one another without NAT even when they run on different nodes. This makes Pods quite easy to maintain and manipulate.
To create a highly available Kubernetes cluster, we have to provide failover nodes for the cluster manager and the API server, which are SPOFs by default. There are two possibilities for failover: load balancing in an active/active configuration, or an active/passive configuration with a primary instance running and a secondary in stand-by mode. Should one of these services break down, the replicas keep the system running.
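For the services themselves, redundancy in Kubernetes is declared through replicas. An illustrative Deployment manifest might look like this (the names and image are made up; the liveness probe is Kubernetes' built-in health check, which restarts failed containers):

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3            # three redundant Pods to fail over between
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx:1.25
        livenessProbe:   # health check; failed containers are restarted
          httpGet:
            path: /
            port: 80
```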
Anycast routing with Quagga
Quagga is a routing software suite that can be used for anycast routing. It implements various routing protocols, such as OSPF (Open Shortest Path First), RIP (Routing Information Protocol) and BGP (Border Gateway Protocol). This means that Quagga can provide a node with these routing protocols, as if the nodes were routers.
The fundamental idea of failover through anycast routing is that the router forwarding a request can target any instance with the anycast IP address. Because every node in the anycast topology actively communicates with a router of its network, the OSPF daemon becomes aware of a failure of one particular node immediately. In that case, it no longer forwards requests to the failed node, while the healthy nodes stay in the topology.
The OSPF protocol is well suited to anycast failover: it routes each request to the node that can be reached at the lowest cost. When several nodes share the same metric, ECMP (equal-cost multi-path) routing is the best practice for load balancing between these equal nodes; it uses a round-robin algorithm. Flow pinning can be used as well, ensuring that one client IP address is always bound to one particular node.
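As a sketch, each node could advertise the same anycast address via Quagga roughly as follows (all addresses are made up; in Quagga the interface address belongs in zebra.conf and the OSPF part in ospfd.conf). When a node dies, its advertisement disappears and traffic shifts to the remaining nodes.

```
! zebra.conf - assign the shared anycast address to the loopback
interface lo
 ip address 10.10.10.10/32
!
! ospfd.conf - advertise it into OSPF area 0
router ospf
 ospf router-id 192.0.2.1
 network 10.10.10.10/32 area 0
 network 192.0.2.0/24 area 0
```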
Because of its simplicity, this model is quite flexible; on the other hand, it is better suited to smaller systems.
There are many different approaches to providing failover of cloud resources at the moment. This article presented the basic idea of failover and gave a short overview of a few high-availability tools that you can use in the cloud. We at Cloud&Heat would like to highlight the cloud-specific failover solutions that manage clusters of containers. Providing a failover solution increases resource consumption, because redundant instances have to be created. Using containers can reduce the resources that the same application running on VMs would use: in contrast to VMs, containers do not require an additional OS layer, so the overall increase in memory and CPU usage caused by launching redundant instances is minimized.
To begin working with high availability in container clusters, we would suggest Docker Swarm, because it is a versatile, scalable system that is easy to set up.
In general, every technology has its advantages and disadvantages, so the right one should be chosen with regard to the system it is planned for. Read the documentation of the specific projects to get detailed information about the technologies they use.