High availability, or “higher availability,” is the practice of developing infrastructure that is invulnerable to all kinds of service interruptions.
True high availability is an impossible challenge. Network failure, hardware failure, developer error, and administrator error can and will interrupt all services. Beyond simple errors, load in excess of capacity will often render a service inaccessible. “Load in excess of capacity” often refers to malicious traffic, but unexpected traffic is frequently entirely legitimate. The confluence of these potential issues means that administrators cannot ensure one hundred percent uptime.
There are, however, ways to decrease the chance of downtime and increase the ability of a system to recover quickly from various kinds of failures and service interruptions. This document outlines the fundamentals of and approaches to building reliable systems.
See also
Practically speaking, the technologies that support “highly available” systems are the same technologies that support higher performance systems. In light of this, be familiar with the concepts introduced in the following sections of Systems Administration for Cyborgs:
The following technical concepts, terms, and technologies are particularly relevant to administrators of highly available systems. See the principles section for a more in-depth discussion of these concepts.
To be highly available, a service must be:
The obvious response to these standards is: “yes, but how much?” For nearly every computing service, there is nothing intrinsic that would prevent an administrator from making it tolerant of every kind of network partition, but in most cases that’s not an effective use of resources. This leads to the overriding theory that can inform all high availability work:
All services should be as highly available as possible, given the relative “business” value of the service in question.
In other words, don’t spend time and money making sure that a service will have no apparent downtime, particularly when the service isn’t absolutely mission critical and particularly when many classes of errors are exceedingly rare.
For most deployments, you can ensure that the service will remain available despite the two or three most likely interruptions or failures, and that will be good enough. For all other potential failures, good monitoring and graceful degradation are sufficient solutions.
Highly available systems need to be redundant so that any single host or server process can terminate or become unavailable without impacting the service. Keeping a hot standby of every server or instance can be a monumental challenge. Keeping a hot standby, maintaining the infrastructure to route traffic to the standby when the primary fails, and keeping the secondary up until the primary returns is even more difficult.
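As a hedged illustration of the routing decision that infrastructure has to make, the following sketch chooses between a primary and a hot standby based on a simple reachability test; the hostnames, port, and health check are illustrative assumptions rather than a reference to any particular tool:

```python
import socket

PRIMARY = "app-primary.example.net"   # illustrative hostnames
STANDBY = "app-standby.example.net"

def is_healthy(host, port=80, timeout=2):
    """Treat a host as healthy if it accepts a TCP connection promptly."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def select_backend():
    """Route traffic to the primary while it is up; otherwise use the standby."""
    return PRIMARY if is_healthy(PRIMARY) else STANDBY
```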
Although this is changing with the advent of virtualization, most hosts provide more than one service. For example, an email server may also host an LDAP directory and a DNS server, while various caching layers may reside on the same instances as the application servers.
The first step toward redundant systems is separating services so that each host on the network provides one and only one service. This decreases the likelihood that interactions between processes will affect service. While it’s possible to have redundant multi-tenant systems, it generally makes sense to avoid this kind of architecture for highly available systems.
Depending on the requirements of your infrastructure, in most cases this makes sense only for a couple of mission critical services, while some limited downtime may be (more) acceptable for other services. In these situations you can have separate instances for the critical, highly available services and a couple of multi-tenant systems for the less critical services.
These architecture problems are largely financial: with a bit of menial work and enough money, you can build redundant, load balanced systems with failover.
From a technical standpoint, the biggest challenge in building these kinds of redundant, failover-capable clusters is maintaining “state”: no single server can be the sole source of state regarding a connection or user, although some systems may be able to tolerate some transient state loss during failover.
State is a detail of application implementation. To resolve this, developers will attempt to keep all state in the database or persistence layer, or in some type of shared storage system such as a network or clustered file system. Often, however, these kinds of solutions create more complex infrastructure requirements and add a point of failure that requires additional availability consideration.
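One hedged sketch of that approach, assuming the third-party redis-py client and a shared Redis instance at an illustrative hostname (any shared persistence layer plays the same role): application servers read and write session state through the shared store rather than holding it in process memory, so any server can pick up a session after a failover.

```python
import json

import redis  # assumes the third-party redis-py package

# Illustrative shared store; in production this would itself be replicated.
store = redis.Redis(host="sessions.example.net", port=6379)

def save_session(session_id, data, ttl=3600):
    """Persist session state in the shared store instead of process memory."""
    store.setex(session_id, ttl, json.dumps(data))

def load_session(session_id):
    """Any application server can reconstruct the session after a failover."""
    raw = store.get(session_id)
    return json.loads(raw) if raw else {}
```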
In order for a deployment to be truly “highly available,” all layers need to be able to fail over: application, database, and caching, as well as load balancing and proxy. Ideally, the systems are fully redundant across more than one data center, not simply within a single facility.
Redundancy can add robustness to systems and services if the load balancing layer can distribute traffic between identical nodes or service providers. This is a common strategy for developing high performance systems, but it cannot provide true high availability without some way to recover, or “heal,” when systems fail or become inaccessible. The term “failover” typically describes this process and this functionality.
Failover systems operate by sending heartbeats, or small “ping”-like packets, between all nodes. When a system stops responding to pings [1], the load balancers remove the inaccessible node from active rotation. When the “downed” system becomes accessible and starts returning heartbeats, the failover system [2] reconfigures the load balancers to add the node back to active rotation.
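As a hedged sketch of this pattern, and not a depiction of any particular load balancer or cluster manager, the following illustrates a monitor that counts missed heartbeats and moves nodes out of and back into rotation; the hostnames, port, and threshold are assumptions chosen for illustration:

```python
import socket

NODES = ["app1.example.net", "app2.example.net"]  # illustrative hosts
in_rotation = set(NODES)
missed = {node: 0 for node in NODES}
FAILURE_THRESHOLD = 3  # consecutive missed heartbeats before failover

def heartbeat(node, port=80, timeout=2):
    """Stand-in heartbeat: can we open a TCP connection to the node?"""
    try:
        with socket.create_connection((node, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_once():
    """One pass of the failover/recovery logic."""
    for node in NODES:
        if heartbeat(node):
            missed[node] = 0
            in_rotation.add(node)          # recovery: re-add a healthy node
        else:
            missed[node] += 1
            if missed[node] >= FAILURE_THRESHOLD:
                in_rotation.discard(node)  # failover: pull the downed node
```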
While the “pattern” for failover and recovery is straightforward and used by most implementations and deployments, there is great variance among specific implementations. The way systems decide that a node is “inaccessible” depends on the deployment and usage pattern of the server. For some kinds of systems, a node that is inaccessible for 2 minutes between 9am and 6pm eastern is the maximum tolerance, but that same node could safely be inaccessible for 20 minutes between 10pm and 4am. Coordinate these thresholds with the monitoring solution. Indeed, all high availability systems must be tightly integrated with monitoring systems.
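A small, hedged sketch of such a time-dependent tolerance (the windows and durations below are only examples, and the off-hours window is simplified to “everything outside business hours”):

```python
from datetime import datetime

def allowed_downtime_minutes(now=None):
    """Maximum tolerable unavailability for the current hour: 2 minutes
    during business hours, 20 minutes otherwise (a simplification of the
    example windows above)."""
    now = now or datetime.now()
    return 2 if 9 <= now.hour < 18 else 20
```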
Ideally the cluster management tools will be able to detect (or receive notice of) an inaccessible system or service and remove it from circulation without manual administrator intervention. For some deployments it might not be practical to automate recovery: if you expect failover situations to be relatively rare, with concrete causes, configuring an automatic recovery system may not be a productive use of time. In general, weigh the complexities associated with robust automated failover and recovery against the operationally acceptable downtime.
Possible mechanisms for providing the actual failover include: anycast network routing, DNS-based failover, and moving or “floating” an IP address between systems.
Each method has advantages and disadvantages. Anycast network routing is fast and easy to configure; because it operates on the network layer, it can be nearly transparent to the application layer. DNS-based solutions are easy to configure, but because DNS information is typically cached at multiple layers, changes in DNS configuration may take too long to propagate. Moving or floating an IP address is difficult from a networking perspective, and often requires restarting a number of common daemons, as most software is not designed to handle changing IP address configurations. The correct solution depends on your deployment, your control over the networking infrastructure, the available services of your hosting provider, and the required responsiveness of your high availability setup.
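For instance, a DNS-based failover often amounts to repointing a short-TTL record at the standby when the primary stops responding. The following sketch assumes a hypothetical update_dns_record() helper standing in for whatever API your DNS provider actually exposes, along with illustrative addresses:

```python
import socket

RECORD = "www.example.net"
PRIMARY_IP = "203.0.113.10"   # illustrative addresses
STANDBY_IP = "203.0.113.20"

def update_dns_record(name, address, ttl=60):
    """Hypothetical placeholder for your DNS provider's update API; a short
    TTL keeps cached answers from delaying the failover for too long."""
    print("would point %s (TTL %ss) at %s" % (name, ttl, address))

def primary_is_up(timeout=2):
    """Stand-in health check against the primary address."""
    try:
        with socket.create_connection((PRIMARY_IP, 80), timeout=timeout):
            return True
    except OSError:
        return False

def failover_dns():
    """Repoint the record at the standby when the primary stops answering."""
    if not primary_is_up():
        update_dns_record(RECORD, STANDBY_IP)
```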
[1] The threshold, or point, where the cluster or deployment determines that an instance or node is “down” or inaccessible is actually a complex determination. Because network interruptions can be transient, it may be prudent to only trigger failover if multiple heartbeats fail, or two different kinds of monitoring tests identify a downed node. While failover systems are important and make it possible to automate much of “high availability,” it’s important to not trigger failover situations based on false positives.
[2] Strictly speaking, adding a previously “downed” node to a current cluster is the province of a “recovery” system rather than a failover system. While some modern high availability/cluster management systems can handle both failover and recovery, conventional architecture patterns place an emphasis on failover, and in some cases require/allow administrators to handle recovery manually.
High availability is attainable for all kinds of systems, given some engineering work and sufficient money for powerful and high quality hardware. You can create deployments that will be highly available, and only experience “downtime” for seconds a year. The expense of availability comes from the requirement to procure multiple identical instances of hardware, and contract for redundant and independently provisioned power and network services.
High availability also carries a number of significant operational requirements: every modification to a production environment increases complexity as the deployment gains redundancy and the ability to preserve state through failover. Replication itself can carry overhead that may impact production systems. As a result, even minor modifications to highly available systems become excruciatingly difficult to deploy and maintain as applications develop and needs change.
As a potential counterpoint to traditional “high availability,” “graceful degradation” describes an approach where, rather than relying on “failover” and “recovery,” parts of a service become inaccessible while the service as a whole remains available during network or system maintenance and failure. Graceful degradation may also involve developing applications that build activity around message/work queues (that are themselves highly available), while the other portions of the system have a higher failure tolerance.
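A hedged sketch of graceful degradation at the application level, assuming a page composed of core content plus a non-essential recommendations section; the service call and data shapes are illustrative, not part of any particular framework:

```python
def fetch_recommendations(user_id):
    """Illustrative call to a non-essential backend that may be unavailable."""
    raise ConnectionError("recommendation service unreachable")

def render_page(user_id):
    """Serve the core page even when the non-essential subsystem fails."""
    page = {"content": "core content for user %s" % user_id}
    try:
        page["recommendations"] = fetch_recommendations(user_id)
    except (ConnectionError, TimeoutError):
        page["recommendations"] = []  # degrade: omit the section, stay up
    return page
```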
Planning for high(er) availability requires considering the kinds of failures that can and are likely to occur in a system, and deploying infrastructure to survive and compensate for these kinds of errors. But higher availability isn’t the only way to build reliable systems, and graceful degradation is a good example of approaching the availability challenge more holistically, from the perspective of building more reliable and fault tolerant systems.
When considering the best way to build available systems, or the best ways to increase the reliability of a service, it’s important to consider:
The technologies and configurations that support high availability deployments are the same fundamental technologies and configurations that support high performance systems. For example, the kind of database clustering technology that makes replication and failover possible is often deployed to distribute (primarily read) operations to multiple nodes for performance reasons. Similarly, load balanced, distributed application servers are great for increasing application concurrency.
Administrators and architects need to be sure to keep high performance architecture concerns and increased availability projects distinct. In practice, “secondary” systems in a high availability deployment must be sufficiently robust to support peak production usage.
In the abstract, it makes sense to think about redundancy and failover as simple “A/B” systems that replicate the entire primary (“A”) stack in a secondary (“B”) stack: when the “A” system fails, the “B” system takes over. Practically speaking, however, redundancy and failover need to happen at a much finer-grained level. Databases need to be redundant, load balancers need to be redundant, application servers need to be redundant, as do email servers, directory services, and so forth. By bringing high availability down to the service level, rather than the instance or system level, it’s possible to develop a more resilient and flexible system.
Some application designs and architectures can be more fault resistant than others. While a well designed application is not a substitute for managing availability, certain application designs make it easier to provide a higher level of availability.
For example, an application developed around a highly available queue system, with worker and application processes that hold no particular state and have no particular availability needs, can be highly reliable from the user’s perspective without a large amount of availability-related infrastructure. In this configuration, the database system and the queue system (which may use the same database) are the only crucial components of the system; every other process can be ephemeral. Databases and queuing systems are also typically distinct and robust software packages that applications can use without modification. This strategy revolves around making most of the application stateless and fortifying the state-holding elements of the system.
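A minimal sketch of this pattern, using Python’s standard-library queue purely as a stand-in for a durable, highly available queue or broker; the task handling is illustrative:

```python
import queue

work_queue = queue.Queue()  # stand-in for a durable, highly available queue

def enqueue_task(task):
    """The web tier only records intent; it holds no state of its own."""
    work_queue.put(task)

def process(task):
    """Illustrative handler; real work would write results to the database."""
    print("processed", task)

def worker_loop():
    """Workers are stateless and ephemeral: any worker can take any task,
    and a crashed worker can simply be replaced."""
    while True:
        task = work_queue.get()
        process(task)  # durable state lives in the database, not the worker
        work_queue.task_done()
```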
Similarly, a system with a large and robust caching layer can allow application servers to become unavailable for short or medium periods of time (i.e. anywhere from 1 to 20 minutes) without affecting the overall availability of the system. Again, you must be careful to ensure that the durable state (likely stored in a database) remains consistent throughout the “outage.” The general strategy here is to ensure that the system can quickly and seamlessly recover from, or survive, short downtimes caused by system reboots, network partitions, or system updates.
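One hedged way to sketch that strategy, assuming an in-process dictionary standing in for a shared cache such as memcached or Redis, and an illustrative render function representing the application tier:

```python
import time

CACHE = {}        # stand-in for a shared cache such as memcached or Redis
FRESH_FOR = 300   # serve cached copies for five minutes

def render(path):
    """Illustrative call into the application tier, which may be down."""
    return "rendered page for %s" % path

def get_page(path):
    """Serve fresh cache when possible; fall back to a stale copy if the
    application tier is unreachable rather than returning an error."""
    entry = CACHE.get(path)
    if entry and time.time() - entry["at"] < FRESH_FOR:
        return entry["body"]
    try:
        body = render(path)
        CACHE[path] = {"body": body, "at": time.time()}
        return body
    except (ConnectionError, TimeoutError):
        if entry:
            return entry["body"]  # stale but available beats an outage
        raise
```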
Systems Administration for Cyborgs and the entire “transformation” of systems administration into “dev/ops” center on the idea that systems administration, operations, deployment, and infrastructure maintenance should be “automatable” and “programmatically managed.” While the implications and totalizing aspects of this shift will probably take several years to stabilize, high availability has long been a problem domain where success is often a function of how much of the system is automated and programmatically administered.
Because human intervention takes time, produces inconsistent results, and can lead to greater periods of unavailability, automation is truly the only way to have robust highly available systems. There are trade-offs between the benefits and relative costs of automating certain tasks. Automate all high availability systems and document their use and design.
Such deployments should perform as many of the following activities as possible without interrupting the availability of the service or requiring human intervention (a rough sketch of these activities follows the list):
Detect unavailable nodes or systems.
Using heartbeats or other tests, the infrastructure automatically detects when components become unavailable.
Failover.
When a node is no longer accessible, the larger system will remove the node from active rotation, or trigger a complete failover of the entire stack.
Resolution detection.
In cases of unavailability, after you resolve the issue you must determine the cause (e.g., network partition, power failure, or human error), then ensure the system can detect and compensate for similar events in the future.
Recovery.
After the resolution detection occurs, the system must automatically “undo the failover” so that the formerly unavailable nodes are once again included in rotation and will receive traffic.
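The following is a very rough sketch of how those four activities fit together, with illustrative hostnames and a simple reachability test standing in for real monitoring; an actual deployment would rely on a proper cluster manager and monitoring system rather than a loop like this:

```python
import socket
import time

# Map each illustrative node to whether it is currently in rotation.
NODES = {"app1.example.net": True, "app2.example.net": True}

def detect_unavailable():
    """Detection: return nodes that fail a basic reachability check."""
    down = []
    for node in NODES:
        try:
            socket.create_connection((node, 80), timeout=2).close()
        except OSError:
            down.append(node)
    return down

def failover(node):
    """Failover: remove the node from active rotation."""
    NODES[node] = False

def issue_resolved(node):
    """Resolution detection: here, 'resolved' simply means the node answers
    again; real systems may also require an operator's sign-off."""
    return node not in detect_unavailable()

def recover(node):
    """Recovery: return the node to active rotation."""
    NODES[node] = True

def run(poll_interval=30):
    """Reconcile continuously; the polling interval is illustrative."""
    while True:
        for node in detect_unavailable():
            failover(node)
        for node, active in list(NODES.items()):
            if not active and issue_resolved(node):
                recover(node)
        time.sleep(poll_interval)
```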
See also
This kind of automation is part of the kind of systematic application/infrastructure automation that typifies “dev/ops.”