Cloud computing is an innovation of marketing folks, not technologists. Although “the cloud” has heralded innovations in business, technology, and practices, the more you delve into “cloud technology,” the older the technology itself seems.
There are many components of cloud technology, but this document will assume the following basic components of “the cloud”:
Computing is non-local and the user interface is a thin layer on top of a more robust set of APIs.
As a corollary, the actual computing instruments that people handle have shrunk in size and in computing power (see: smartphones and tablets).
Vendors provide technology as a resource and service rather than a product. This affects both delivery and billing, and is true at all layers of the stack: from network and “hosting” services to user-facing applications.
As a corollary, multi-tenancy, particularly via hypervisor-based virtualization (i.e. Xen, KVM, VMWare), has provided technology firms with an avenue to control costs.
Open source, at least insofar as permissive licensing and access drive adoption, reigns.
Network connections are universal and assumed as a prerequisite for any computing activity.
While these technologies and patterns are real, and have shaped and reshaped the way that the technology industry operates and does business, they are remarkably similar to a series of other “revolutions in computing” over the past 30 years or so.
In the end, cloud computing is really just a new collection of buzzwords and a framework for thinking about the same kinds of technological questions that systems administrators have been addressing for the last 30 years. Systems Administration for Cyborgs addresses concerns relevant to all kinds of systems administrators, regardless of environment: cloud or otherwise. Nevertheless, there are some special considerations that systems administrators working with cloud computing should be aware of when designing architectures for deployments and communicating with vendors and developers, and this article addresses some of these issues.
[1] | At the same time the systems architecture needs to, at least in some senses, reflect the economics of the technology. While this is an easy ideology to subscribe to, it’s incredibly difficult to implement, at different levels of the stack, with existing technology. If you use ephemeral infrastructure, then all layers of your application stack must be able to operate (and presumably remain available) on infrastructure that stops, restarts, and fails without warning. |
The emergence of production-ready virtualization technology for commodity hardware marks a reasonable starting point for “cloud” computing. Certainly virtualization isn’t new and has been common on mainframes for decades, and although x86 systems have had virtualization tools for a while, until 2004 or even 2007 this technology wasn’t “production ready.”
With robust open source hypervisors like Xen and KVM (and to a lesser extent UML and FreeBSD-style “jails” or “containers”) it’s possible to disassociate systems and configuration (i.e. “instances”) from actual hardware. Furthermore, with simple management APIs and thin management interfaces, it became feasible (and profitable!) for vendors to sell access to virtual instances. The “cloud” makes it possible to disconnect the price and challenge of doing things with technology from the project of managing, operating, and maintaining physical machines.
Managing physical machines is still a challenge, and there is a lot of administrative work there. But rather than having “general practice” systems administrators who are responsible for applications and databases also handle the provisioning and management of hardware, dedicated administrators with experience in hardware and “infrastructure” management (i.e. networking, hypervisors, etc.) manage this layer, while other administrators manage services. Additionally, managing dozens or hundreds (or more!) of identical systems with nearly identical usage profiles is considerably easier than managing a much smaller number of machines with varied usage profiles and configurations.
Increasingly the umbrella “cloud” refers to services beyond the scope of the original “infrastructure” services, and now includes entire application stacks (i.e. “platforms”) and user-facing applications. Still, virtualization and “infrastructure” services have created and/or built upon a “culture of outsourcing” that makes all of these other cloud services possible. Infrastructure services allow developers and administrators the flexibility to develop and deploy applications while controlling, to some extent, actual infrastructure costs.
Unlike other technologies within the purview of systems administration, there are no specific pieces or even classes of technologies that define cloud computing, with the possible exception of the Linux kernel itself. However, there are some basic “cloud computing” patterns moving up from commodity x86/x86_64 hardware at the bottom of the stack all the way to end users. Thus this section will amble loosely through a typical stack from bottom to top:
The hardware.
Contrary to popular belief, hardware does matter, and it is incredibly important. Users and administrators will notice significant performance benefits from better hardware. Contemporary hypervisors have a very minimal performance overhead, but the overhead exists, and hypervisors cannot accelerate performance beyond the capacity of the underlying system.
Furthermore, virtualized systems (i.e. “hosts”) have greater overall system utilization. This means that there is less “headroom” for applications and instances that need to take extra resources, even for a short time. If you’re used to dealing with underutilized hardware systems, then you will likely find this aspect of the performance profile unexpected.
In the end, this means that virtualized platforms run a little slower than systems “on the metal.” Usually this performance penalty isn’t a practical problem, but it does mean that having fast disks, good network connections, and hosts that aren’t packed full of instances running at 100% can have a great impact on actual performance and reliability.
The storage layer.
As on conventional servers, storage systems are the leading bottleneck in virtualized environments. Although para-virtualized approaches lead to better performance, the storage problem remains largely unsolved. The challenges are as follows:
Storage requirements are huge.
Large numbers of systems, writing large amounts of data, require significant storage resources. When you factor in the redundancy required to survive hardware failures, it actually becomes difficult to fit enough storage to support the workload in a single server.
Storage throughput remains the most significant challenge in virtualized environments.
Even with storage on the local disk array, storage bandwidth and throughput are not always sufficient to keep up with demand. Which is to say that many usage profiles can easily saturate disk resources, and that the kinds of separation and isolation that other resources enjoy don’t exist at the disk layer.
Local storage is a dead end.
There’s no way to get the required redundancy (and quick recovery) from hardware failures with storage systems that are host-local. It takes too long to rebuild large RAID arrays, and it’s hard to achieve the storage density needed to be economically (let alone environmentally) efficient and redundant. Furthermore, when storage is host-local, it’s difficult to flexibly balance instances among hosts.
Remote storage is fraught.
Using remote storage rather than local storage seems like an obvious solution, but there are significant problems with remote storage. Performance is worse than local storage in most cases because of network overhead and contention, and high quality storage arrays of significant size can be quite expensive.
These factors combine to leave no good solutions to “the storage problem” in “cloud computing.” You have to make trade-offs based on your typical workload and other requirements.
The hypervisor.
The hypervisor, which runs directly on the hardware, controls and manages the guests or instances. Essentially all “cloud” deployments use one of three hypervisors: Xen, an open source hypervisor that is probably the leading tool in terms of the size of deployments because of its early stability; KVM, which consists of modules loaded into the Linux kernel and is easy to configure and feature rich; and the proprietary VMWare ESX, which has robust adoption but lacks some “cloud like” functionality in its approach. [2]
The libvirt project in some ways makes the choice of hypervisor less relevant (at least among the open source options), and in most cases the performance variation between the various options is not particularly important or meaningful. The hardest parts of maintaining virtualization infrastructure are often networking, storage, and instance provisioning (i.e. creating and duplicating instances), and not necessarily anything about the hypervisor itself. As of right now, and with the possible exception of VMWare, there is no hypervisor that makes the hard problems of hosting and infrastructure management meaningfully easier or harder.
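As a rough illustration of what the libvirt abstraction buys you, the following sketch uses libvirt’s Python bindings to list the instances on a host; the connection URI is an assumption (a local KVM host here, but a Xen or remote URI works the same way), and a real management layer would of course do far more than this:

    # A minimal sketch using the libvirt Python bindings (the libvirt-python
    # package). The connection URI is an assumption: "qemu:///system" for a
    # local KVM host; "xen:///system" or a remote URI behaves the same way.
    import libvirt

    def list_instances(uri="qemu:///system"):
        conn = libvirt.open(uri)                  # connect to the hypervisor
        try:
            for dom in conn.listAllDomains():     # every defined instance
                state = "running" if dom.isActive() else "stopped"
                print(f"{dom.name()}: {state}")
        finally:
            conn.close()

    if __name__ == "__main__":
        list_instances()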
[2] | There are some nuances in the hypervisor comparison, but take the following as a rough overview. Xen is great software, and once it’s running it just works; it’s widely used because it was the first open source hypervisor to actually work. More recently it’s been slow to update, and it can be really difficult to get a Xen system up and running, particularly with some less-standard hardware (e.g. some RAID cards, newer processors, NICs). Once you have something working, though, there’s never any reason to change. Amazon Web Services and other large hosting providers all use Xen. KVM is really easy to get set up–it runs on most contemporary laptops, even–and is functionally sufficient. However, KVM (and the Linux kernel itself) develops a little too fast for the cutting edge to be fully stable, and it has probably suffered in large scale adoption as a result. It’s difficult to speculate on adoption, but KVM saw success early on with running Windows guest instances, and likely continues to see use in smaller scale deployments. VMWare ESX, and VMWare in general, has earned the confidence of many corporate IT departments, and typical VMWare deployments may consist of dozens of machines. Its pricing scheme makes it less economical for large scale deployments, and historically it’s been difficult to disconnect instances from particular hosts in the context of multiple-host deployments. |
The management operating system.
Hypervisors typically have one “instance” that either runs (in the case of KVM) or very nearly runs (for VMWare and Xen) “on the metal” (i.e. on the hardware) and has special administrative privileges. Access to the management operating system provides access to the networking and potentially the block devices (i.e. disks), and commands to start, stop, pause, and migrate guest instances must all originate from these systems.
The instance management layer.
If you have more than a handful of instances and hosts, you will eventually need management tools just to keep track of available instances, capacity, and state of the entire system. This layer must interact with and potentially control the hypervisor itself.
OpenStack and Eucalyptus are two complete open source “instance management solutions,” and libvirt provides a consistent framework for building hypervisor management tools. Nevertheless, most vendors use a custom solution. End-users should be able to interact with this layer programmatically via a public API and directly via a (web) console interface.
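As a sketch of what “programmatic” interaction with this layer looks like, the following uses Amazon EC2’s public API via the boto3 library; OpenStack, Eucalyptus, and most vendors expose analogous calls, and the image ID and instance type below are hypothetical placeholders:

    # A rough sketch against a public instance-management API (EC2 via
    # boto3); other vendors' APIs are analogous. The image ID and instance
    # type are placeholder values.
    import boto3

    ec2 = boto3.client("ec2")

    # Launch a single small instance from a (placeholder) machine image.
    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",   # hypothetical image ID
        InstanceType="t2.micro",
        MinCount=1,
        MaxCount=1,
    )
    print("launched", response["Instances"][0]["InstanceId"])

    # Enumerate everything currently provisioned and its state.
    for reservation in ec2.describe_instances()["Reservations"]:
        for instance in reservation["Instances"]:
            print(instance["InstanceId"], instance["State"]["Name"])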
The instance.
From the perspective of the hypervisor, the instance consists of nothing more than a disk image (or block device), an allocation of memory and processor time, and one or more virtual network interfaces.
From the perspective of the user, however, the instance is a complete and independent network host, with a fully functional operating system and storage system.
Most administrators of “cloud” systems begin interacting in earnest at this layer, and from this layer up, “the cloud” behaves like other systems that you may be familiar with.
The instance configuration layer.
Configuration management is optional, but it makes it possible to build management systems that can scale to support a large number of instances.
Configuration tools take a base install (or even a blank disk image) and install software, edit files, and run commands until the system has reached a desired “configuration state.” Sometimes configuration management systems are just a collection of deployment scripts; other times, configuration management takes the form of pre-built image templates. Some deployments use more flexible configuration management tools like Chef and Puppet that can configure new deployments but also help these systems stay up to date in the long term.
You may think of the configuration layer as a kind of “Makefile for entire systems.” Everyone who is managing more than, say, 3 systems needs to have some strategy for managing deployments as well as configuration in the long term. Not every deployment needs a special tool, but it’s absolutely crucial that you have some “story” to address configuration.
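To make the “Makefile for entire systems” idea concrete, here is a minimal sketch of the “home grown script” end of the spectrum; the package names and file paths are hypothetical, and tools like Chef or Puppet replace this kind of script with declarative recipes:

    #!/usr/bin/env python3
    # A bare-bones configuration script: install packages, lay down a config
    # file, and reload the service only when something actually changed.
    # The package names and paths are hypothetical examples.
    import pathlib
    import subprocess

    PACKAGES = ["nginx", "ntp"]
    CONFIG_SRC = pathlib.Path("files/nginx.conf")
    CONFIG_DST = pathlib.Path("/etc/nginx/nginx.conf")

    def ensure_packages(packages):
        # apt-get skips packages that are already installed, so this is
        # safe to run repeatedly.
        subprocess.run(["apt-get", "install", "-y", *packages], check=True)

    def ensure_file(src, dst):
        # Only copy the file when the contents differ; report whether
        # anything changed so the caller can decide to reload services.
        if not dst.exists() or dst.read_bytes() != src.read_bytes():
            dst.write_bytes(src.read_bytes())
            return True
        return False

    if __name__ == "__main__":
        ensure_packages(PACKAGES)
        if ensure_file(CONFIG_SRC, CONFIG_DST):
            subprocess.run(["systemctl", "reload", "nginx"], check=True)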
The platform.
This layer of the “cloud” stack is somewhat elastic, but typically refers to everything that isn’t formally part of your application and that supports the application layer. Third party tools, run-time environments, and other dependencies fit into this layer, as do common services like load-balancing, caching, database/persistence, LDAP, email, monitoring, networking, and so forth.
The platform needn’t reside on a single system, and in most deployments a number of (virtualized) systems will fulfill these tasks. Indeed, some vendors even provide “platform as a service,” which can be an easy way to avoid most systems administration overhead for developers and common users. In practice, most systems administrators spend a large share (if not the majority) of their time working to maintain this level of the stack.
The application.
As important as “the application” is to the end-users and to the business interests that drive the “cloud,” the application is often pretty straightforward and doesn’t require a lot of ongoing administration overhead, aside from occasional upgrades, updates, and ongoing monitoring.
Most administrators need to know very little about the “cloud stack” to make use of “cloud technologies” and to deploy or manage “cloud instances.” While this is a great strength of “the cloud,” be wary of thinking that cloud systems are just like conventional ones.
In conventional deployments, you deploy a number of systems, all running a base install, all running directly on the hardware, and with a lot of capacity. Some servers provide front-end functions like web servers and load balancers and are publicly accessible; others provide internal functions like databases and shared file systems and are only accessible internally. When you need to provide a new service, you may deploy a new system, or you may decide that one of the existing servers has capacity and install the service and configuration on an existing machine.
Depending on the scope of your requirements and the kind of scaling [3] you require, this conventional approach works really well, and there’s no reason to think about systems (i.e. instances) as anything other than persistent, durable, and always available. [4]
Administrators who take this conventional approach in “the cloud,” will not be as successful as those administrators who take a more adaptive approach for a number of reasons:
vendors offer limited up-time guarantees.
While in most cases up-time guarantees for cloud systems are consistent with what you would be able to obtain with conventional systems, you must add your own systems’ recovery time on top of the vendor’s recovery time. “Cloud” users often need more elaborate failover plans in place to compensate. In most cases this is not an issue unless you’re beholden to an SLA or your own up-time guarantee that you can’t meet in light of your provider’s up-time guarantee.
systems may reboot frequently.
Occasionally hypervisors and the management layer will run into bugs that require a hard reboot. These reboots often come with little or no warning and can lead to downtime that lasts anywhere from a few minutes to an hour or so. Depending on the provider and your internal configuration, you may lose state for some or all of this time. This is obviously a disadvantage of the cloud, but with proper availability planning and redundancy it’s possible to manage these risks. In many cases, savings in infrastructure costs and added flexibility compensate for this additional complexity and risk.
While “the cloud,” highlights the risk of unexpected system reboots, failures, and poor recovery from reboots, in truth these are real risks with any kind of systems infrastructure: virtualized or not, cloud or not.
billing is per-instance per-hour (sometimes, per-instance per-day).
This is a major sales point for cloud vendors, and while it does mean that you can potentially save money by “only paying for what you need,” in practice cloud users end up slightly over-provisioning for reliability reasons and running systems in a more persistent way. “Automatic scaling” tends to be most useful for systems with a significant workload that can run in batched/parallel operations. While this is a significant possibility, most work requires some layer of persistence.
While it’s possible to boot instances quickly from any cloud provider, even with the best configuration automation systems the delay before the instance is actually operational is rarely less than 15 minutes, and more frequently closer to half an hour or an hour.
[3] | Many systems have stable or predictable workloads that do not require capacity flexibility. On the other hand, changes in demand and workload can impact your management strategy. |
[4] | Or nearly always available. It’s unrealistic to assume that a system will always be available, given system updates, reboots, power and network failures, software bugs, and user errors, but in truth working systems often keep working for a long time without intervention. See the “High(er) Availability Is a Hoax” document for more information. |
Initially, the fact that administrators could create cloud instances on demand was the selling point of the cloud. Additionally, some of the very early success stories were from companies and start-ups that were able to succeed because their applications were able to scale up and down to meet demand without administrator intervention. This is a great sounding idea in many respects: you only have to pay for what you actually need, and you don’t have to keep servers running because you expect that you might have demand.
While this seems like a very clever technical hack of an economic situation, there are several reasons why it’s primarily successful as part of a sales playbook:
Writing code that controls the auto-scaling features is difficult and must be custom for every application.
The proper time to expand capacity is a determination that requires fuzzy reasoning. As a rule of thumb, it’s wise to add additional resources when you’re using 80% of available capacity, but what constitutes “80%” isn’t always clear (see the sketch after this list).
Some layers of the stack, like databases, are more difficult to scale under load. Conversely, if you have a web layer behind a load balancer, adding additional application servers is trivial and can happen under load without challenge.
While many servers and applications have load that varies throughout the day, the difference between peak capacity requirements and off-peak capacity requirements may only be 10-to-20%. This is not a leading source of cost overrun, and therefore not a priority for administrator and developer time.
Three application servers, two load balancers, and 2-3 database servers can actually handle an impressive amount of traffic. While you might add an additional application server, these deployments are pretty efficient anyway.
It can take a while to “spin up” an instance to add to a cluster, so while it’s “quick” in comparison to adding a server to a rack, it’s not quick relative to common workload patterns.
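To illustrate the point about fuzzy thresholds, the sketch below encodes the “80% of capacity” rule of thumb as a toy scaling decision; the thresholds, bounds, and the source of the utilization samples are all hypothetical, and the surrounding plumbing (metrics, cool-down periods, layers that can’t grow under load) is where the real work lives:

    # A toy auto-scaling decision built around the fuzzy "80% of capacity"
    # rule of thumb. Thresholds, bounds, and the metric source are
    # hypothetical; real deployments pull samples from their monitoring
    # system and rate-limit changes.
    from statistics import mean

    def desired_instance_count(current, utilization_samples,
                               scale_up_at=0.80, scale_down_at=0.40,
                               minimum=2, maximum=10):
        load = mean(utilization_samples)   # smooth out short spikes
        if load >= scale_up_at and current < maximum:
            return current + 1
        if load <= scale_down_at and current > minimum:
            return current - 1
        return current

    # e.g. desired_instance_count(3, [0.85, 0.90, 0.82]) returns 4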
This doesn’t mean that the cloud is bad or that there aren’t real and tangible benefits for administrators, but it’s important to balance actual “business” and application requirements with the real capabilities of the system.
See also
“High(er) Availability Is a Hoax” and “Database Capacity and Scaling“
As individual administrators gained responsibility for larger numbers of systems, it became apparent that administrators needed more robust tools and approaches for automating system configuration. While cloud computing accelerated this trend, the trend originates earlier. Indeed contemporary package management was part of an early attempt to address this issue.
When you have a potentially large number of systems at your disposal, there are a couple of fundamental shifts that emerge:
Systems can be single-purpose.
It’s possible to isolate processes and functions to specific systems, which reduces the chance that load on one instance will affect the performance or behavior of another process or system. While in practice there tends to be some slippage towards multi-purpose configurations, single-purpose systems tend to work more reliably, are easier to administer, and can be easier to fix when there’s an issue.
Systems can be ephemeral.
If there is a reliable way to configure and deploy a specific class of machines, then the up-time or durability of any particular instance or system is not crucial. Vendors can reboot servers at will, and it’s possible to recover from hardware failures without too much downtime.
Reliable configuration management is crucial.
It’s not enough to be able to deploy configuration. Applying configuration to a system is a straightforward operation: run a sequence of commands to install packages, edit a number of files, and then run a series of commands to start services and finalize any other operations.
With sufficient testing it’s possible to reliably build systems, or build reusable systems; however, the larger the deployment, the more difficult it becomes to keep these systems up to date and consistent with each other across software updates and minor configuration changes.
There are two sensible approaches to configuration management:
Configuration enforcement systems take a recipe or a pattern and are capable of taking a base image and applying the pattern to produce a configured system. Some of the newer systems will “check in” with a master configuration server on a regular basis to ensure that the system is always up to date. These systems make it easy to push small configuration changes across a large number of machines and to keep configurations from drifting, but they require more supporting infrastructure and more up-front work to develop and test the recipes themselves.
Conversely, instance templates can be quite powerful: they’re easy to build, easy to test before deployment, and they generally make it easy to prevent drift between systems. At the same time, template systems make it more difficult to make small configuration changes quickly. Templates are particularly well suited to deployments that have a large number of instances but only a small number of different configurations, and to generally stable deployments that don’t change very often.
There are a number of tools that support configuration management, but many administrators manage configurations with “home grown scripts” in common programming languages. Consider the following inventory of these tools:
OpenStack is a project sponsored by a consortium of web hosting and infrastructure providers with the goal of providing a “cloud in a box” solution with two main audiences: hosting providers who want to offer public cloud services, and organizations that want to run “private clouds” on their own hardware.
Part of this management layer is an image hosting and management service that allows you to produce and manage virtual machine templates for your OpenStack-based cloud. While OpenStack is by no means the only way to manage a collection of image templates and deploy instances from those templates, OpenStack’s “Image Service” was purpose-built to manage virtual machine instance templates, which surely provides some benefit.
Note
I have limited experience with OpenStack, but it may be a useful component in managing an image-based configuration management system.
Puppet, Chef, and Salt are second generation configuration management tools for describing a configuration pattern (i.e. a recipe) and implementing or applying it to a system. These tools provide a way to describe system configuration in a regular, programmatic format: it’s possible to write recipes that inherit from each other, and it’s trivial to deploy multiple systems that follow the same template or use related templates.
Beyond defining and applying configuration, Chef, Puppet, and Salt have configuration enforcement systems that run on a regular schedule: [6] they check in with a centralized configuration server and download new configuration if available. If the tool finds an updated configuration, it applies the configuration changes and ensures that the configuration has not drifted since the last run. [5]
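As a sketch of the general shape of this “check in, download, apply” cycle (not the actual protocol any of these tools use), consider the following toy pull loop; the configuration URL, interval, and apply step are hypothetical stand-ins:

    # A toy pull-based enforcement loop in the spirit described above.
    # The URL, interval, and "apply" step are hypothetical stand-ins for
    # what Chef, Puppet, or Salt actually do.
    import json
    import time
    import urllib.request

    CONFIG_URL = "https://config.example.com/node/web1.json"  # hypothetical
    INTERVAL = 30 * 60                                         # seconds

    def fetch_desired_state(url=CONFIG_URL):
        with urllib.request.urlopen(url) as response:
            return json.load(response)

    def apply_state(state):
        # Stand-in for the real work: install packages, write files,
        # restart services, and correct anything that has drifted.
        print("applying", state)

    while True:
        apply_state(fetch_desired_state())
        time.sleep(INTERVAL)   # wait *after* the run completes (cf. [6])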
Initially, one of the primary goals of these tools was to abstract the differences between underlying operating systems. The idea was that administrators would define configurations separately from the actual deployment and deployment platform. If the system were complete, administrators could then apply the same configuration to Fedora systems as to Ubuntu systems (for example). In practice, however, most large deployments are not this diverse, and other features–like configuration enforcement–have become more important.
For many environments, the configuration layer provides general system automation and deployment automation. While this may seem a bit incongruous, configuration management tools are very serviceable deployment tools. [7]
The primary difference between Puppet, Chef, and Salt is the interface for describing configuration patterns: Puppet uses a more declarative and limited domain specific language, Chef describes configurations in Ruby, and Salt typically describes them in YAML (and is extensible in Python). Puppet and Chef are both written in Ruby, while Salt is written in Python.
[5] | This “enforcement mode” is optional and only runs if configured. It’s also possible to apply configuration descriptions directly, without an established configuration infrastructure. A limited on-demand operation is also available. |
[6] | Chef will check for updates and changed configurations some interval (e.g. 30 minutes) after the last operation (update check or configuration application) completed. Because system configuration operations can last anywhere from a number of seconds to a few minutes or more, this approach prevents a group of servers from checking for updates at the same time and prevents the next “scheduled update run” from conflicting with long-running operations. |
[7] | There is substantial cross-over between deployment and build systems and configuration systems, at least theoretically. Both kinds of systems define reproducible processes for assembling a final output using a procedure built with available components. |
Fabric approaches the configuration management problem as a deployment tool. Fabric files use Python syntax, with a simple and straightforward interface for describing system build processes, deployment procedures, and other system automation. If you don’t have a programming background, Fabric may be a good way to get started with systems automation.
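For instance, a fabfile along the following lines describes a small deployment procedure; it assumes the modern (2.x) Fabric API and a hypothetical host, package, and configuration file (older fabfiles import from fabric.api instead):

    # fabfile.py -- run with: fab -H web1.example.com deploy
    # A minimal sketch assuming Fabric 2.x; the host, package, and config
    # file are hypothetical.
    from fabric import task

    @task
    def deploy(c):
        c.run("sudo apt-get install -y nginx")
        c.put("nginx.conf", "/tmp/nginx.conf")            # upload local config
        c.sudo("mv /tmp/nginx.conf /etc/nginx/nginx.conf")
        c.sudo("systemctl reload nginx")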
BCFG2 is a first generation configuration management tool, similar to solutions like Puppet and Chef in scope and goal. However, BCFG2 defines configuration using an XML-based syntax, and has a number of features that support collecting data from target systems and using this information for auditing purposes and also to inform and shape how to apply configuration to systems.
Vagrant isn’t a configuration management tool as such, but is rather a virtual machine management tool that lets administrators and Chef users run ephemeral or disposable virtual machine instances on desktop systems for testing. Tools like Vagrant are essential for testing configuration scripts without expending a great deal of time and resources. At the same time, you’ll also need to do a layer of testing on your actual infrastructure before deploying.
If you’re using some hosting platform that you control, like your own “private” OpenStack instance, it may make sense to do all of your testing on that platform.
Cloud computing, and the mandate to deploy and manage a large number of systems, has led to a general reevaluation of the configuration management problem and the development of a number of tools. Nevertheless, there is no single solution to the problem of configuration management: there are a number of very different approaches to configuration management, system provisioning, and deployment.
Despite the differences in approaches and tools, a few common themes and practices have emerged. When developing policy for your system configurations, consider the following strategies, regardless of the tool you choose to use: [8]
Prioritize reproducibility over simplicity and ease.
When configuring systems, make it possible to easily and automatically recreate and reconfigure all of your instances. It makes a lot of sense to configure and tune systems by hand when the systems are long lasting and you will interact with them directly and personally (e.g. your workstation, or a primary remote server); however, most of the systems you will administer in a production environment have a single purpose or a small number of related functions, and custom configurations don’t make much sense.
At the same time, it is easier and faster to deploy a base template, install the dependencies you need, attempt to run the service, fix any insufficient configuration, try again to run the service, and repeat the process as necessary.
Resist this temptation. If you need your systems to be maintainable in the long run, if you ever need to deploy a test environment, or if you ever need to share your administrative workload with anyone else, ensure that the configuration is reproducible.
Design systems with minimal state that you can delete and rebuild without causing any application or user errors.
There is a significant relationship between reproducible configurations and minimizing the amount of semi-persistent state stored in an instance. Minimizing state (the important information stored in scripts and configuration files saved locally on a few machines) makes any individual system less important and more disposable. This is another trade-off between simplicity (setting some tasks and services to run from specific systems) and the additional work and setup costs of writing a system that centralizes state. [9]
Maximize development/operations interaction and collaboration.
The term “dev/ops” is terribly popular these days, and the term is often overused. Most often people say “dev/ops,” when they mean “the things systems administrators have always done,” but there are a couple of insights buried in the buzzword that are worth drawing out.
First, systems administrators must think about their problems as development problems, weighing optimization [10] against maintenance costs. Second, it’s absolutely crucial that development and operations teams communicate and collaborate.
Furthermore, software development since the late 1990s has moved towards a more iterative model, where the users and “business drivers” of a piece of software got more involved in the development process; dev/ops is about a similar shift in operations work, where developers become more involved in the administration of their software.
Make sure that development teams know how operations procedures work and how software and changes get deployed and tested. Also, as an administrator, attempt to make sure you know what’s going on in the technical organization or organizations (i.e. your users and “clients,”) at all times so that you’re never surprised by new requirements or deadlines.
Use versioning and track changes to production systems.
Version control systems, when properly understood, are a great asset to administrators. First and foremost, they support and enable experimentation: if you know that you can revert to any previous state instantly then it becomes easy to try new things, test different configurations, and make changes that you might not feel comfortable making under normal circumstances.
Furthermore, version control systems are simple and lightweight enough that they often make sense as a way of keeping a collection of shared files synchronized across a number of machines. Certainly for larger amounts of shared data, or rapidly changing data, this is not practical, but for configuration files and administrative scripts a version control repository is ideal.
[8] | Having a configuration management tool is by no means a requirement. Configuration tools–like all tools–are simply means to implement and enforce a process. They are particularly useful in situations where you do not have a well developed process, or must (quickly) extend an existing process to a new scale (i.e. extending a process that worked when an administrator had to manage 5 systems to a process that will allow an administrator to manage 500 systems). The tool alone cannot create process or enforce any particular standard: if you have a process that works, or a process that doesn’t cause you to worry when you think about multiplying your responsibility by the required amount of growth for your application, then you can probably continue without a formal configuration management tool. |
[9] | Distributed systems should theoretically avoid centralization at all costs. However, for larger and more complex systems, “centralized” state makes a lot of sense. Typically this just means storing state data in an authoritative database or data store, or making sure that scripts can run on all systems in a failover situation even if they only run on some systems in common practice. Also, it’s possible to configure deployments with entirely distributed services (i.e. distributed application, monitoring, and database services), where every layer’s internal distributed nature is hidden from the others. Which is to say, you can use a distributed database, but the application can treat the database system as an authoritative source for state-related information. |
[10] | Systems administrators, like other developers, must be wary of premature optimization and micro-optimization. |