Production Systems and Testing Separation

The first rule of managing deployments that people actually use is: don’t do anything that will affect the availability of the service. Ever. Once you get something set up and working, don’t touch anything and don’t break anything. This is a difficult challenge: maintenance often interrupts service, and any change to a system, including required configuration changes, upgrades, or system updates, can break a working system. At the same time, every deployment and system requires configuration changes, upgrades, and updates. Balancing these requirements is a core problem in systems administration.

Ensuring a stable production environment requires appropriate testing infrastructure and sufficient policies and automation around application deployment to ensure that deploying new software or running software updates is reliable and does not require manual intervention.

This document addresses both halves of this problem, the infrastructure and the policy, and describes methods, techniques, and strategies that make it possible for administrators to ensure reliable application and system updates.

Key Concepts

The following concepts introduce crucial components and requirements of deployment testing infrastructure.

change control
Systems that monitor production environments for changes and modifications and alert administrators to intrusions. As a result, change control is typically used as a security measure. The term can also refer to policies that control how changes propagate to production systems, and change control tools can be used to enforce testing and staging policies.
development environment
A deployment of your application for exploratory and development use. While these environments resemble the production environment, they are often much smaller in terms of available resources and data. Development systems are what administrators and developers use to test and experiment with changes before implementing them in the test environment. These may run in virtual machines that resemble the test environment, or on developers’ laptops.
fire call
A method used by administrators to get emergency administrative access in deployments where no individual has “root” or administrative access. This both allows emergency work on a deployment and helps administrators avoid accidentally modifying systems.
production environment
The servers that clients actually interact with; these servers do the actual work of the deployment. Administrators should modify instances and applications running in this environment as little as possible.
rollback
The process of “unwinding” a change applied to a production environment, reverting the environment to a previous known working state.
test environment
Servers that provide a very accurate reproduction of the production environment. Use this as a final testing substrate before implementing changes or updates on the production system. Sometimes it’s not feasible to fully replicate the production environment in test, but the differences ought to be minimal, so that they don’t cause unexpected problems. Virtualized production and test environments make it easier to replicate the production environment accurately in test. Some organizations and resources refer to test environments as “staging environments.”

Testing Infrastructure

Having a secure and reliable testing environment is essential; without one it’s impossible to verify that changes are “good” before deploying them. However, the extent of your testing, and the tolerances you accept, depend on administrative requirements and on stakeholder [1] needs.

When configuring your test environments consider the following basic requirements:

  • Testing needs to be easy. In order to ensure that you and your developers will rigorously test changes and updates, testing changes needs to be trivially easy. In addition to any other interface, it must be easy to restart the environment, reset it to a “base configuration,” and “back out” of bad configurations.

    Usability in this case also mandates some measure of performance. If it takes too long to reset an environment, or if the environment is too slow for any number of reasons, test environments are less likely to get used.

  • Use a deployment automation system, whether something custom based on build scripts, make files, or a similar tool, to ensure consistency.

  • Use virtualization to isolate environments. Tools like Vagrant are great for this purpose (assuming that it doesn’t take too long to rebuild the test infrastructure). A minimal sketch of automating environment resets follows this list.
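
The following sketch, in Python, shows one way to script environment resets. It assumes a Vagrant-managed test environment in the working directory and uses an illustrative ten-minute threshold for flagging slow rebuilds; neither is something this document prescribes.

    # reset_test_env.py -- a minimal sketch of resetting a Vagrant-based test
    # environment to its base configuration. Directory and threshold are examples.
    import subprocess
    import sys
    import time

    def reset_environment(vagrant_dir="."):
        """Destroy and rebuild the test environment from its base boxes."""
        start = time.time()
        # Tear down the existing machines without prompting for confirmation.
        subprocess.run(["vagrant", "destroy", "-f"], cwd=vagrant_dir, check=True)
        # Rebuild from the Vagrantfile, re-running any provisioning steps.
        subprocess.run(["vagrant", "up"], cwd=vagrant_dir, check=True)
        elapsed = time.time() - start
        # If resets take too long, the test environment is less likely to get used.
        if elapsed > 600:
            print("warning: reset took %d seconds; consider caching base boxes" % elapsed,
                  file=sys.stderr)

    if __name__ == "__main__":
        reset_environment(sys.argv[1] if len(sys.argv) > 1 else ".")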

[1] While the term “stakeholders” comes to us from the world of management and carries a certain amount of distaste in the minds of most systems administrators, it’s useful to be able to recognize where operational needs originate.

For some services, the administrators are the main consumers or stakeholders. Directory services, management tools and databases, logging and monitoring systems, and so forth are all primarily used to support infrastructure. For other systems, such as file servers and web-based applications, other groups dictate operational requirements and tolerances.

Deployment Processes and Policy

Once you have the infrastructure to perform testing, it’s important to ensure that you actually perform tests. Software developers use continuous integration systems to automate tests, and in some cases you can automate testing for deployments using a similar method. Often, however, the kind of testing that administrators need to do is more complex.

Where programmers can often write test cases that verify the behavior of a single program, operational testing requires not only that a single program behave correctly, but that an entire collection of programs behave correctly together in a specific environment. In the process of testing it’s important to be able to answer the following questions with confidence (a sketch of one automated check follows the list):

  • Will this upgrade or change break any dependent service? For example, does upgrading an LDAP (directory) service impact email services?
  • Does this upgrade or change introduce any (new) client compatibility issues? For instance, would switching to SSL SNI for HTTPS break compatibility with clients that you must support? (In this case, as of 2011, the answer is usually yes.)
  • Does one change (e.g. deploying a new version of an existing application) require other configuration changes (e.g. the installation of a library, changes to networking rules, or changes to files beyond what’s contained in the upgrade itself)?
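
To answer questions like these before touching production, you can script simple dependency and compatibility checks and run them against the test environment after every change. The sketch below is one hedged example in Python; the host names and ports are placeholders for your own dependent services, not part of any layout this document assumes.

    # smoke_test.py -- a sketch of post-upgrade checks run against a test environment.
    # Host names and ports are placeholders; substitute your own dependent services.
    import socket
    import ssl

    def check_tcp(host, port, timeout=5):
        """Return True if a plain TCP connection to host:port succeeds."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def check_sni(host, port=443, timeout=5):
        """Return True if a TLS handshake using SNI succeeds, to catch SNI breakage."""
        context = ssl.create_default_context()
        try:
            with socket.create_connection((host, port), timeout=timeout) as sock:
                with context.wrap_socket(sock, server_hostname=host):
                    return True
        except (OSError, ssl.SSLError):
            return False

    if __name__ == "__main__":
        checks = {
            "ldap reachable": check_tcp("ldap.test.example.net", 389),
            "smtp reachable": check_tcp("mail.test.example.net", 25),
            "https handshake": check_sni("www.test.example.net"),
        }
        for name, ok in checks.items():
            print("%-18s %s" % (name, "ok" if ok else "FAILED"))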

There are a number of different policies at the organizational level that can help you support testing requirements. Typically these standards and practices revolve around making testing easier, less burdensome, more automated, and more integrated into existing workflows. For instance, consider the following:

  • Mandate reviews and signoffs for changes. Make sure that, except for fire call situations, more than one administrator is responsible for reviewing and signing off on any change. This may not be possible in small teams or for some sets of changes, and while multi-signoff policies lengthen timescales considerably, fresh eyes and different perspectives are quite useful and prevent many bugs and issues.

    If you manage configuration and deployment programmatically, all changes must be code reviewed before they propagate to the production system.

  • Integrate testing into other tools and workflows. This includes testing infrastructure that is automated (of the “continuous integration” type), connected to change requests and ticketing, or integrated with version control tools (see the sketch after this list).

  • Provide local preliminary (“dev”) testing. If developers and administrators have an easy way to test changes and become familiar with the software, it’s more likely that they will test code regularly and experiment in test environments rather than in production. Lower barriers to entry are key to ensuring that developers use these systems.
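
As one example of tying testing to version control, a pre-push hook can refuse to publish changes that fail the test suite. The sketch below assumes a Git repository and a pytest-based suite in a tests/ directory; both are assumptions for illustration, not requirements.

    #!/usr/bin/env python3
    # .git/hooks/pre-push -- a minimal sketch of integrating tests with version control.
    # The test command is a placeholder; substitute your own test runner.
    import subprocess
    import sys

    result = subprocess.run(["python3", "-m", "pytest", "tests/"])
    if result.returncode != 0:
        sys.stderr.write("pre-push: tests failed; refusing to push.\n")
        sys.exit(1)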

Rollback

Controlling access to resources and providing testing environments is crucial for maintaining production systems. While there are no substitutes for implementing policies and procedures to protect deployments and ensure that updates and upgrades go smoothly, it’s important to provide a rollback option for when an upgrade has unforeseen consequences. Rollbacks allow you to return a deployment to a previous “known working state” if something breaks.

There are a few methods and technologies that you can use to provide rollbacks:

  • Use LVM or some other file-system or block-level snapshotting tool to create a backup of a system before applying the change. If something goes wrong with the upgrade process, you can roll back to a previous state (a sketch of this approach follows the list).
  • If running in a virtualized environment, duplicate the instance, upgrade the server, and perform a manual failover (swap the IP addresses) to the duplicate if you need to roll back. Ensure that modifying the IP address works (i.e. send appropriate ARP requests).
  • Use a script to apply the changes, write a rollback function that reverses every change you apply, and test both in your test environment.
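
The following sketch illustrates the snapshot-based approach with LVM. The volume group, logical volume names, and snapshot size are illustrative assumptions; note that merging a snapshot back into a mounted root volume only completes the next time the volume is activated, typically after a reboot.

    # snapshot_upgrade.py -- a sketch of snapshot-based rollback using LVM.
    # Volume names and snapshot size are examples; adjust them for your systems.
    import subprocess
    import sys

    VG, LV, SNAP = "vg0", "root", "pre-upgrade"

    def create_snapshot():
        # Reserve copy-on-write space to capture changes made during the upgrade.
        subprocess.run(["lvcreate", "--snapshot", "--name", SNAP,
                        "--size", "5G", "/dev/%s/%s" % (VG, LV)], check=True)

    def rollback():
        # Merge the snapshot back into the origin volume, undoing the upgrade.
        subprocess.run(["lvconvert", "--merge", "/dev/%s/%s" % (VG, SNAP)], check=True)

    def commit():
        # The upgrade succeeded: discard the snapshot and reclaim its space.
        subprocess.run(["lvremove", "-f", "/dev/%s/%s" % (VG, SNAP)], check=True)

    if __name__ == "__main__":
        {"create": create_snapshot, "rollback": rollback, "commit": commit}[sys.argv[1]]()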

In general, you should script and automate rollbacks, just as you do deployment processes, so that it’s possible to back out of an update without needing to remember the sequence of operations that you performed to update the system. Sometimes this is reasonably complicated, as in the case of operating system updates and upgrades. In other situations it may be as simple as changing a symbolic link, as in some application deployment schemes (a sketch of that pattern follows). Above all, remember to be as rigorous about rollbacks and testing as you are about the updates themselves.
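
A sketch of the symbolic link pattern appears below. The releases/ directory plus “current” symlink layout and sortable release names are assumptions for illustration; the important property is that activation and rollback are a single atomic link swap.

    # symlink_release.py -- a sketch of symbolic-link based activation and rollback.
    # The /srv/app layout and sortable release names are assumptions, not a standard.
    import os

    RELEASES = "/srv/app/releases"
    CURRENT = "/srv/app/current"

    def activate(release):
        """Point the 'current' symlink at a release directory, atomically."""
        target = os.path.join(RELEASES, release)
        tmp = CURRENT + ".tmp"
        if os.path.lexists(tmp):
            os.unlink(tmp)
        os.symlink(target, tmp)
        os.replace(tmp, CURRENT)  # atomic rename over the old symlink

    def rollback():
        """Reactivate the previous release, assuming release names sort by age."""
        releases = sorted(os.listdir(RELEASES))
        if len(releases) >= 2:
            activate(releases[-2])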

Change Control

Change control software monitors systems and applications to ensure that configuration remains constant and that configuration changes are not implemented outside of normal change policies. This is typically implemented as a special kind of monitoring or intrusion detection system.

While it’s important to develop policies regarding changes to production systems, it’s also important to provide some method of ensuring that the systems remain intact and that untracked changes to the production system don’t impact its integrity or affect its operational conditions. Change control may help you address this requirement.

Change control is a difficult problem, and a full treatment is beyond the scope of this article. As a security practice it’s reactive, and reliable change control is difficult to implement effectively. [2] For many (or most) deployments, the kinds of intrusion detection solutions typically used for change control are overblown.

Even if your deployment does not merit a change control solution, collecting some “change control data” may be useful. For instance, logins and daemon restarts may indicate tampering, and your logging and monitoring system should track these events. Also, use privilege escalation systems like sudo, which provide more logging, rather than shared privileged accounts for administrative tasks.
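
As a hedged example of collecting such data, the sketch below hashes a handful of configuration files against a stored baseline and reports anything that changed outside of the normal change process. The watched paths and baseline location are illustrative, not prescribed.

    # config_baseline.py -- a sketch of simple "change control data": compare
    # configuration files against a recorded baseline. Paths are illustrative.
    import hashlib
    import json
    import os
    import sys

    WATCHED = ["/etc/ssh/sshd_config", "/etc/sudoers", "/etc/nginx/nginx.conf"]
    BASELINE = "/var/lib/config-baseline.json"

    def checksum(path):
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    def snapshot():
        # Record the current state of the watched files as the approved baseline.
        data = {p: checksum(p) for p in WATCHED if os.path.exists(p)}
        with open(BASELINE, "w") as f:
            json.dump(data, f, indent=2)

    def compare():
        # Report files that changed (or disappeared) since the baseline was taken.
        with open(BASELINE) as f:
            baseline = json.load(f)
        changed = [p for p, digest in baseline.items()
                   if not os.path.exists(p) or checksum(p) != digest]
        for p in changed:
            print("changed outside of change policy: %s" % p)
        return 1 if changed else 0

    if __name__ == "__main__":
        if sys.argv[1:] == ["snapshot"]:
            snapshot()
        else:
            sys.exit(compare())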

[2] Typically, if a user has the access needed to impact a production system, they also have the ability to affect the change control monitor itself. Beyond this, change control systems cannot prevent intrusions or unwanted modifications except through Foucauldian methods, and can only report on them after the fact.

Policy, Auditing, and Production Testing

Maintaining separation between test and production environments, as well as a usable and reliable deployment system, is not a significant technological problem: it’s a social and policy problem. To properly address these concerns you need some infrastructure, but you really need to develop policies and procedures that make sense in the context of your environment and that all of your administrators and operators can work within.

Devising policies that are functional from an administrative perspective is a requisite first step, but it’s also important to ensure that the policy is sufficiently flexible. A rigid policy may not allow for a timely administrative response to unforeseen bugs or system events, which can be devastating. So-called “fire call” systems are useful for providing an emergency exception: again, this is a thin technological wrapper around a policy problem.

Full-scale auditing, whether of logs in large clusters or of file system changes on any system, is often unworkable. While some level of auditing may be useful for “covering” and protecting your systems, the truth is that it’s not possible to fully audit production and test systems. In light of this, the most important aspects of maintaining sane deployment policies and practices are (in descending order):

  1. Make testing infrastructure and systems available and easy to use.

    It’s difficult to test effectively if there aren’t properly configured test machines. Furthermore, developers and administrators are unlikely to test effectively if the testing system is difficult to use.

  2. Make sure that testing environments resemble production systems to the greatest extent possible.

    The greater the differences between the test environment and the production environment, the less effective the test environment becomes at predicting what will happen in production.

  3. Automate testing.

    For important components, use automated testing methods, either with continuous integration systems or by some other means, to ensure that most routine testing is ongoing and does not require active developer initiative.

  4. Create and test rollbacks, to ensure that even if an update does not go as planned, it’s possible to return to a known working state.

  5. Limit changes to production systems.

    Use access control systems and monitoring tools to ensure that production and testing systems remain consistent and don’t drift from each other.