Backup Strategies for Cautious Cyborgs

This document provides an overview of backup strategies, methods of backup organization, and practices that ensure your data and configuration are safe in the case of a significant data loss event. While the examples that follow center on Unix-like (e.g. Linux) systems and server environments, the higher level strategies are more broadly applicable.

Background

Key Concepts

Backups are important for maintaining continuity and guarding against certain classes of hardware failures, service disruptions, and user errors. Even though many speak of “backups” as one thing, the truth is that there are many kinds of backups that serve a number of different ends, and no single backup solution can satisfy all possible needs. If you don’t have a backup strategy at the moment, craft something: draw on your existing experience, your knowledge of your system, your budget, and the material introduced in this document to develop a strategy for backing up your deployment.

Backup System Requirements

In order to have value, there are certain properties that backup systems must possess. They are:

  • tested. Regardless of your backup strategy, you must regularly test your backups to ensure that they work and that you can restore data from them; otherwise your backups are as good as worthless. There are many ways to test your backups, but at a minimum you must test them by attempting to restore from them.
  • automated. Backups that you have to think about are backups that you will forget to do, backups that won’t get done when you’re on vacation, or backups that are inconsistent because you enter the command differently every week. Automating backups means they happen regularly and consistently. It’s important to check automated backups to ensure that they’re restorable, but there’s no reason you need to run them yourself. (A minimal scheduling sketch follows this list.)
  • efficient. Backups are incredibly crucial, and almost nobody, not even the best administrators, backs up properly. Design your backup system to back up only what you need and what is essential to restore your systems. Find ways to store backups so that you can minimize the time that they spend in transit without sacrificing geographic distribution. Additionally, push yourself to dedicate as few resources to the backups as possible without sacrificing functionality. There are many competing demands here, but backups, while important, shouldn’t get in the way of actual work.
  • secure. You wouldn’t store your production data in an unsecured environment, so don’t store your backups in environments with less security than your production systems. If you can restore systems using these backups, chances are other people can as well.
  • redundant. Make sure that you have a redundant backup plan. If one system fails and needs to restore from a backup, it’s quite possible that another system will fail as well, and storing data in multiple locations can help protect you in these situations.
  • consistent. Consistent state is important for quality backups. Ensure that all data is fully flushed to disk, and that you can capture all parts of the file, group of files, or file system at the same moment. Because of the constraints of input/output systems and backup utilities, it’s actually exceedingly difficult to capture a backup of an active system at a single moment in time. In many cases the amount of time that the backup operation takes to run doesn’t matter, but in some, like databases, it’s nearly impossible to create a good consistent backup without disrupting the system.
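
As a minimal sketch of the automation point above, a scheduled job is usually all the machinery you need to make backups happen without intervention. The script path, schedule, and log location below are assumptions for illustration, not part of any particular system.

# /etc/cron.d/nightly-backup -- run the backup script at 02:30 every night
# as root and keep a log of each run; the paths are placeholder assumptions.
30 2 * * * root /usr/local/sbin/run-backups.sh >> /var/log/run-backups.log 2>&1

Checking the log, and better, attempting a restore, is still on you; automation only removes the need to remember to start the job.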

Hardware Concerns

Most of the backup strategies and tactics discussed in this article will address data loss from the perspective of file loss and corruption. Hardware redundancy is also important. Having spare components near where your servers are (or might be) and extra capacity in case of a demand spike or unexpected hardware failure is essential to being able to guarantee up-time. This infrastructure needs to be regularly tested and monitored, as if it were production. For less drastic changes these systems may also function as a development or test environment as long as you can roll back to known good states quickly.

Because of obsolescence cycles it doesn’t make sense to get too far ahead of yourself here. I keep two laptops/personal systems running at any time, because it’s crucial that I have something that works nearby. I don’t travel with both laptops, but it means that most of the time I’m within an hour of being up and running on the spare machine. I don’t keep spare server infrastructure around, because I’ve judged the chance of loss there to be pretty low, and even in the worst possible case, I can be up and running an hour after I notice the services are down. Not too bad.

Underlying Backup Tools

Backup tools are often quite mature, stable, and well tested, because reliability and stability are fundamental to backups. If you’re not familiar with the following tools, you ought to be:

  • tar is one of the canonical UNIX utilities. It’s an archiving utility that takes a bunch of files, concatenates them in a specific way, and stores them in a single file. GNU tar is a particularly featureful implementation, and is often used in conjunction with a compression tool like gzip, bzip2, or xz to create “tarballs,” or compressed archives. If you need to do any kind of file level backup or archiving, you should definitely be familiar with these tools.

    The fact that the compression tools and the archiving tools are separate is key. There are cases where files are already as compressed as they’re going to get, and there’s nothing gzip can do to make the file smaller, so it’s not worth the processor overhead to use compression; log files, for example, are often gzipped during log rotation. The independence of these two functions means that you can use shell redirection and the pipe (i.e. “|”) to control and modify output in a granular way. This is particularly powerful when combined with ssh to stream data to programs on remote systems. (See the pipeline sketch after this list.)

  • rsync is perhaps one of the most important things to emerge from the samba project, and it’s a staple of backup solutions. In short, rsync copies files in (nearly) the most efficient way possible: it examines the files at the destination and only copies over the data that has changed. It’s good for incremental backups of all sorts, and for all sorts of basic file synchronization tasks.

    Note

    It should go without saying, but rsync on its own is not ideal for whole-system backups, because without extra options it doesn’t handle device nodes and some other file system objects correctly.

  • git isn’t a backup tool, exactly. Git is a version control system with some features that make it a good de facto backup tool for some kinds of files. Git’s storage engine is incredibly efficient and it’s easy to replicate git repositories on remote systems. Resist the temptation to do “everything possible” with git.

  • RAID and Storage Infrastructure. These aren’t backup tools as much as they’re mechanisms for redundancy. RAID adds redundancy to disk storage so that data won’t be lost if a disk fails. In some configurations, RAID can also increase the throughput performance of the disks.

    RAID alone is not a sufficient backup strategy, but RAID as a part of a redundant and highly available storage infrastructure is a necessary part of a fully developed backup and disaster prevention/recovery strategy.

  • LVM and some file systems provide a block-level snapshotting tool. Using copy-on-write, these tools can capture the state of a “disk” at a particular moment, and by using pointers to the original data, they only require space equivalent to the changes to the file system. These tools are ideal for cases when you need to capture a consistent state of an entire file system at a given moment. Use them to back up entire systems, or database servers’ data storage, or for system migrations, but avoid them for trivial file backups.

    Note

    Backing up a file system snapshot captures the entire underlying block device, so you will end up backing up blank space unless you take special precautions. Use compression (e.g. gzip) and sparse-file handling where your tools support it to avoid storing blank space.

    While snapshots are functioning file systems in their own right, don’t treat them as backups in place: copy the data off of snapshots onto distinct systems.

  • rdiff-backup (link <rdiff-backup>) is basically rsync on steroids, and makes it possible to capture incremental backups of data simply and effectively. It’s particularly effective for binary data (like images) that cannot be effectively backed up using source control systems.

  • Obnam (link <obnam>) is an integrated backup solution that provides deduplication, encryption, file-level snapshotting, and a number of operational possibilities. If you have a good idea of your backup needs, and no particular interest in developing a system yourself, look at Obnam.

There are many possible backup tools and technologies, many of which depend on some combination of the above components. When designing your own backup system, determine your requirements and build a solution using a collection of tools, which may include some of the above.
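
To make the pipeline point above concrete, here is a minimal sketch of streaming a compressed archive to another machine over ssh. The directory, the host name “backuphost”, and the destination paths are assumptions for illustration, not recommendations about what you should back up.

# Archive /srv/app, compress it, and stream the result to a remote host over
# ssh; "backuphost" and the paths are placeholder assumptions.
tar -czf - /srv/app | ssh backuphost "cat > /backups/app-$(date +%F).tar.gz"

# Restore by reversing the pipeline: stream the archive back and unpack it
# relative to / (substitute the archive name you actually created).
ssh backuphost "cat /backups/app-2024-05-01.tar.gz" | tar -xzf - -C /

Because archiving and compression are separate programs, you can swap gzip for xz, or drop compression entirely, without changing the rest of the pipeline.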

Redundancy

“Database Capacity and Scaling” covers various strategies for database-level redundancy. While it’s always a good idea to keep backups of known good states to protect against situations where an error, defect, or mistake propagates across an entire cluster of systems, in many cases, if you can recreate or rebuild a server or instance from another instance or from a collection of scripts, keeping an actual backup of the files or bit-for-bit data is less relevant.

Similarly, look to “High(er) Availability Is a Hoax” and think about backups as existing on a continuum with fault tolerance and redundancy, and consider your solutions to these problems as a whole rather than as two or three separate problems. By looking to address these problems together you will almost certainly save energy and probably some base cost as well.

Backup Types

This section provides an overview of the different kinds of data and systems that backups should support, as well as the unique concerns that affect taking quality backups of each type of resource.

Application Data

Application data is the data specific to a piece of software that persists across sessions and is not provided by the installation. Applications store data in specific formats, and expect the data to be intact in order to function properly. Think about the content of a database-backed CMS, the bugs/issues in a bug tracker, an email client’s mail, or all of the data stored by your web browser.

In most cases, when you use the application, you’re not interacting with the data directly. Applications store data in files and in databases, and you should take backups directly from that storage system. However, some tools provide import-and-export facilities that you might want to test. In any case, the method you use to back up your software is likely different for every application.
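
As an illustration of taking application data directly from the storage layer, here is a minimal sketch of dumping and restoring a PostgreSQL database with its standard client tools; the database name “appdb” and the file paths are assumptions for the example, and other databases have analogous utilities.

# Dump the application's database to a custom-format archive.
# "appdb" and the output path are placeholder assumptions.
pg_dump --format=custom --file=/backups/appdb.dump appdb

# Restore into a clean database to prove the backup is actually usable.
createdb appdb_restore_test
pg_restore --dbname=appdb_restore_test /backups/appdb.dump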

Testing is important in every backup context, but particularly so for application data. Ensure that, in a clean environment, you can restore all functionality using only the data captured in your backup.

File Data

File data, or unstructured data, is all of the stuff that sits in files on your file system. For me that’s text files, music, spreadsheets/Office documents, and an assortment of PDFs and (in time) EPUBs. Individuals often have a lot of file data, but most systems deployments have a relatively small amount of this kind of data. Usage is generally uneven: a small subset of files changes regularly, and most of the files are pretty static.

The key to successful file backups is making sure that files exist in multiple locations (i.e. on multiple systems) and making sure that you’re not wasting space by backing up the same files again and again. The main complication is that file backups generally need to be more-or-less accessible in their backed-up state: the use cases for file level restores revolve around finding a few files or a few missing lines in a file. Full system restores don’t capture this kind of granularity well, and if you back up files incrementally and regularly, full-system backups can run less often.

File level backups are mostly an organizational and workflow problem. Organize your files well, version things effectively and consistently, and figure out a way to avoid having lots of duplicate files, old versions, and other cruft. Unfortunately the best strategy depends on the files and the character of your work.

For most of the files on servers, and for configuration files in general, just use git or some other source control system. Git may make sense for your personal files as well, but if you have a lot of files that git doesn’t store efficiently, use rsync.
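
As a minimal sketch of the rsync approach for file data, the following mirrors a home directory to a remote system; the host name “backuphost”, the destination path, and the exclude pattern are assumptions for illustration.

# Mirror the home directory to a remote host, skipping cache churn.
# "backuphost", the destination path, and the exclude are placeholder assumptions.
rsync -az --delete --exclude '.cache/' ~/ backuphost:/backups/home-$(hostname)/

The --delete flag keeps the copy an exact mirror; drop it if you want the destination to retain files you’ve removed locally.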

Configuration Data

Properly speaking, configuration data is a subset of the other types of data: usually applications and systems store their configuration and settings in files, though there are cases where applications store configuration information in the main application database, where it’s inseparable from application data.

Conceptually, though, configuration data is its own category. You, or your system, should audit and record every change to a configuration field or option. This is good security practice, and a lifesaver when a configuration change affects service and you have to roll back to a previous state. Having good configuration backups also makes it easy to deploy and redeploy new systems based on existing configurations, in less time and with less effort and memory required.
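
One common way to get that kind of audit trail, sketched here under the assumption that your configuration lives in /etc and that working as root is acceptable, is to keep the configuration directory in a git repository and commit every change (tools like etckeeper wrap this pattern); the paths and commit messages are illustrative.

# Put /etc under version control so every configuration change is recorded
# and can be rolled back; run as root.
cd /etc
git init
git add .
git commit -m "initial configuration snapshot"

# After any change, record it:
git add -A
git commit -m "describe the configuration change here"

Be careful with permissions and secrets: a repository in /etc should be readable only by root, and replicating it off-host copies any credentials stored in configuration files.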

Like file data and application data, every application is a bit different, but consider the following recommendations:

  • Assume that you’ll be running every application on more than one system and that the environments won’t be identical. Attempt to configure your systems so that “general” configuration and machine-specific configuration are kept separate, then store and back up those data separately.

    Use file includes or, more complexly, a templating system to generate machine-specific configuration files. Database-stored configuration is more difficult, but you can script it using common interfaces.

  • Thoroughly test accessing backup states and restoring from them. It’s easy to back configuration data up, but restoring it can be much more complex.

    Once you figure out how to back things up and restore them, automate the restoration process. For example, create a makefile and some helper scripts so that you can update your backup repository or download a tarball, run “make apply-config”, and be totally restored on any machine. Also make sure that you can run “make backup-config” to do all of the copying and processing into your git repository or other backup system. (A hedged sketch of what those targets might wrap follows this list.)

  • Consider using something like puppet or Chef to manage systems configuration deployments. See “Cloud Computing, Virtualization, and Automation” for more about system automation and configuration management.
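
As a minimal sketch of what the “backup-config” and “apply-config” targets described above might wrap, the commands below copy a handful of configuration files into a git repository and back out again; every path here is a placeholder assumption, and the repository is assumed to already exist (created with “git init”).

# What "make backup-config" might wrap: copy tracked configuration files into
# the repository (preserving their full paths) and record the change.
rsync -aR ~/.bashrc ~/.ssh/config /etc/nginx/nginx.conf ~/config-backup/
cd ~/config-backup && git add -A && git commit -m "config backup $(date +%F)"

# What "make apply-config" might wrap: copy the saved files back into place
# on a fresh machine (restoring paths under / typically requires root).
rsync -a --exclude '.git' ~/config-backup/ /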

Systems Data

Systems data is everything “backupable” on the system. Create systems backups so that, if a disk dies, or is corrupted, or you accidentally run “rm -rf /”, you’ll be able to restore the system quickly. Knowing that you could reconstruct a running system from your configuration, application, and file data backups isn’t enough: rebuilding a system always takes a while and is error prone, which is not what you want to be doing as crucial services are falling over. Systems backups aren’t effective or convenient for any kind of incremental backup, and it’s difficult to use most whole-system backups to recover any specific file.

Traditionally, the best way to do system-wide backups is to use disk snapshotting and then copy the snapshot to a distinct physical location. These backups may be unwieldy, but if you have good backups of your other application and configuration data, you can take system-level backups infrequently: every week, every month, or just when something significant changes.

A few years ago, “take disk image backups” would have entirely addressed the topic of systems backups.

More recently, other strategies have become more prevalent. As part of the “cloud computing” ethos, deployments have started having larger and larger numbers of smaller (virtualized) instances. Disk snapshots are effective for a small number of distinct systems, but in more typical cloud environments they’re difficult to manage, so in these situations administrators are using other strategies. Basically this boils down to:

  1. using machine templates. Essentially, rather than backing up all your machines, deploy infrastructure in such a way that you can rebuild all your machines from a single backup.

    If your application runs in a multi-tier clustered environment, you can use sibling machines as templates. If you store most application and configuration data on separate devices, and then mount those devices within the host system, you may be able to make this effective in smaller environments as well.

  2. using deployment scripts or configuration management tools, like Chef or Puppet, to automatically recreate and deploy your systems from minimal installations. This means a small amount of overhead and initial setup, but for deployments of more than a few machines (and definitely more than a dozen), this is the preferable option for quick system restoration.

    If configuration management is your “systems backup story,” you should be doing all provisioning and deployment using configuration management.

Above all, the goal of system backups is to be able to restore systems from backup quickly when the original systems are unavailable or inoperable due to any number of root causes. Test your backup system against this requirement, and if your solution satisfies the operational requirements, that’s probably enough.

Managing Backup Costs

Like most classic information technology problems, the way to have better backups is typically to “throw money at the problem.” It’s true: more money means more storage for keeping backups, greater redundancy, and a better ratio between “amount of work” and “number of systems,” so that backup operations have a smaller impact on performance and any failure impacts a smaller portion of the system.

However, backups don’t need to be expensive, or out of reach for common computer users. Every system that saves state locally (including configuration data) and does something of value should have some sort of disaster recovery and backup plan. In light of this near universal requirement, consider the following actions you can take to make backups more cost efficient:

  • Use compression whenever possible. Compress data before you transmit it (you can configure SSH to compress all of its traffic), and compress data before you store it: basic compression exchanges some CPU time for smaller files, and there are tools for reading the contents of compressed files as if they were uncompressed (e.g. zcat, zless, and zgrep; emacs will also transparently open and save gzipped files). The downsides: if you need to access a file regularly and your system is processor bound, compression can impact performance, and for many types of data compression saves less space than you’d hope. (A few illustrative commands follow this list.)
  • Minimize expensive transit when possible. Don’t needlessly copy data between systems, and use tools like rsync to minimize the amount of data in transit. If you can process, prune, and compress a backup on the host system before transmission without affecting performance, do that. It’s important to move data off of the host system, but transit takes time, and at scale bandwidth has real costs that may be worth mitigating.
  • Throw stale data away. There’s no use in keeping backup data that you’ll never use, never access, and never restore. The instinct, particularly for compulsive data storage administrators, is to retain everything. This sounds good, and if you keep everything you don’t need to worry as much about being selective, but it’s horribly inefficient, and a little bit of ingenuity as you begin to develop your backup strategy can save a lot in the long run.
  • Keep data well organized. Well organized data is free of gross duplicates, and old data is clearly archived and organized so that a backup of active data won’t capture older, archivable data. This mostly applies to file data, but it can also apply to application data: manage legacy data and ensure that application data is easily segregated.
  • Prioritize and triage backup requirements. Different kinds of data have different backup needs and varying levels of importance. Beyond a certain threshold, what backups really save is time spent in disaster recovery. If the cost of creating and maintaining a backup over the mean time between failures is more than the cost of recovering the systems and data, then that backup is not worthwhile. Often this judgment is somewhat subjective.
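
A few hedged examples of the compression and transit points above; the file names, directories, and the host name “backuphost” are assumptions, and the tools shown are the standard ones named in the list.

# Compress a rotated log before storing or shipping it.
gzip -9 /var/log/app.log.1

# Search the compressed file without uncompressing it on disk.
zgrep "ERROR" /var/log/app.log.1.gz

# Let rsync compress data in transit (-z) while copying only changed data.
rsync -az /srv/data/ backuphost:/backups/data/

# Or enable compression at the ssh layer (-C) for an arbitrary pipeline.
tar -cf - /srv/data | ssh -C backuphost "cat > /backups/data.tar"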

Backup System Architecture

There are too many different kinds of requirements for any one backup system to fulfill them all. Additionally, at its core, backup design is a practice of balancing the paranoia and knowledge that systems will fail and mistakes will cause data loss against the pragmatic limitations of a budget and the limited resources available for management and administration. Any backup system needs to:

  • store data in multiple sites, and ensure that a single event (e.g. fire, earthquake, flood, power outage, network outage) will not render all your backups and infrastructure inoperable.
  • automate the backup process so that backups proceed without administrator intervention. Also, it’s ideal to take backups during off hours, and automated backup routines mean administrators get to sleep more, which is a definite plus.
  • have redundancy. Don’t go wild or overboard with the recursion, but a backup plan without a backup of its own is itself a single point of failure. Avoid this.
  • automate the restoration process. Usually when you must restore from backups, it’s because something unfortunate has happened. In these circumstances, you don’t want the restoration process to require an administrator to babysit the process.

Backup Methodologies

A large part of figuring out how to back up your data and systems depends on knowing where and how your applications store data, in memory and on disk.

Disk Snapshots with LVM

If you’re not already managing your system’s disks with some sort of logical volume manager, consider it. Volume managers provide an abstraction layer for disks and disk images which allows you to move and resize logical volumes independently of physical disks. Volume managers also often have the facility to take snapshots [1], which capture the exact state of a volume in an instant, and that in turn makes quality backups possible.

[1] Linux’s LVM (i.e. LVM2) has the limitation that snapshots must reside in the same volume group as the original volume, which has some minor impact on space allocation. Read your underlying system’s documentation.

In general snapshots are preferable for use in backups because they allow you to capture the contents of an in-use file system in a single instant; while this allows you to produce largely consistent backups of running systems, these backups are not terribly useful if you need to restore a single file.

When you create LVM snapshots it’s crucial that you move the data off of the system holding the snapshot. While snapshots may be useful in cases where you want to briefly capture a point-in-time image of the file system, most backup applications require moving the snapshot’s contents into a different storage format. Use a procedure that resembles the following:

lvcreate --snapshot --name snap0 --size 1G vg0/db0
dd if=/dev/vg0/snap0 | gzip > snap-$(date +%s).img.gz

To restore this backup, reverse this process:

lvcreate --name db1 --size 10G vg0
gunzip -c snap.img.gz | dd of=/dev/vg0/db1

You can move the snapshot off of the system as part of this process by sending the compressed output of dd over SSH. Consider the following:

lvcreate --snapshot --name snap0 --size 1G vg0/db0
dd if=/dev/vg0/snap0 | gzip | ssh hostname "cat > snap.img.gz"

Reverse the procedure to restore as follows:

lvcreate --name db1 --size 10G vg0
ssh hostname "cat snap.img.gz" | gunzip | dd of=/dev/vg0/db1
mount /dev/vg0/db1 /mnt

Incremental Backups

Documenting specific routines is beyond the scope of this document, given the variety of possible options and environments. However, consider the following suggestions:

  1. Remember that running file systems and applications can change while the backup process runs, which can lead to inconsistent state and corrupt backups. Create backups of systems that are “frozen” to as great an extent as possible.

  2. Use rsync and rdiff-backup, as appropriate. For rsync, use whatever options you prefer, or use the following form (a hedged rdiff-backup sketch follows):

    rsync -curaz SOURCE DESTINATION
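
For rdiff-backup, a minimal sketch using its classic command-line form follows; the directories and the retention window are assumptions for illustration.

# Take an incremental backup of /srv/files; rdiff-backup keeps the current
# state plus reverse diffs for history. The paths are placeholder assumptions.
rdiff-backup /srv/files /backups/files

# Restore the directory as it was three days ago into a separate location.
rdiff-backup --restore-as-of 3D /backups/files /srv/files-restored

# Prune increments older than four weeks so the backup doesn't grow forever.
rdiff-backup --remove-older-than 4W /backups/files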
    

Lessons for Cyborgs

Backups are a daunting prospect, and because backups are about balancing risk and cost, it’s often the case that no system is as backed up as it needs to be or could be.

Like most systems administration work, the backups for any deployment or system are a perpetually developing process. Backup work is an important part of every disaster recovery plan. Rather than allow yourself to become paralyzed by fear, remember that any progress is better than no progress.