Monitoring refers to the practice of regularly collecting data about your infrastructure in order to provide alerts of unplanned downtime, network intrusion, and resource saturation. Monitoring also makes operational practices auditable, which is useful in forensic investigations and for determining the root cause of errors. Monitoring provides the basis for the objective analysis of systems administration practices and IT in general.
Collecting these data presents its own set of technological problems, and general-purpose monitoring tools require a great deal of customization and configuration for most uses. At the same time, most specialized monitoring tools only collect certain types of data and must integrate into general-purpose systems. There are no easy answers to these issues.
This document provides an overview of the monitoring problem domain and an introduction to the core strategies and technologies. After reading this document, I hope that you will have a reasonable understanding of the most important issues and concerns that face monitoring administrators and users.
Monitoring applications and services are similar to other kinds of services and applications in terms of reliability and redundancy requirements. See “High(er) Availability Is a Hoax” for more background on the trade-offs between availability, performance, cost, and operational requirements. In some cases, a highly available, highly responsive monitoring system to track key production systems is absolutely required, but often monitoring systems have less significant requirements.
At the core, monitoring systems are simple tools that collect data generated by or collected from an existing system. Monitoring tools also include some analytic layer that condenses and correlates data. Monitoring and alerting are often addressed together because using the data collected by monitoring systems is one of the core applications of monitoring. Fundamentally, collecting and aggregating monitoring data is easy; interpreting the data and using monitoring systems is a much more complex and difficult project.
Monitoring systems have two general methods for collecting data. In “passive systems,” monitoring tools observe data created by the application and system under normal operation (i.e. log files, output, or messages from the application itself). By contrast, “active systems” use agents and tools that capture data, or collect data through a monitoring module integrated into the production system itself.
There are advantages and disadvantages to both passive and active monitoring methods, and the kind of monitoring tools and data collection method you choose depends heavily on the applications and environment in your deployment, as well as on your specific needs, use patterns, and operational requirements. This is true of most systems administration problems to some degree, but it is particularly true of monitoring systems.
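As a rough illustration of the difference, the following Python sketch contrasts the two collection methods. The function names, the `ERROR` marker, and the shape of the returned dictionary are all assumptions made for this example and do not come from any particular monitoring tool.

```python
import time
from typing import Callable, Dict, Iterable


def passive_error_count(log_lines: Iterable[str], marker: str = "ERROR") -> int:
    """Passive collection: observe output the application already produces.

    Counts the log lines that contain the (assumed) error marker.
    """
    return sum(1 for line in log_lines if marker in line)


def active_probe(check: Callable[[], bool]) -> Dict[str, object]:
    """Active collection: run a check against the system and record the result.

    `check` is any callable that returns True when the system is healthy.
    An exception raised by the check is treated as a failed probe.
    """
    started = time.monotonic()
    try:
        ok = bool(check())
    except Exception:
        ok = False
    return {"ok": ok, "elapsed_seconds": time.monotonic() - started}
```

The passive function adds no load to the monitored system but only sees what the application chooses to write; the active probe can measure things the application never logs, at the cost of running extra work against production.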
Consider the following concepts in monitoring administration.
Monitoring infrastructure should:
Failure of the monitored system should not cause a failure in the monitoring system. Simple redundancy and automatic fail-over are particularly important for monitoring systems, as it is important to “monitor the monitoring,” that is, to ensure that an inoperative monitoring system doesn’t generate false positives.
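One way to “monitor the monitoring” is a heartbeat check that runs on a system independent of the monitoring server. The following is a minimal sketch under that assumption; the function name and the five-minute default are hypothetical.

```python
from datetime import datetime, timedelta


def monitor_is_stale(last_heartbeat: datetime,
                     now: datetime,
                     max_age: timedelta = timedelta(minutes=5)) -> bool:
    """Return True if the monitoring server has not checked in recently.

    `last_heartbeat` is the time the monitoring server last recorded a
    heartbeat; `max_age` is an assumed tolerance, not a recommended value.
    """
    return now - last_heartbeat > max_age
```

A watcher that calls this on a schedule only needs to know where the heartbeat timestamp is stored, so it can stay small enough to run outside the main monitoring infrastructure.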
Monitoring systems often consist of a central monitoring server that collects and aggregates the data from a series of agent or “probe” systems that are part of the monitoring infrastructure. These agents collect the data from the monitored system, and allow for a layer of redundancy and operational flexibility within the monitoring system. Agents make it possible to provide reasonable distributed scale for monitoring systems.
Monitoring infrastructure requires redundancy and backup considerations. If you use your monitoring system for uptime monitoring, and the monitoring system goes down, it’s impossible to know what services are accessible. For transient outages, this isn’t a problem, but for longer site-wide infrastructure outages, having this level of monitoring is essential.
In some cases, reporting requirements demand “secondary” monitoring systems that fully replicate primary monitoring. More often, you can have just enough infrastructure to ensure that the primary monitoring node and other essential services are up, while retaining the core monitoring on primary. Some kinds of distributed architectures may also provide a necessary level of redundancy: rather than centralize all monitoring aggregation, have a collection of “probes” feed data into local collectors and data processing systems that do most of the work on a per-site basis while one or two “master systems” aggregate the data. Site-collectors can be redundant at whatever level your operational guidelines require.
As with all systems architecture, additional systems add complexity, which increases the chance of failure. Tiered and distributed approaches to deploying systems and solutions are often the most robust and reliable, but they are also the most difficult to configure and the most prone to error. While a distributed system may seem to solve most of your redundancy and recursive monitoring needs, there are enough hidden risks and complexities that you should avoid this kind of deployment unless absolutely necessary.
Alerts and notifications are the core uses of monitoring infrastructure, and likely the first or second service that you should configure in a new monitoring system. In most cases, you can summarize an alert as “when this metric passes outside of these boundaries make sure an administrator knows about it;” however, in practice there are many concerns and conditions that affect alert logic and behavior. Consider the following conditions and features:
Escalation. It’s not enough to simply send alerts: you need to ensure that someone acknowledges the alert and handles the recovery. Since people have “real lives,” and aren’t always on call, you need to be able to send an alert to someone on the front lines and, if they cannot respond, pass that alert on to someone else.
In some cases it’s possible to “fake” escalation with two alerts: send one message every minute to the front-line engineer once the system has been down for five minutes, and one message every minute to a second person once the system has been down for more than fifteen minutes. This gives the front-line engineer ten minutes to disable the alert or fix the system before “waking someone else up.” In most cases, the second person will never get called.
High “signal to noise” ratio. It’s possible to turn on alerts for many different metrics, but this has the effect of “spamming” administrators, and decreasing the relative (perceived) importance of any given alert. Unless every alert that an on-call administrator receives is crucial and actionable, the system is broken.
Some sort of on-call automation. Most systems have more than one administrator, and those administrators share some sort of on-call duty schedule.
Compatible with multiple contact methods. In many cases, email is the lingua franca for alert systems: it provides compatibility with SMS and BlackBerry/smartphones, and is incredibly portable.
You must consider the delay between sending an alert and an administrator receiving and being able to respond to that alert when choosing alert methods. It’s useful to be able to configure logic for when and where to send alerts on a per-user basis.
Configurable re-alerting. Depending on the service that the alert “covers,” an alert may need to be resent after a certain period of time if the metric remains outside of the threshold.
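Two of the features above, escalation and re-alerting, reduce to small pieces of decision logic. The sketch below is one hypothetical way to express them; the recipient labels, function names, and default intervals are assumptions for illustration, with the five and fifteen minute values taken from the “fake” escalation example.

```python
from datetime import datetime, timedelta
from typing import List, Optional


def recipients_for_outage(minutes_down: float,
                          first_alert_after: float = 5,
                          escalate_after: float = 15) -> List[str]:
    """Return who should be alerted for an outage of the given length.

    The front-line engineer is alerted once the outage reaches
    `first_alert_after` minutes; a second person is added only when the
    outage has lasted more than `escalate_after` minutes.
    """
    recipients = []
    if minutes_down >= first_alert_after:
        recipients.append("front-line")
    if minutes_down > escalate_after:
        recipients.append("secondary")
    return recipients


def should_resend(still_outside_threshold: bool,
                  last_sent: Optional[datetime],
                  now: datetime,
                  resend_every: timedelta) -> bool:
    """Decide whether to send (or resend) an alert for a metric.

    Nothing is sent once the metric is back inside its threshold. The
    first alert is sent immediately; later ones wait for `resend_every`.
    """
    if not still_outside_threshold:
        return False
    if last_sent is None:
        return True
    return now - last_sent >= resend_every
```

Keeping these decisions in plain functions, separate from the code that actually delivers messages, makes it easier to review the thresholds with the administrators who will receive the alerts.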
When deploying alerts, consult with administrators on error responses, handling strategies, and average recovery times. Ideally, alerts will cover their systems well enough that administrators have no need to routinely “check” a system covered by an alert.
This section identifies the leading open source monitoring tools and attempts to catalog their functionality and purpose. If you have infrastructure that you need to monitor, you should be familiar with these tools.
When choosing or deploying a monitoring solution, consider the following factors:
As in many domains, it’s possible to outsource monitoring to vendors who provide monitoring solutions as hosted services or as drop/plug-in appliances. While there are advantages and disadvantages to these and to conventional monitoring tools, outsourcing and “appliances” both release administrators from the additional burden of administering monitoring infrastructure, and both make a certain amount of operational sense.
It makes sense to outsource monitoring for a number of reasons, including:
Technically speaking, installing, configuring, and tuning monitoring to report useful information is a huge challenge. While some non-systems administrators can make use of monitoring technologies in various contexts, [1] unlike most applications that systems administrators must deploy and maintain, monitoring systems’ primary users are systems administrators.
The next challenge is figuring out how to use monitoring systems to better administer a group of systems. Monitoring makes it possible for a smaller number of systems administrators to administer a greater number of systems. [2] This section focuses on major applications for monitoring data, both in the automation of infrastructure and the analysis of data regarding that infrastructure.
[1] Alerts and notifications of various events have a number of different applications. Collected data can help justify budgets to people who aren’t involved in the administration (i.e. “business leaders”).
[2] Efficiency in systems administration almost never results in a decrease of employment for actual systems administrators, but rather in an ability for the existing or a modestly expanded workforce to manage expanded demands from business and developer units.
With the advent of cloud computing and automation surrounding virtualization, monitoring solutions can be used to underpin infrastructure deployment and management. The theory is, in essence, that as utilization varies between thresholds, additional capacity is automatically added and removed.
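The threshold idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the 80% and 40% thresholds, and the one-server-at-a-time step are all hypothetical, and a real deployment would also need the provisioning machinery that this sketch leaves out.

```python
def desired_servers(requests_per_second: float,
                    per_server_capacity: float,
                    current: int,
                    scale_up_at: float = 0.8,
                    scale_down_at: float = 0.4,
                    minimum: int = 1) -> int:
    """Return the server count suggested by the latest polled request rate.

    Utilization is the observed request rate divided by the total capacity
    of the servers currently running. One server is added above the upper
    threshold and one is removed below the lower threshold.
    """
    if current < 1:
        raise ValueError("current must be at least 1")
    utilization = requests_per_second / (per_server_capacity * current)
    if utilization > scale_up_at:
        return current + 1
    if utilization < scale_down_at and current > minimum:
        return current - 1
    return current
```

The gap between the two assumed thresholds is deliberate: with these defaults, removing a server never pushes utilization straight back over the scale-up threshold, which helps avoid adding and removing capacity on alternate polls.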
For example, if your application servers can effectively handle 1000 requests a second, you could trigger the following actions, using data polled every 5 or 10 minutes:
The truth is, however, that this is a poor example. Application servers are easy to relate to, but most administrators will be able to have long (and busy) careers for successful clients and never have a situation where they’ll need to use more than 6-12 application servers, or need to deploy new application servers more than once a week. In most cases, traffic is predictable enough that “auto-scaling” requires too much additional machinery for a relatively infrequent problem.
Nevertheless, there are parts of the above example that are useful for automating various kinds of capacity planning:
The process of developing automation around monitoring evolves from the same process as tuning and deploying alerts. While there are some detectable events that require human intervention, you can automate most human responses to any given alert. Keep track of how you and your team resolve alerts, and then attempt to automate these tasks as much as possible.
Note
For a lot of capacity/throughput related tasks, it’s often better to maintain specific stateful infrastructure for data persistence (i.e. databases), message bus/queuing systems, and automated tasks, but then do all the “work” of the application layer in completely stateless systems “hanging” off of the message queue or queues. Examples of this may be media transcoding for an image or video gallery, or catalog and/or order management for an e-commerce site. Queues keep application logic simple while reducing the need for stateful systems and synchronous operation.
Obviously, however, this is a fundamental application design problem and something that’s outside of the bounds of systems administration. In that light, while the above “auto-scaling” script seems frightful, in many cases administrators will have to develop solutions like this in situations where developers could improve software.
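The queue-based layout described in this note can be illustrated with Python’s standard `queue` and `threading` modules. This is a single-process sketch only: `transcode` is a hypothetical stand-in for real work, and a production system would use a separate message queue service with workers on separate hosts.

```python
import queue
import threading
from typing import List


def transcode(job: str) -> str:
    """Hypothetical stand-in for real work such as media transcoding."""
    return job.upper()


def worker(jobs: queue.Queue, results: queue.Queue) -> None:
    """Stateless worker: take a job, do the work, and keep nothing between jobs."""
    while True:
        job = jobs.get()
        if job is None:  # sentinel value: no more work for this worker
            break
        results.put(transcode(job))


def run(job_names: List[str], worker_count: int = 3) -> List[str]:
    """Push jobs onto a queue and let identical workers drain it."""
    jobs: queue.Queue = queue.Queue()
    results: queue.Queue = queue.Queue()
    threads = [threading.Thread(target=worker, args=(jobs, results))
               for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for name in job_names:
        jobs.put(name)
    for _ in threads:
        jobs.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
    return sorted(results.get() for _ in range(len(job_names)))
```

Because the workers hold no state, changing `worker_count` changes throughput without changing the application logic, which is the property that makes this layout easier to scale than stateful application servers.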
Monitoring systems are really just big data collection and aggregation frameworks. It’s important to have monitoring to track capacity usage and problems that can cause downtime, so that administrators can attend to these issues. However, when you have a system for collecting data and performing analysis there are other kinds of analysis that become quite easy. For example, you may:
Consider the following set of minimum operational requirements for a functioning monitoring system:
Without these features a monitoring system may not be worth the resources that it consumes.