Infrastructure Monitoring for Everyone

Overview

Monitoring refers to the practice of regularly collecting data about your infrastructure in order to provide alerts for unplanned downtime, network intrusion, and resource saturation. Monitoring also makes operational practices auditable, which is useful in forensic investigations and for determining the root cause of errors. Monitoring provides the basis for the objective analysis of systems administration practices and IT in general.

Collecting these data presents its own set of technological problems, and general purpose monitoring tools require a great deal of customization and configuration for most uses. At the same time, most specialized monitoring tools only collect certain types of data and must integrate into general purpose systems. There are no easy answers to these issues.

This document provides an overview of the monitoring problem domain and an introduction to the core strategies and technologies. After reading this document, I hope that you will have a reasonable understanding of the most important issues and concerns that face monitoring administrators and users.

Background

Monitoring applications and services are similar to other kinds of services and applications in terms of reliability and redundancy requirements. See “High(er) Availability Is a Hoax” for more background on the trade-offs between availability, performance, cost, and operational requirements. In some cases, a highly available, highly responsive monitoring system to track key production systems is absolutely required, but often monitoring systems have less stringent requirements.

At the core, monitoring systems are simple tools that collect data generated by or collected from an existing system. Monitoring tools also include some analytic layer that condenses and correlates data. Monitoring and alerting are often addressed together because alerting is one of the core applications of the data that monitoring systems collect. Fundamentally, collecting and aggregating monitoring data is easy; interpreting the data and putting monitoring systems to use is a much more complex and difficult project.

Monitoring systems have two general methods for collecting data. “Passive systems” observe data that the application and system create under normal operation (i.e. logfiles, output, or messages from the application itself). By contrast, “active systems” use agents and tools that capture data directly, or a monitoring module integrated into the production system itself.

There are advantages and disadvantages to both passive and active monitoring methods, and the monitoring tools and data collection methods you choose depend heavily on the applications and environment in your deployment, as well as your specific needs, use patterns, and operational requirements. This is true of most systems administration problems to some degree, but it is particularly true of monitoring systems.
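
To make the distinction concrete, here is a minimal Python sketch of each approach; the host, port, and log path are hypothetical, and real collectors are considerably more robust:

    import socket
    import time

    def active_check(host="db.example.com", port=5432, timeout=2.0):
        """Active check: open a TCP connection to the monitored service and
        time it. This consumes resources on the monitored system each run."""
        start = time.monotonic()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return {"up": True, "latency": time.monotonic() - start}
        except OSError:
            return {"up": False, "latency": None}

    def passive_check(logfile="/var/log/app/error.log"):
        """Passive check: read data the application already produces; the
        monitored system does no additional work on the monitor's behalf."""
        try:
            with open(logfile) as f:
                return {"error_count": sum(1 for line in f if "ERROR" in line)}
        except FileNotFoundError:
            return {"error_count": None}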

Key Concepts

Consider the following concepts in monitoring administration.

active monitoring
Monitoring systems that collect data by directly interacting with the monitored systems. Administrators must consider the impact (i.e. cost) of the monitoring and weigh it against the value of the test itself. For example, an agent that tests the response time of a production database is typically an active test.
alert
A notification regarding an event captured by a monitoring system, produced when a data stream exceeds a pre-configured threshold. Alerts are often highly configurable and allow a variety of operational arrangements: monitoring systems can send alerts to different tiers of administrators, and different thresholds can trigger different kinds of alerts.
false negative
An event or error that a monitoring system fails to detect. Tests that are not sensitive enough to detect possible errors, tests that do not run at the right interval, and tests that miss errors of short duration all cause false negatives. False negatives are very serious and significantly impact the utility of a monitoring system.
false positive
An event or alert that lies beyond the monitoring threshold but does not indicate an operational issue. Monitoring infrastructure that is too sensitive or improperly configured causes these kinds of errors. Not only are false positives annoying, they decrease the effectiveness of other alerts, because users become more likely to dismiss alerts that are true positives. Even so, false negatives are a far more serious monitoring error.
hybrid monitoring
There is a class of monitoring collectors that falls somewhere between active and passive tools, particularly depending on your perspective. ICMP pings or sample page loads might fall into this category, though many may feel (strongly) that these hybrid methods are either active or passive. The final distinction is not particularly significant.
passive monitoring
Monitoring systems that collect data by reading data that the monitored system already generates. The system collects this data from logs or “traps,” or from messages that the monitored system sends to a passive data collection agent. The syslog system is an example of passive monitoring. Passive monitoring is significantly less resource intensive for the monitored system than other methods.
syslog
Refers to the standard logging format that originated with early BSD Unix utilities (i.e. sendmail) and was later generalized for all system logging. A number of additional tools adopted the syslog format for reporting and log analysis, and it is now a standard. These days, syslog is rather poorly utilized despite its ubiquity: many applications use their own logging systems, or use syslog and the syslog format in ways that go beyond the standard and the intention of the system.
threshold
A configured boundary outside of which administrators expect that a system cannot function properly. Thresholds must be “tuned” to prevent false positives and false negatives.
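
As an illustration of the passive pattern described above, an application can emit messages to the local syslog daemon, which a passive collector can then read without touching the application at all. A minimal Python sketch using the standard library; the socket path and program name are assumptions that vary by platform:

    import logging
    import logging.handlers

    # /dev/log is the usual syslog socket on Linux; other platforms differ
    # (e.g. macOS uses /var/run/syslog), so treat this path as an assumption.
    handler = logging.handlers.SysLogHandler(address="/dev/log")
    handler.setFormatter(logging.Formatter("myapp: %(levelname)s %(message)s"))

    logger = logging.getLogger("myapp")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    # This line ends up in the system log, where passive monitoring picks it up.
    logger.warning("replication lag exceeded 30 seconds")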

Deploying Monitoring

Monitoring infrastructure should:

  • run as separately as possible from production services.
  • not create a significant impact on the system that it’s monitoring.

Failure of the monitored system should not cause a failure in the monitoring system. Simple redundancy and automatic fail-over are particularly important for monitoring systems, as it is important to “monitor the monitoring,” or ensure that an inoperative monitoring system doesn’t silently stop alerting and produce false negatives.
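
One common way to “monitor the monitoring” is a heartbeat (or dead man’s switch): the monitoring system records a timestamp on every polling cycle, and a small watchdog on separate infrastructure raises an alarm when that timestamp goes stale. A minimal Python sketch; the heartbeat path, interval, and notification stub are hypothetical:

    import os
    import time

    HEARTBEAT = "/var/run/monitoring/heartbeat"   # hypothetical path
    MAX_AGE = 300                                  # seconds of silence before we assume failure

    def write_heartbeat():
        """Called by the monitoring system at the end of every polling cycle."""
        with open(HEARTBEAT, "w") as f:
            f.write(str(time.time()))

    def check_heartbeat():
        """Run periodically by an independent watchdog on separate infrastructure."""
        try:
            age = time.time() - os.path.getmtime(HEARTBEAT)
        except FileNotFoundError:
            age = float("inf")
        if age > MAX_AGE:
            notify_admin("monitoring system has been silent for %.0f seconds" % age)

    def notify_admin(message):
        print(message)   # stand-in for real email/SMS/pager integration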

See also

“High(er) Availability Is a Hoax”

Infrastructure and Architectures

Monitoring systems often consist of a central monitoring server that collects and aggregates the data from a series of agent or “probe” systems that are part of the monitoring infrastructure. These agents collect the data from the monitored system, and allow for a layer of redundancy and operational flexibility within the monitoring system. Agents make it possible to scale monitoring reasonably across a distributed environment.

Monitoring infrastructure requires redundancy and backup considerations. If you use your monitoring system for uptime monitoring and the monitoring system goes down, it’s impossible to know which services are accessible. For transient outages this isn’t a problem, but for longer or site-wide outages, having this level of monitoring is essential.

In some cases, reporting requirements demand “secondary” monitoring systems that fully replicate primary monitoring. More often, you can have just enough infrastructure to ensure that the primary monitoring node and other essential services are up, while retaining the core monitoring on the primary system. Some kinds of distributed architectures may also provide a necessary level of redundancy: rather than centralizing all monitoring aggregation, have a collection of “probes” feed data into local collectors and data processing systems that do most of the work on a per-site basis, while one or two “master” systems aggregate the data. Site collectors can be redundant at whatever level your operational guidelines require.

As with all systems architecture, additional systems add complexity, which increases the chance of failure. Tiered and distributed approaches to deploying systems and solutions are often the most robust and reliable, but they are also the most difficult to configure and the most prone to error. While a distributed system may seem to solve most of your redundancy and recursive monitoring needs, there are enough hidden risks and complexities that you should avoid this kind of deployment unless it is absolutely necessary.

Alerts and Notifications

Alerts and notifications are the core uses of monitoring infrastructure, and likely the first or second service that you should configure in a new monitoring system. In most cases, you can summarize an alert as “when this metric passes outside of these boundaries make sure an administrator knows about it;” however, in practice there are many concerns and conditions that affect alert logic and behavior. Consider the following conditions and features:

  • Escalation. It’s not enough to simply send alerts: you need to ensure that someone acknowledges the alert and handles the recovery. Since people have “real lives” and aren’t always on call, you need to be able to send an alert to someone on the front lines and, if they cannot respond, pass that alert on to someone else.

    In some cases it’s possible to “fake” escalation with two alerts: send one message every minute once the system has been down for five minutes, and a second message, to a different contact, every minute once the system has been down for more than fifteen minutes. This gives the front-line engineer ten minutes to disable the alert or fix the system before “waking someone else up.” In most cases, the second person will never get called. (A rough sketch of this logic appears after this list.)

  • High “signal to noise” ratio. It’s possible to turn on alerts for many different metrics, but this has the effect of “spamming” administrators and decreasing the relative (perceived) importance of any given alert. If the alerts an on-call administrator receives are not all crucial and actionable, then the system is broken.

  • Some sort of on-call automation. Most systems have more than one administrator and some form of on-call rotation, so alerts should route to whoever is currently on duty.

  • Compatibility with multiple contact methods. In many cases, email is the lingua franca for alert systems: it provides compatibility with SMS and Blackberry/smartphones, and is incredibly portable.

    When choosing alert methods, you must consider the delay between sending an alert and an administrator receiving and being able to respond to it. It’s also useful to be able to configure when and where to send alerts on a per-user basis.

  • Configurable re-alerting. Depending on the service that the alert “covers,” an alert may need to be resent after a certain period of time if the metric remains outside of the threshold.

  • When deploying alerts, consult with administrators on error responses, handling strategies, and average recovery times. Ideally, alerts will cover their systems well enough that administrators have no need to routinely “check” a system covered by an alert.
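
The two-alert escalation described above can be expressed as a small piece of threshold logic. A minimal Python sketch; the tiers, timings, and contact addresses are hypothetical, and a real system would also track acknowledgements:

    import time

    # Hypothetical escalation tiers: seconds of continuous downtime before each contact is paged.
    ESCALATION = [
        (5 * 60,  "oncall-engineer@example.com"),    # first alert after five minutes
        (15 * 60, "secondary-oncall@example.com"),   # escalate after fifteen minutes
    ]

    def contacts_to_alert(outage_started, already_notified=(), now=None):
        """Return the contacts who should be alerted for an ongoing outage."""
        now = now if now is not None else time.time()
        downtime = now - outage_started
        return [contact
                for threshold, contact in ESCALATION
                if downtime >= threshold and contact not in already_notified]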

Monitoring Tools

This section identifies the leading open source monitoring tools and attempts to catalog their functionality and purpose. If you have infrastructure that you need to monitor, you should be familiar with these tools.

Cacti
Cacti is a network traffic monitoring tool built on top of RRD. While Cacti is primarily used for collecting network utilization data, it can accept data by way of the SNMP protocol. Cacti focuses on collecting a large amount of data from a large number of hosts and aggregating that data into a single coherent interface. Cacti, thus, is a data collection and aggregation framework.
Monit
Monit monitors (and supervises) specific processes on Unix and Linux systems. Where other tools can provide data to answer a variety of different kinds of questions, Monit simply answers the question “is this process up?” Monit works by directly spawning the processes that it monitors (as the init process does on most UNIX systems), and is not distributed in normal operation. Such “uptime monitoring” is a very useful part of any deployment, but for critical infrastructure it’s important to collect additional data and monitor for additional infrastructure concerns (i.e. capacity and utilization) as well as larger trends and correlations.
Munin
Munin is a resource monitoring and data collection tool. It uses RRD to store and graph data. Munin can collect and display data from any kind of UNIX or UNIX-like host (including Mac OS X and Linux). Munin has no concept of a “threshold” or “alert,” but it can interact with other systems to provide this functionality. Munin operates with a “master” daemon that runs on one system and data collection nodes that must run on the monitored systems. While the “master” node only needs to run on one system in an environment, all monitored systems must run the “munin-node” process.
Nagios
Nagios is a generic monitoring framework that provides sophisticated alert, notification, and data collection facilities. With an extensive plugin framework, it’s possible to use Nagios to monitor virtually any kind of system or operation using either passive or active techniques. Nagios has a primary monitoring node that collects data from agents and processes that run in a more distributed manner.
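
Nagios checks are ordinary programs that report state through their exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and a one-line status message. A minimal Python sketch of a disk-usage check in this style; the mount point and thresholds are examples, not recommendations:

    #!/usr/bin/env python3
    import shutil
    import sys

    WARN_PCT, CRIT_PCT = 80, 90                  # example thresholds
    usage = shutil.disk_usage("/")               # check the root filesystem
    pct = usage.used / usage.total * 100

    if pct >= CRIT_PCT:
        print("DISK CRITICAL - %.0f%% used" % pct)
        sys.exit(2)
    elif pct >= WARN_PCT:
        print("DISK WARNING - %.0f%% used" % pct)
        sys.exit(1)
    else:
        print("DISK OK - %.0f%% used" % pct)
        sys.exit(0)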

When choosing or deploying a monitoring solution, consider the following factors:

  • How does the platform collect data and what impact does this collection method have on the performance of the monitored system?
  • How many systems can the solution monitor and what kinds of resources does the tool require to support this level of service?
  • How much logical, physical, and/or network separation can the monitoring application get from the monitored application?
  • Can the platform provide alerts and notifications or must it integrate with another solution?
  • What monitors the monitoring system?
  • What kinds of issues and errors will the solution detect, and what kinds of situations is the solution unable to detect? (Network related problems, for instance, are extremely difficult to detect and monitor because monitoring applications are themselves network dependent to some degree.)

Appliances and Hosted Services

As in many domains, it’s possible to outsource monitoring to vendors who provide monitoring solutions as hosted services or as drop/plug-in appliances. While there are advantages and disadvantages to these and to conventional monitoring tools, outsourcing and “appliances” both release administrators from the additional burden of administering monitoring infrastructure, and they make a certain amount of operational sense.

It makes sense to outsource monitoring for a number of reasons, including:

  • monitoring is mission critical, and if you’re working in a smaller organization, you’re probably not an expert at deploying monitoring tools, and you’re not in the business of monitoring (probably.)
  • monitoring systems ought to be distinct from the systems that they monitor. This allows the monitoring to remain operational throughout various service interruptions. This separation ought to cover both the actual infrastructure and the operation and maintenance of the production and monitoring systems.
  • doing monitoring right on your own can be quite expensive because the actual hardware, reliability, and data processing requirements are high, and a specialized monitoring vendor can often provide these services at a significant discount in both cost and time.

Feedback Loops

Technically speaking, installing, configuring, and tuning monitoring to report useful information is a huge challenge. While some people other than systems administrators can make use of monitoring technologies in various contexts, [1] monitoring systems are unusual among the applications that systems administrators deploy and maintain in that their primary users are systems administrators themselves.

The harder problem is figuring out how to use monitoring systems to better administer a group of systems. Monitoring makes it possible for a smaller number of systems administrators to administer a greater number of systems. [2] This section focuses on the major applications of monitoring data, both in the automation of infrastructure and in the analysis of data about that infrastructure.

[1]Alerts and notifications of various events have a number of different applications. Collected data can also help justify budgets to people who aren’t involved in the administration (i.e. “business leaders.”)
[2]Efficiency in systems administration almost never results in a decrease in employment for actual systems administrators, but rather in the ability of the existing, or a modestly expanded, workforce to manage expanded demands from business and developer units.

Automation

With the advent of cloud computing and the automation surrounding virtualization, monitoring solutions have come to underpin infrastructure deployment and management. The theory is, in essence, that as utilization crosses thresholds, additional capacity is automatically added or removed.

For example, if your application servers can effectively handle 1000 requests a second, you could trigger the following actions, using data polled every 5 or 10 minutes:

  • Add a node to the cluster when the average load equals 800 requests per second.
  • Remove a node from the cluster when the average load equals 400 requests per second.
  • Set a node to “not accept” new connections (in the load balancer) if it has more than 1100 connections per second.
  • Alert if more than 4-8 application servers are running on any given instance.
  • Log and restart/redeploy application servers that have frozen or are no longer running.
  • Never remove an instance if there are fewer than three nodes.
  • Notify administrators (and escalate) if the automated system adds more than three nodes within an hour (say) or four nodes within two hours, to guard against runaway costs and malicious traffic floods.
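
A rough sketch of how rules like these might translate into code; the cluster object, its methods, and every threshold here are hypothetical stand-ins for whatever API your environment actually exposes:

    # `cluster` is a hypothetical object exposing size(), add_node(), remove_node(),
    # drain(node), and notify(); the thresholds mirror the example rules above.
    ADD_AT, REMOVE_AT, DRAIN_AT = 800, 400, 1100    # requests per second
    MIN_NODES, MAX_ADDS_PER_HOUR = 3, 3

    def evaluate(cluster, avg_rps, per_node_rps, adds_last_hour):
        if avg_rps >= ADD_AT:
            if adds_last_hour >= MAX_ADDS_PER_HOUR:
                cluster.notify("refusing to add more nodes; possible runaway load")
            else:
                cluster.add_node()
        elif avg_rps <= REMOVE_AT and cluster.size() > MIN_NODES:
            cluster.remove_node()
        for node, rps in per_node_rps.items():
            if rps >= DRAIN_AT:
                cluster.drain(node)   # stop sending new connections to a saturated node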

The truth is, however, that this is a poor example. Application servers are easy to relate to, but most administrators will have long (and busy) careers with successful clients and never encounter a situation where they need more than 6-12 application servers, or need to deploy new application servers more than once a week. In most cases, traffic is predictable enough that “auto-scaling” requires too much additional machinery for a relatively infrequent problem.

Nevertheless, there are parts of the above example that are useful for automating various kinds of capacity planning:

  • Establish thresholds and alerts to detect when there is too much excess capacity as well as insufficient capacity. It’s easy to scale up in response to additional load, but comparatively difficult to scale down. Scaling down is the part of automation that actually saves money.
  • Have the monitoring system tweak the load balancing settings. If a node looks like it’s in trouble or might become saturated, start moving traffic away from it until blocking tasks complete and it has additional capacity. This kind of tweaking is inefficient if you’re a human because it amounts to endless “knob twiddling,” but can be useful when automated.
  • Ensure that changes in capacity happen gracefully. Add additional capacity before you need it, and remove capacity after you’re sure that you no longer need it to maintain the current service level.
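
One way to make “scale up early, scale down late” concrete is to react to the high-water mark immediately but require utilization to stay below the low-water mark for a sustained period before removing capacity. A minimal sketch with hypothetical watermarks and sample counts:

    from collections import deque

    HIGH, LOW = 0.75, 0.35      # hypothetical utilization watermarks
    SUSTAIN = 6                  # consecutive low samples required before scaling down

    recent = deque(maxlen=SUSTAIN)

    def decide(utilization):
        """Return 'up', 'down', or None for the latest utilization sample."""
        recent.append(utilization)
        if utilization >= HIGH:
            recent.clear()       # any spike resets the scale-down timer
            return "up"          # add capacity immediately
        if len(recent) == SUSTAIN and max(recent) < LOW:
            recent.clear()
            return "down"        # remove capacity only after sustained low utilization
        return None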

The process of developing automation around monitoring evolves from the same process as tuning and deploying alerts. While some detectable events require human intervention, you can automate most human responses to any given alert. Keep track of how you and your team resolve alerts, and then attempt to automate those tasks as much as possible.

Note

For a lot of capacity- and throughput-related tasks, it’s often better to maintain specific stateful infrastructure for data persistence (i.e. databases), message bus/queuing systems, and automated tasks, and then do all the “work” of the application layer in completely stateless systems “hanging” off of the message queue or queues. Examples of this might be media transcoding for an image or video gallery, or catalog and/or order management for an e-commerce site. Queues keep application logic simple while reducing the need for stateful systems and synchronous operation.
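
A toy illustration of the pattern, using Python’s in-process queue as a stand-in for a real message bus; the job names and the transcode function are purely hypothetical:

    import queue
    import threading

    jobs = queue.Queue()              # stand-in for a real message bus or queuing service

    def transcode(item):
        """Hypothetical unit of work, e.g. media transcoding for a gallery."""
        print("processing", item)

    def worker():
        # Stateless worker: everything it needs arrives in the job payload,
        # so workers can be added or removed freely as load changes.
        while True:
            item = jobs.get()
            if item is None:          # shutdown sentinel
                break
            transcode(item)

    for video in ["intro.mov", "demo.mov"]:
        jobs.put(video)
    jobs.put(None)                    # one sentinel per worker

    threading.Thread(target=worker).start()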

Obviously, however, this is a fundamental application design problem and something that’s outside the bounds of systems administration. In that light, while the above “auto-scaling” script seems frightful, in many cases administrators will have to develop solutions like this in situations where developers could have improved the software instead.

Analytics

Monitoring systems are really just big data collection and aggregation frameworks. It’s important to have monitoring to track capacity usage and problems that can cause downtime, so that administrators can attend to these issues. However, once you have a system for collecting data and performing analysis, other kinds of analysis become quite easy. For example, you may:

  • figure out what areas or aspects of the system produce errors or experience poor performance. You can then pass these findings as reports to the development teams. By integrating more closely with engineering teams, you can probably collect even more useful data.
  • identify trends in network usage and independently verify your providers’ services, particularly in a comparative context. This allows you to enter into contracts with more information and negotiate from a position of power.
  • correlate certain use patterns with each other, particularly regarding different aspects of a product. Use these data to suggest integration “higher up” in the engineering process. Systems administrators are often responsible for larger portions of a product than developers are, and can provide valuable feedback to engineering teams.

Monitoring Requirements

Consider the following set of minimum operational requirements for a functioning monitoring system:

  • Monitor everything. If it’s not important enough to monitor, the service or resource may not be providing value.
  • Monitor monitoring systems themselves.
  • Use different kinds of tests and tools to collect data to prevent measurement errors.
  • Ensure that alerts are useful and actionable by administrators.
  • Expect to spend time tuning and modifying monitoring frequency so that you’re not collecting or archiving too much data.

Without these features a monitoring system may not be worth the resources that it consumes.