The Virtues of Monitoring
A great explanation of the different levels of monitoring you could (and should) have in your application. (via Simon Willison)
Since being responsible for deployment, performance tuning, monitoring, and infrastructure has always been a part of many
of my jobs, I thought it was about time to put some of my thoughts and daily ops experience into a couple of articles. The
simple reason being that no matter how much you try, no matter how far away from dealing with servers you go (think
Heroku), there will always be infrastructure, and it will always affect you and your application in some way.
On today’s menu: monitoring. People have all kinds of different meanings for monitoring, and they’re all right, because
there is no one way to monitor your applications and infrastructure. I just did a recount, and there are no fewer than
six levels of detail you can and probably should get. Note that these are my own definitions. They aren’t necessarily
official names; they’re based solely on my experience. Let’s start from the top, the outside view of your application.
Availability is a simple measure to the user: either your site is available or it’s not. There is nothing in between.
When it’s slow, it’s not available. It’s a beautifully binary measure really. From your point of view, any component or
layer in your infrastructure could be the problem. The art is to quickly find out which one it is.
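To make the "slow means unavailable" idea concrete, here is a minimal sketch of how a probe result could be turned into a binary up/down answer. The `CheckResult` type, the function names and the two-second threshold are assumptions made for illustration, not part of any particular monitoring tool.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class CheckResult:
    """One probe of the site from the outside (hypothetical structure)."""
    status_code: int        # HTTP status returned by the probe
    elapsed_seconds: float  # how long the response took


def is_available(result: CheckResult, max_seconds: float = 2.0) -> bool:
    """Binary availability: an error or a slow response both count as down."""
    responded_ok = 200 <= result.status_code < 400
    fast_enough = result.elapsed_seconds <= max_seconds  # threshold is an assumption
    return responded_ok and fast_enough


def availability_ratio(results: Sequence[CheckResult], max_seconds: float = 2.0) -> float:
    """Fraction of probes that counted as available."""
    if not results:
        raise ValueError("need at least one check result")
    up = sum(1 for r in results if is_available(r, max_seconds))
    return up / len(results)
```

The point of the sketch is only that latency and errors collapse into the same answer from the user's point of view. Finding out which layer caused it is a separate job.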
Business metrics aren’t necessarily related to your application’s or infrastructure’s availability; they’re more along the
lines of what your users are doing right now, or have done over the last month. Think number of new users per day,
number of sales in the last hour, or, in our case, number of EC2 instances running at any minute. Stuff like Google
Analytics or click paths (using tools like Hummingbird, for example) generally also fall into this category.
Digging deeper from the outsider’s view, you want to be able to track what’s going on inside of your application right
now. What are the main entry points? Which database queries are involved? Where are the hot spots, which queries are
slow, and what kinds of errors is your application causing? Those are just a few of the questions.
Going deeper (closer to inception than you think) from the application level we reach the processes that serve your
application. Application servers, databases, web servers, background processing, they all need a process to be
Another step down from processes we reach the system itself. CPU and memory usage, load average, disk I/O, and network
traffic are all traditional metrics collected on this level. The tools (both commercial and open source) in this area
are too numerous to count. In the open source world, the main means of visualizing these kinds of metrics is rrdtool. Many tools
use it to graph data and to keep an aggregated data history around, using averages for hours, days or weeks to store the
The beauty of these tools is that you can throw any metric you can think of at them. They can even be used to collect
business level data, utilizing the existing graphing and even alerting capabilities.
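The idea of keeping an aggregated history as averages can be sketched in a few lines. This is not how rrdtool itself is implemented; it is a simplified illustration that averages raw `(timestamp, value)` samples into fixed-width time buckets. The `consolidate` function and its parameters are made up for this example.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def consolidate(samples: Iterable[Tuple[int, float]], bucket_seconds: int) -> Dict[int, float]:
    """Average (unix timestamp, value) samples into fixed-width time buckets.

    Returns a dict mapping each bucket's start timestamp to the mean of the
    samples that fell into it.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        bucket_start = timestamp - (timestamp % bucket_seconds)
        buckets[bucket_start].append(value)
    return {start: mean(values) for start, values in sorted(buckets.items())}


# The same function works for any metric: load average, or sales per minute.
load_samples = [(0, 1.0), (60, 3.0), (3600, 10.0), (3660, 20.0)]
hourly = consolidate(load_samples, bucket_seconds=3600)
```

Running the same samples through a larger bucket (a day, a week) gives the coarser views, at the cost of losing short spikes in the averages.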
The much-dreaded log file won’t go out of style for a long time, that’s for sure. Your web server, your database, your
Rails application, your application server, your mail server: all of them dump more or less useful information into log
files. They’re usually the most immediate and up-to-date view of what’s going on in your application, if you choose to
actually log something. Rails applications traditionally seem to be less of a candidate here, but your background
services sure are, as is any other service running on your servers. The log is the first to know when there are problems
delivering email or when your web server is returning an unexpected number of 500 errors.
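As a rough illustration of spotting an unexpected number of 500 errors, the sketch below counts status codes in web server access log lines. It assumes lines in the common log format, where the status code follows the quoted request. The function names are hypothetical, and a real log may need a different pattern.

```python
import re
from collections import Counter
from typing import Iterable

# Assumes common log format: the status code directly follows the quoted request.
STATUS_PATTERN = re.compile(r'" (\d{3}) ')


def count_statuses(lines: Iterable[str]) -> Counter:
    """Count HTTP status codes found in access log lines; unparseable lines are skipped."""
    counts = Counter()
    for line in lines:
        match = STATUS_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def server_error_rate(lines: Iterable[str]) -> float:
    """Fraction of parsed requests that ended in a 5xx status."""
    counts = count_statuses(lines)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    errors = sum(n for status, n in counts.items() if status.startswith("5"))
    return errors / total
```

What counts as "unexpected" is up to you. Comparing the rate against what the same hour looked like yesterday is usually more telling than a fixed threshold.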
I’m not suggesting you need every single kind of logging, monitoring and metrics gathering mentioned here. There is,
however, one reason why you’ll eventually want to have most if not all of them: during any incident in your application or
infrastructure, you can correlate all the available data to find the real reason for downtime, a spike, slow
queries, or problems introduced by recent deployments.
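Correlating data around an incident can be as simple as lining up events from every source on one timeline. The following is a minimal sketch under the assumption that each source can give you `(unix timestamp, description)` pairs. The `events_near` function, the source names and the five-minute window are illustrative only.

```python
from typing import Dict, List, Tuple

Event = Tuple[int, str]  # (unix timestamp, description)


def events_near(
    incident_at: int,
    sources: Dict[str, List[Event]],
    window_seconds: int = 300,
) -> List[Tuple[int, str, str]]:
    """Collect events from all sources within the window around an incident.

    Returns (timestamp, source, description) tuples sorted by time, so a
    deployment, a load spike and an error burst show up next to each other.
    """
    nearby = []
    for source, events in sources.items():
        for timestamp, description in events:
            if abs(timestamp - incident_at) <= window_seconds:
                nearby.append((timestamp, source, description))
    return sorted(nearby)


# Hypothetical data from three of the levels described in this article.
sources = {
    "deploys": [(1000, "deployed new release")],
    "logs": [(1100, "burst of 500 errors"), (5000, "unrelated warning")],
    "system": [(1090, "load average spike")],
}
timeline = events_near(1120, sources)
```

The more levels you collect, the more sources you can feed into a timeline like this, which is the argument for having most of them.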