Darrell Hudson

Clipping interesting articles & photos

The Virtues of Monitoring

A great explanation of the different levels of monitoring you could (and should) have in your application. (via Simon Willison)

Amplify’d from www.paperplanes.de

Since deployment, performance tuning, monitoring and infrastructure have always been a part of many of my
jobs, I thought it’d be about time to sprinkle some of my daily ops thoughts over a couple of articles. The
simple reason being that no matter how much you try, no matter how far away from dealing with servers you go (think
Heroku), there will always be infrastructure, and it will always affect you and your application in some way.

On today’s menu: monitoring. People have all kinds of different meanings for monitoring, and they’re all right, because
there is no one way to monitor your applications and infrastructure. I just did a recount, and there are no fewer than
six levels of detail you can and probably should get. Note that these are my definitions, they don’t necessarily have
to be officially named, they’re solely based on my experiences. Let’s start from the top, the outside view of your
application.

Availability Level

Availability is a simple measure to the user: either your site is available or it’s not. There is nothing in between.
When it’s slow, it’s not available. It’s a beautifully binary measure really. From your point of view, any component or
layer in your infrastructure could be the problem. The art is to quickly find out which one it is.
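
As a rough illustration of that binary view, here is a minimal sketch of an outside-in check in Python. The function names, the two-second threshold and the injectable `fetch` and `clock` parameters are assumptions made for this example, not something the original article prescribes. A slow answer is treated exactly like no answer at all.

```python
import time
import urllib.request


def _http_status(url, timeout):
    """Fetch the URL and return its HTTP status code (raises on 4xx/5xx)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def is_available(url, max_seconds=2.0, fetch=None, clock=time.monotonic):
    """Return True only if the URL answers successfully within max_seconds.

    `fetch` and `clock` are hypothetical hooks so the check can be tested
    without touching the network.
    """
    fetch = fetch or _http_status
    started = clock()
    try:
        status = fetch(url, max_seconds)
    except OSError:
        # Covers connection failures, timeouts and urllib's HTTP errors.
        return False
    elapsed = clock() - started
    # Slow counts as unavailable: there is nothing in between.
    return 200 <= status < 400 and elapsed <= max_seconds
```

Run from a machine outside your own infrastructure, a check like this only tells you that something is wrong, not which layer is at fault. That is what the deeper levels are for.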


Business Level

These aren’t necessarily metrics related to your application’s or infrastructure’s availability, they’re more along the
lines of what your users are doing right now, or have done over the last month. Think number of new users per day,
number of sales in the last hour, or, in our case, number of EC2 instances running at any minute. Stuff like Google
Analytics or click paths (using tools like Hummingbird, for example) in general also fall into this category.

Application Level

Digging deeper from the outsider’s view, you want to be able to track what’s going on inside of your application right
now. What are the main entry points, what are the database queries involved, where are the hot spots, which queries are
slow, what kinds of errors are being caused by your application, to name a few.
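
To make the idea of spotting slow queries and hot spots a little more concrete, here is a minimal sketch of hand-rolled timing instrumentation in Python. The `timed` helper, the `slow_operations` list and the half-second threshold are all hypothetical names and values chosen for illustration. Dedicated application monitoring tools collect far more than this.

```python
import time
from contextlib import contextmanager

# Hypothetical in-memory store; a real setup would ship this somewhere durable.
slow_operations = []


@contextmanager
def timed(name, threshold_seconds=0.5, clock=time.monotonic):
    """Record the wrapped block in slow_operations if it takes too long."""
    started = clock()
    try:
        yield
    finally:
        elapsed = clock() - started
        if elapsed >= threshold_seconds:
            slow_operations.append((name, elapsed))
```

Wrapping a database call in `with timed("load user orders"):` would then leave a trace whenever that call crosses the threshold, which is a starting point for finding the slow entry points.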

New Relic

Process Level

Going deeper (closer to inception than you think) from the application level we reach the processes that serve your
application. Application servers, databases, web servers, background processing, they all need a process to be

Infrastructure/Server Level

Another step down from processes we reach the system itself. CPU and memory usage, load average, disk I/O, network
traffic, are all traditional metrics collected on this level. The tools (both commercial and open source) in this area
can’t be counted. In the open source world, the main means to visualize these kinds of metrics is rrdtool. Many tools
use it to graph data and to keep an aggregated data history around, using averages for hours, days or weeks to store the
data efficiently.
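
The aggregation idea can be sketched in a few lines of Python. This is not rrdtool's API or its actual implementation, only an illustration of the principle under simplifying assumptions: raw samples are averaged into coarser points, and a fixed-size buffer drops the oldest point once it is full. The class and parameter names are made up for this example.

```python
from collections import deque


class RoundRobinArchive:
    """Keep a bounded history of averages over fixed-size groups of samples."""

    def __init__(self, samples_per_point, max_points):
        self.samples_per_point = samples_per_point
        # A deque with maxlen discards the oldest point when it overflows.
        self.points = deque(maxlen=max_points)
        self._pending = []

    def add(self, value):
        """Add one raw sample; emit an averaged point when a group is full."""
        self._pending.append(value)
        if len(self._pending) == self.samples_per_point:
            self.points.append(sum(self._pending) / len(self._pending))
            self._pending = []
```

With one archive averaging per hour and another per day, storage stays constant no matter how long the collection runs, at the cost of losing detail in older data.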


The beauty about these tools is that you can throw any metric at them you can think of. They can even be used to collect
business level data, utilizing the existing graphing and even alerting capabilities.

Log Files

The much dreaded log file won’t go out of style for a long time, that’s for sure. Your web server, your database, your
Rails application, your application server, your mail server, all of them dump more or less useful information into log
files. They’re usually the most immediate and up-to-date view of what’s going on in your application, if you chose to
actually log something. Rails applications traditionally seem to be less of a candidate here, but your background
services sure are, or any other service running on your servers. The log is the first to know when there are problems
delivering email or your web server is returning an unexpected number of 500 errors.
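
As a small example of getting that kind of signal out of a log, here is a sketch in Python that tallies HTTP status codes from web server access log lines. It assumes the common access log layout, where the status code follows the quoted request line. The function names and the pattern are assumptions for this illustration, and other log formats would need a different pattern.

```python
import re
from collections import Counter

# Assumes the status code directly follows the quoted request, e.g.
# ... "GET / HTTP/1.1" 500 1234
STATUS_PATTERN = re.compile(r'" (\d{3}) ')


def count_statuses(lines):
    """Return a Counter of HTTP status codes found in the given log lines."""
    counts = Counter()
    for line in lines:
        match = STATUS_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def server_error_ratio(lines):
    """Return the share of recognised requests that ended in a 5xx status."""
    counts = count_statuses(lines)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    errors = sum(n for status, n in counts.items() if status.startswith("5"))
    return errors / total
```

Feeding the last few minutes of an access log into `server_error_ratio` and alerting when the value jumps is one simple way to let the log tell you first.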

I’m not suggesting you need every single kind of logging, monitoring and metrics gathering mentioned here. There is
however one reason why eventually you’ll want to have most if not all of them. At any incident in your application or
infrastructure, you can correlate all the available data to find the real reason for a downtime, a spike or slow
queries, or problems introduced by recent deployments.
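
A very small sketch of that kind of correlation, in Python: given the time of an incident and a list of timestamped events gathered from the other levels (deployments, restarts, alerts), list everything that happened shortly before. The function name, the event tuples and the 30-minute window are hypothetical choices for this example.

```python
from datetime import datetime, timedelta


def events_before(incident_time, events, window=timedelta(minutes=30)):
    """Return (timestamp, description) events in the window before an incident.

    `events` is an iterable of (datetime, str) pairs from any source.
    """
    earliest = incident_time - window
    return [
        (when, what)
        for when, what in sorted(events)
        if earliest <= when <= incident_time
    ]
```

For instance, calling it with the start of an error spike and your deployment history shows at a glance whether a release went out just before things went wrong.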

Read more at www.paperplanes.de


