Fancy Graphs From Nagios Plugins

Everyone loves graphs. Our brains seem to be naturally tuned for looking at wavy lines on a screen. As humans we can instantly detect patterns and even make predictions that would beat some of the best machine learning algorithms. I’ve heard stories of people being employed solely to look at graphs. This is all good stuff, but the point of this post is to show how you can get the data in the first place from your Nagios plugin scripts. Think of this post as a follow-on from Beautiful Nagios Scripts.

Last time we wrote a plugin that did the same thing as check_http. This time we’ll take that same script and make it output something called ‘performance data’. This is defined in the glorious Nagios Plugin Development Guidelines.

Performance Data

For those who dislike the size 8 font of the dev guide I’ll paste the details here. Before, we ran our plugin and it printed OK to the screen. Now we’re going to add a pipe character after that, followed by some text in the following format:

'label'=value[UOM];[warn];[crit];[min];[max]

That may look a little scary, but when it comes down to it you could update your Nagios plugin to output something like OK | something=100;;;; and most graphing engines would draw a point at 100 every time the script runs.
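To make the "draw a point at 100" step a bit more concrete, here is a rough sketch of how a graphing tool might pull the number out of a line of plugin output. The function name parse_plugin_output is made up for this post, it is written for Python 3, and it deliberately ignores quoted labels that contain spaces. Treat it as an illustration rather than how Outlyer or any particular engine actually does it.

```python
import string

def parse_plugin_output(line):
    """Split one line of plugin output into status text and a dict of metrics.

    Hypothetical helper: a simplified reading of the perfdata format that
    does not handle labels containing spaces.
    """
    # Everything before the first pipe is the human-readable status text.
    status, _, perfdata = line.partition("|")
    metrics = {}
    for item in perfdata.split():
        label, _, rest = item.partition("=")
        # First field is value plus optional UOM; warn, crit, min, max follow.
        value = rest.split(";")[0]
        # Peel the unit off the end, e.g. "0.94s" becomes 0.94 and "s".
        number = value.rstrip(string.ascii_letters + "%")
        metrics[label.strip("'")] = (float(number), value[len(number):])
    return status.strip(), metrics

print(parse_plugin_output("OK | something=100;;;;"))
# → ('OK', {'something': (100.0, '')})
```

Each metric comes back as a (number, unit) pair, so a plugin that prints no unit at all still gives the grapher a point to plot.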

Outlyer certainly would because we support even the laziest of non-spec adhering scripts. We’ve stayed up late at night trying to guess what people will try to do on our app and still try to support some level of craziness.

But what do the other bits do? They define additional context around your data. The UOM, for instance, is the Unit of Measure. It supports time (in seconds and other variants), percentages, bytes (and bigger) and a continuous counter (I’ve never used this but it sounds fun). You can also choose to leave the UOM out completely, although if you do provide it you should end up with prettier graphs.

The other bits between the ; characters define the thresholds and the minimum and maximum values. You might want to plot an orange line on your graphs at the warn level and a red line at the crit level, then bound the lowest and highest possible values in order to keep a sane scale on your Y axis. That would give you a really nice graph that visually shows management when certain SLAs are being broken. These are all useful, but feel free to leave them out if they don’t make sense for your check.
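If you would rather not count semicolons by hand, a small helper can build the string for you. This is only a sketch in Python 3: format_perfdata and its parameter names are invented for this post, and the warn and crit numbers in the example are arbitrary.

```python
def format_perfdata(label, value, uom="", warn=None, crit=None,
                    minimum=None, maximum=None):
    """Build one label=value[UOM];[warn];[crit];[min];[max] item.

    Hypothetical helper, not part of Nagios or any library.
    """
    # The guidelines only require single quotes when the label has spaces.
    if " " in label:
        label = "'%s'" % label
    # Anything left unset becomes an empty field between the semicolons.
    fields = ["" if f is None else str(f)
              for f in (warn, crit, minimum, maximum)]
    return "%s=%s%s;%s" % (label, value, uom, ";".join(fields))

# Laziest possible output: just a label and a value.
print("OK | " + format_perfdata("something", 100))
# → OK | something=100;;;;

# With a unit, thresholds and a lower bound (example numbers only).
print("OK | " + format_perfdata("response_time", 0.94, "s",
                                warn=2, crit=5, minimum=0))
# → OK | response_time=0.94s;2;5;0;
```

The first call reproduces the bare-minimum output from earlier, and the second shows where the orange and red lines would come from.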

Updating our beautiful script

OK, perhaps it wasn’t all that beautiful, but I was always told that you’re supposed to be emotional in blogs. As a recap, we ended up with a Python and a Ruby script that could detect when one of my WordPress blog posts was down. The Python version looked like this:

#!/usr/bin/env python
import sys
import requests

try:
    check_url = 'http://blog.outlyer.com/2013/10/26/notes-from-monitorama-eu/'
    # .text is the decoded body, so the substring test works on Python 2 and 3
    html_content = requests.get(check_url).text

    if 'I recently attended Monitorama' in html_content:
        print("OK")
        sys.exit(0)  # 0 = OK
    else:
        print("FAIL! Content not found.")
        sys.exit(2)  # 2 = CRITICAL
except Exception as e:
    # Connection errors, timeouts and so on all count as a failure
    print("FAIL! %s" % e)
    sys.exit(2)

What we really want to do is change the line that prints OK so that it also displays the number of seconds it took to load the blog post. Since we already know the format, we may as well fill in that bit right away:

print("OK | response_time=%ss;;;;" % response_time)

That small change means that we now only need to set the variable response_time to however long it takes the page to load. Since we’re not monitoring anything critical like tremors in an earthquake detection system or anything medical that might endanger lives, we don’t need precise timing: record the time in seconds before the request is made, then again after, and the difference is how long it took. The requests library already keeps track of roughly this for us in the response’s elapsed attribute, so this is how it might look:

#!/usr/bin/env python
import sys
import requests

try:
    check_url = 'http://blog.outlyer.com/2013/10/26/notes-from-monitorama-eu/'
    # Keep the response object: it carries the timing as well as the body
    response = requests.get(check_url)
    html_content = response.text
    # elapsed is a timedelta that requests records for this request
    response_time = response.elapsed.total_seconds()

    if 'I recently attended Monitorama' in html_content:
        print("OK | response_time=%ss;;;;" % response_time)
        sys.exit(0)
    else:
        print("FAIL! Content not found.")
        sys.exit(2)
except Exception as e:
    print("FAIL! %s" % e)
    sys.exit(2)

That was pretty easy. With an extra bit of effort on top of the up / down script you had before, you now have response times, so you can see when your site (or in our case WordPress) is running slowly. For those interested, when I ran this the response times were all around the 0.9 second mark.

OK | response_time=0.940515995026s;;;;
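If whatever you are checking doesn’t hand you a timing for free, the before-and-after approach is easy enough to do by hand. Here is a minimal Python 3 sketch: timed and fake_fetch are names I’ve made up, and fake_fetch just sleeps for a moment as a stand-in for the real page request so the example runs without a network.

```python
import time

def timed(func, *args, **kwargs):
    """Call func and return (result, seconds taken). Hypothetical helper."""
    # monotonic() never jumps backwards, unlike the wall clock.
    start = time.monotonic()
    result = func(*args, **kwargs)
    return result, time.monotonic() - start

def fake_fetch():
    """Stand-in for the real page fetch (an assumption for this sketch)."""
    time.sleep(0.05)
    return "I recently attended Monitorama"

body, response_time = timed(fake_fetch)
if "I recently attended Monitorama" in body:
    print("OK | response_time=%ss;;;;" % response_time)
```

Swap fake_fetch for the real call and the rest of the plugin stays the same.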

Wrapping up

The more Nagios plugins you write in your preferred scripting language, the easier they become. Over time you can whip something like the above script up in a few minutes, and it provides another curvy line to look at when making decisions such as “should I stay on WordPress?”.

In subsequent blog posts I’ll pick a few real-world examples of things to measure and start writing some more advanced scripts. If you’re interested in a monitoring solution that lets you create Nagios plugins easily and quickly, that alerts you when your systems fail, and that draws you those pretty graphs from your performance data, please sign up at www.outlyer.com