In my last blog post, ‘Monitoring Riak’, I mentioned that I’d probably write about how to monitor RabbitMQ next. But that seemed very much like repetition and, quite frankly, it would probably have been boring. So I thought I’d write a higher-level post about collecting metrics in general.
Setting the scene
At Outlyer we always try to frame our conversations with people around 4 main areas:
- Collection
- Routing
- Visualisation
- Alerting
It’s our strategy to become the best in all of those areas. This is actually very similar to what Caskey from Google outlines in his great talk, although I’ve simplified it a bit: I disagree about the configuration management aspect, and from our perspective we use other software for the data storage tier (e.g. Riak), so other people worry about making that part awesome.
We believe that a good monitoring system pushes collection outside of its core. There are already tons of cool open-source tools that do collection really well: CollectD, Diamond, JMXTrans, even Nagios scripts with performance data. Then you have all of the StatsD client libraries that can send data about the performance of your app, and Bucky can even send metrics from the browser. Before you know it you have thousands of data points per second.
These metrics need to be streamed somewhere, and that somewhere needs to scale. For traditional CPU, disk, memory and load metrics you could get away with posting to a blocking API. But it’s 2014: we’re surrounded by awesome tools that let us stream thousands of metrics per second over UDP, and when you multiply that by hundreds of servers you quickly enter the world of big data. The parts designed to handle that stream of data are what we call the ‘routing’ components. Sensu does well here with its queue-based architecture and event handlers. Riemann is another awesome project that does similar things. A lot of people create multiple routes directly into backends like OpenTSDB or InfluxDB, but then have to worry about how they alert off that data.
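Conceptually, the routing tier is just a non-blocking ingest point feeding a processing pipeline. Here’s a minimal sketch of that shape in Python; the StatsD-style packet format and the in-process queue are stand-ins for illustration, not how Sensu or Riemann actually work:

```python
import queue
import socket
import threading

# Stand-in for the real processing pipeline (Sensu handlers,
# Riemann streams, a time-series database writer, ...).
metrics = queue.Queue()

def ingest(host="0.0.0.0", port=8125):
    """Accept StatsD-style 'name:value|type' packets over UDP.

    UDP keeps senders non-blocking: a slow consumer never backs
    up into the application being monitored."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    while True:
        data, _ = sock.recvfrom(65535)
        for line in data.decode("utf-8", "replace").splitlines():
            name, _, rest = line.partition(":")
            value, _, mtype = rest.partition("|")
            metrics.put((name, value, mtype))

threading.Thread(target=ingest, daemon=True).start()

# The 'routing' half: consume the stream and fan it out to the
# visualisation and alerting tiers (here we just print).
while True:
    print(metrics.get())
```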
At Outlyer we had the luxury of designing everything from the ground up, so we built a routing platform internally that scales horizontally and processes metrics as they come in, letting us visualise and alert off them in one pass. We got bored of building the spaghetti-mess Franken-monitor on previous monitoring projects, trying to join up things designed to process in real time with things designed to poll.
Visualisation and alerting are large subjects in their own right. Obviously, with hundreds of thousands of metric streams you start to encounter new problems, like it no longer being possible to actually look at every graph. We’ll do a blog post at some point on how we’re helping to solve some of those limitations, plus a discussion of baselining and metrics prediction. If you’re interested in this area, a good starting point is Etsy’s cool Kale stack. We’re doing some similar things to help people make sense of the huge volume of data.
Collection
Whether we should collect all of this data or not is a discussion for elsewhere. We obviously believe you want to collect everything possible, simply because of the number of times I’ve wanted to answer a question with data that wasn’t there. Turning on collection after the fact doesn’t help.
My personal opinion is that all collection tools should be open source and conform to some kind of standard. We’ve picked what we think are the three most prevalent standards (I’ve sketched the Graphite wire format just after the list):
- Nagios (with performance data)
- Graphite
- StatsD
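For a flavour of why these standards are so easy to love, Graphite’s plaintext protocol is about as simple as wire formats get: one ‘metric.path value timestamp’ line per datapoint, sent to Carbon on port 2003 by default. A minimal sketch (the hostname and metric path are made up):

```python
import socket
import time

# Carbon's plaintext listener; 2003 is the default port.
CARBON = ("graphite.example.com", 2003)  # hypothetical host

def send_metric(path, value):
    """Send one datapoint using Graphite's plaintext protocol:
    'metric.path value unix_timestamp\n', one line per datapoint."""
    line = "%s %f %d\n" % (path, value, int(time.time()))
    with socket.create_connection(CARBON) as sock:
        sock.sendall(line.encode("ascii"))

send_metric("servers.web01.cpu.load", 0.42)
```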
So, given we have picked those as our weapons of choice, how would we go about monitoring a server?
In our case we’d install the Outlyer agent, which runs Nagios check scripts. We’re going to open source this once we come out of closed beta. Other options would be to install the NRPE agent or the Sensu agent. Then go hunting around Nagios Exchange for cool scripts; there are literally thousands to choose from. Nagios scripts with performance data will probably give you a handful of useful graphs. To be honest, the point of Nagios scripts isn’t really metrics collection; they’re more about Boolean state changes, but more graphs never hurt anyone.
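If you’ve never written one, a Nagios check is just a script that prints a status line and exits 0, 1 or 2 for OK, WARNING or CRITICAL; anything after the `|` is performance data. A minimal disk check as a sketch (the thresholds are arbitrary):

```python
#!/usr/bin/env python3
import shutil
import sys

WARN, CRIT = 80, 90  # percent used; arbitrary thresholds

usage = shutil.disk_usage("/")
pct = usage.used * 100.0 / usage.total

# Nagios plugin convention: "STATUS - message | perfdata",
# where perfdata is label=value[unit];warn;crit.
perfdata = "used=%.1f%%;%d;%d" % (pct, WARN, CRIT)
if pct >= CRIT:
    print("DISK CRITICAL - %.1f%% used | %s" % (pct, perfdata))
    sys.exit(2)
elif pct >= WARN:
    print("DISK WARNING - %.1f%% used | %s" % (pct, perfdata))
    sys.exit(1)
print("DISK OK - %.1f%% used | %s" % (pct, perfdata))
sys.exit(0)
```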
Next we’d install something like CollectD (I like Diamond too, I just prefer CollectD). The default options will give you around 50 graphs covering a whole range of operating system counters. Then you can start enabling its vast array of plugins, and as soon as you do you’ll start to get hundreds of application metrics. For example, the previous blog post gave us 130 individual metrics for Riak alone.
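And when the bundled plugins run out, CollectD’s python plugin lets you write your own collectors. A minimal sketch of a custom read callback; the plugin name and metric are made up:

```python
# Runs inside collectd's embedded interpreter: load it with
# LoadPlugin python plus a <Plugin python> Import block in collectd.conf.
import collectd  # only importable inside collectd itself

def read(data=None):
    # Dispatch one made-up gauge; collectd fills in host and timestamp.
    v = collectd.Values(plugin="myapp", type="gauge",
                        type_instance="active_sessions")
    v.dispatch(values=[42])

collectd.register_read(read)
```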
We’re up to a few hundred graphs by now for this one server. What’s next? If you’re running a Java app then JMXTrans is a must-have. If you’re running software developed in-house then perhaps your developers are using something like the Coda Hale metrics library? Awesome if they are: stream those in via CollectD’s curl_json plugin.
Most of our users run online services and work in a DevOps world. Convince your developers to start using a StatsD client library inside their app. Real-time production data detailing code performance and business transactions is amazingly useful. A good development team will decorate their code, and over time you’ll get hundreds or even thousands of useful graphs out of your apps.
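The StatsD wire format is friendly enough that instrumenting code takes a few lines. Most languages have a proper client library, but even raw UDP works; a sketch in Python, with hypothetical metric names:

```python
import socket
import time

STATSD = ("localhost", 8125)  # StatsD's default UDP port
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def incr(name):
    # StatsD counter packet: "name:1|c"
    sock.sendto(("%s:1|c" % name).encode("ascii"), STATSD)

def timing(name, ms):
    # StatsD timer packet: "name:<milliseconds>|ms"
    sock.sendto(("%s:%d|ms" % (name, ms)).encode("ascii"), STATSD)

# Decorating a business transaction (names are made up):
start = time.time()
# ... handle the checkout ...
timing("shop.checkout.duration", int((time.time() - start) * 1000))
incr("shop.checkout.completed")
```

Because the packets are fire-and-forget UDP, the instrumentation costs almost nothing in the hot path even if the StatsD server is down.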
New Relic does a great job of introspecting various languages and, with the inclusion of a library, automatically streaming out metrics. The time is ripe for an open-source project that does the same thing but writes to Graphite for free. If nobody starts that project in the next 6 months we’ll probably create one at Outlyer and try to build a community around it.
So we’re up to potentially thousands of graphs already, and we haven’t even got browser metrics into the system. Getting thousands more data points from software like Bucky.js into Graphite sounds like a good next step.
Is that the end? Probably not: your app might connect to other online services. Mixpanel, Kissmetrics, Google Analytics, Amazon AWS counters, SendGrid, Mailgun, a vast array of social app data, etc. These online services all provide APIs that expose yet more metrics, which can be collected with simple Nagios scripts.
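Most of those are one short script away. A sketch of pulling a number out of a JSON API and emitting it as Nagios performance data; the endpoint and field are hypothetical, and real services will want auth tokens:

```python
#!/usr/bin/env python3
import json
import sys
import urllib.request

URL = "https://api.example.com/v1/stats"  # hypothetical endpoint

try:
    with urllib.request.urlopen(URL, timeout=10) as resp:
        stats = json.load(resp)
except Exception as e:
    print("STATS UNKNOWN - %s" % e)
    sys.exit(3)  # Nagios UNKNOWN

sent = stats.get("emails_sent", 0)  # hypothetical field
print("STATS OK - %d emails sent | emails_sent=%d" % (sent, sent))
sys.exit(0)
```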
One of our power users, Jason, runs a few online shops (selling wedding dresses), and we did a quick experiment. He’s got a fairly normal LAMP stack running WordPress. As a fun exercise we set ourselves the challenge of seeing how much data we could collect from a single box. In the space of an hour we had 4,000 graphs streaming into Outlyer. Now imagine how many graphs you’d get from 1,000 servers in production.
We’re living in a new world, the ‘information age’, and monitoring is getting extremely exciting.