You may have heard the phrase “treat your servers like cattle and not like pets”.
A lot of people have embraced this mindset, and the rise of configuration management tools has helped everyone to think about their servers as being part of a specific environment and performing a particular role. We advise people to group their servers by product, environment and role, as this makes both deployment and monitoring vastly simpler. This way, when you want more capacity at peak times, you can spin up a few more worker nodes in production for your service. If you need a new test environment, just click a button. Have a DR project to set up a cold standby in an alternative cloud? Easy.
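To make that grouping concrete, here is a minimal sketch of hosts tagged by product, environment and role, with a selector over those tags. The host records and tag names are purely illustrative, not any particular tool’s data model.

```python
# A minimal sketch of tag-based host grouping (hypothetical data, not any
# particular tool's API): every host carries product/environment/role tags,
# and alert routing or capacity questions become simple filters.
from dataclasses import dataclass

@dataclass
class Host:
    name: str
    product: str
    environment: str
    role: str

hosts = [
    Host("web-01", "shop", "production", "web"),
    Host("wrk-01", "shop", "production", "worker"),
    Host("wrk-02", "shop", "staging", "worker"),
]

def select(hosts, **tags):
    """Return hosts matching every given tag, e.g. environment='production'."""
    return [h for h in hosts if all(getattr(h, k) == v for k, v in tags.items())]

# "How many production workers do we have for the shop product?"
prod_workers = select(hosts, product="shop", environment="production", role="worker")
print([h.name for h in prod_workers])   # ['wrk-01']
```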
So in a world where you supposedly don’t care about individual servers, I’ve seen a few worrying trends start to emerge. There is a tendency, usually from groups with a predominantly development-focused background, to think that you just need to throw a stream of time series data from a service at an endpoint. Then you look for deviations in your graphs, build complicated functions, and by keeping an eye on all this you’re fully monitored. Those of us who have been supporting production services for a while know this isn’t quite the full picture.
If we go back to basics (and ignore PaaS, which has largely failed), everything has to run somewhere: you either have physical boxes or virtual machines, with or without containers on them. These servers are your building blocks. They determine the capacity available to you, and even if you can rebuild them with the click of a button you need to know the current state of your infrastructure.
There are a few fundamental things you may want to know about your infrastructure. How many servers do you have? Are they all on? Has anyone wandered into the computer room, or logged into the admin console, and turned anything important off? These are questions easily answered by something like Nagios or the long list of tools traditionally used by operations teams. Usually you’d do a ping check or hit an SSH or agent port: there is a defined polling interval and a target that should always respond reliably, so it’s reasonable to conclude that if central polling fails, the server is probably hosed.
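For illustration, a minimal sketch of that traditional approach might look like the following: a fixed polling interval and a TCP check against the SSH port. The hostnames, interval and timeout are assumptions for the example.

```python
# A minimal sketch of the traditional liveness check: poll a port that should
# always answer (SSH here) on a fixed interval and flag hosts that stop
# responding. Hostnames and thresholds are illustrative.
import socket
import time

POLL_INTERVAL = 60   # seconds between checks
TIMEOUT = 5          # how long to wait for a connection

def is_alive(host, port=22, timeout=TIMEOUT):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def poll(hosts):
    while True:
        for host in hosts:
            if not is_alive(host):
                print(f"ALERT: {host} did not answer on port 22")
        time.sleep(POLL_INTERVAL)

# poll(["web-01.example.com", "wrk-01.example.com"])
```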
What happens when you simply monitor using streams? Clearly you just set up ‘no data’ alerting and everything works perfectly. But when you think about it, what does ‘no data’ actually mean? It simply means you didn’t get your metrics. Is that because there aren’t any metrics? What is the interval? Has your metrics sender died? (Yes, CollectD, I’m looking at you.) What happens when DNS dies and none of your metrics arrive?
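As a sketch of what ‘no data’ alerting actually buys you, the following tracks the last time each host sent anything and flags the quiet ones. The interval and grace period are assumptions, and note how little the result tells you about why the data stopped.

```python
# A minimal sketch of 'no data' alerting over a metric stream: track the last
# time each host sent anything and flag hosts that have gone quiet for longer
# than a few intervals. It cannot tell you whether the server died, the
# metrics sender died, or DNS broke somewhere in between.
import time

EXPECTED_INTERVAL = 60      # seconds between metric submissions
GRACE_INTERVALS = 3         # how many missed intervals before we alert

last_seen = {}              # host -> timestamp of the last metric received

def on_metric(host, name, value, timestamp=None):
    """Called for every incoming metric; here we only update last_seen."""
    last_seen[host] = timestamp or time.time()

def hosts_with_no_data(now=None):
    now = now or time.time()
    cutoff = EXPECTED_INTERVAL * GRACE_INTERVALS
    return [h for h, ts in last_seen.items() if now - ts > cutoff]

# A host that has never sent a metric isn't even in last_seen -- we cannot
# alert on a server we don't know exists, which is the next problem.
```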
Let’s assume the metrics are always sent by a 100% reliable sender. That solves one problem but opens up a new one.
What about the alerting? I’ve stopped receiving data for a server, so I must have that server defined in my monitoring tool in order to create an alert rule that detects it. That’s easy: we’ll define a server so we can alert when metrics stop coming in from that host. At this point we’re back to caring about servers again, and by dealing only with streams of data (or no data) coming in, we’re guessing whether the server is down. We also have to do some complicated work when servers are supposed to scale up and down. It’s one thing to know whether a server is alive; it’s another thing entirely to keep your monitoring tool up to date with rapidly changing environments and not receive a barrage of alerts.
At this stage, what if we agree we don’t need to know the current state of our environment? We don’t care how many servers we have, or how many are up or down, and our monitoring map doesn’t need to fit the terrain. All we care about is whether the service is up and performing well for users. Well, that’s a nice idea, but what happens in real life, in large-scale environments anyway, is that you end up in a total mess. You hopefully spent some time load testing your app in the past and from those figures determined the capacity you’d need. You provision the boxes and then have no idea what’s really going on. The great thing about software is that it can be running fine one minute, then you look away and, with nothing apparently changed, it stops working: you have hit some magical threshold and now you have issues. Having no clue what is going on and relying on ‘everything is working for users right now’ is a recipe for disaster.
While nobody likes staring at a status page full of green and red servers, or receiving the Host Down! emails, they are actually quite useful. Even in a world where servers are cattle, it is nice to be able to take a quick look at what went wrong before you shoot one in the head. Or even to know you have some servers left to shoot in the head.
We took the view a long time ago that being able to answer those fundamental questions matters, and that providing an accurate map of your environments as they change matters too. For that reason a lot of complexity has been pushed into the Outlyer agent, and we have various mechanisms for determining whether a server is alive or not. We have a built-in DNS client and numerous other keep-alive failsafes designed to ensure we know exactly what is going on and get as close as possible to knowing whether something is up or down. Not only that, we have presence detection by holding a websocket connection open, so we know instantly if there is a problem. Regardless of whether you use Outlyer or not, though, putting a decent amount of resiliency into knowing the state of a server is pretty important.
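As a rough sketch of the presence-detection idea (not Outlyer’s actual implementation), the following holds a long-lived connection open per agent and reacts the moment it drops. It uses a plain TCP connection rather than a websocket so the example stays standard-library only.

```python
# A minimal sketch of presence detection: each agent holds a long-lived
# connection open, and the moment it drops we know something is wrong with
# that host. A websocket would work the same way; this uses plain TCP.
import asyncio
import time

connected = {}   # host name -> time the connection was established

async def handle_agent(reader, writer):
    # The first line the agent sends identifies the host.
    host = (await reader.readline()).decode().strip()
    connected[host] = time.time()
    print(f"{host} is present")
    try:
        # Block until the agent disconnects (read() returns b'' at EOF).
        while await reader.read(1024):
            pass
    finally:
        connected.pop(host, None)
        print(f"ALERT: {host} dropped its connection")
        writer.close()

async def main():
    server = await asyncio.start_server(handle_agent, "0.0.0.0", 9000)
    async with server:
        await server.serve_forever()

# asyncio.run(main())
```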
All of the complexity of managing which servers are registered at any given point is also handled by the Outlyer agent. We did this because we got fed up with how complex everything gets when you try to wrap 15-year-old technology in a band-aid of config management. The complex automation that you would otherwise need to write alongside purely stream-based monitoring to keep server alerting up to date is taken care of by running agent commands, so servers can register and de-register themselves as they are spun up and down. We also designed agent fingerprinting so you can reliably tie metrics to hosts between rebuilds.
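A hypothetical sketch of that lifecycle is below: fingerprint the host from attributes that survive a rebuild, register on start-up, de-register on a clean scale-down. The API endpoint and payloads are invented for the example and are not Outlyer’s actual API.

```python
# A minimal sketch of the register/de-register lifecycle and fingerprinting.
# The endpoint and payloads are illustrative, not Outlyer's actual API: the
# point is that the agent announces itself on boot, removes itself on
# shutdown, and sends a fingerprint stable enough to survive a rebuild.
import hashlib
import json
import socket
import urllib.request

API = "https://monitoring.example.com/api/hosts"   # hypothetical endpoint

def fingerprint():
    """Derive a stable identity from attributes that survive a rebuild."""
    try:
        with open("/etc/machine-id") as f:
            seed = f.read().strip()
    except OSError:
        seed = socket.gethostname()                # fallback if no machine-id
    return hashlib.sha256(seed.encode()).hexdigest()[:16]

def _request(url, payload, method="POST"):
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"}, method=method)
    return urllib.request.urlopen(req)

def register():
    _request(API, {"fingerprint": fingerprint(), "hostname": socket.gethostname()})

def deregister():
    _request(f"{API}/{fingerprint()}", {}, method="DELETE")

# register() runs when the agent starts; deregister() runs on a clean
# scale-down so nobody gets paged for a server that was meant to disappear.
```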
Metric sending is actually only one very small piece of functionality when compared to presence detection, registration, de-registration and fingerprinting.
What people generally do in the real world is run something like Nagios alongside Graphite: separate systems joined together through custom scripts that attempt to layer graphs, for visualising and alerting on streams of data, on top of boolean checks. Unless you plan to do the same with your stream-processing tool, you’re missing half the picture.
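That glue usually looks something like the hypothetical check script below: query Graphite’s render API for a metric and translate it into a Nagios-style exit code. The Graphite host, metric target and thresholds are assumptions for the example.

```python
# A minimal sketch of the kind of glue script people write to join the two
# worlds: fetch a metric from Graphite's render API and turn it into a
# Nagios-style check result (exit 0 = OK, 1 = WARNING, 2 = CRITICAL,
# 3 = UNKNOWN). The Graphite host, target and thresholds are illustrative.
import json
import sys
import urllib.request

GRAPHITE = "http://graphite.example.com"
TARGET = "production.shop.web-01.load.shortterm"
WARN, CRIT = 4.0, 8.0

def latest_value(target):
    url = f"{GRAPHITE}/render?target={target}&from=-5min&format=json"
    with urllib.request.urlopen(url, timeout=10) as resp:
        series = json.load(resp)
    # Datapoints are [value, timestamp] pairs; take the newest non-null value.
    points = [v for v, _ in series[0]["datapoints"] if v is not None]
    return points[-1] if points else None

try:
    value = latest_value(TARGET)
except Exception as exc:
    print(f"UNKNOWN: could not query Graphite ({exc})")
    sys.exit(3)

if value is None:
    print("UNKNOWN: no data returned for target")
    sys.exit(3)
elif value >= CRIT:
    print(f"CRITICAL: load is {value}")
    sys.exit(2)
elif value >= WARN:
    print(f"WARNING: load is {value}")
    sys.exit(1)
print(f"OK: load is {value}")
sys.exit(0)
```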