I recently attended Monitorama EU and wrote up some notes that others may find interesting. These were my takeaways from last week's conference and the main points that seemed to resonate:
Monitoring is still only as good as the people implementing it.
If you have spent months on your monitoring setup, predicted failure scenarios and created tests for them, then you'll get messages before something goes down. If you haven't, there is nothing out there that will do it for you.
All of the machine learning and clever algorithms for predicting failure from trends don’t work in reality.
The one caveat is if you know the context of the data. For a very narrow band of checks (disk space and temperature being two of the few examples given) you can have some success using an algorithm. In most other cases you have to review every single graph and try to fit one or many of the models.
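To make that concrete, here is a minimal sketch of the kind of narrow-context prediction that can work: fit a linear trend to disk-usage samples and estimate when the volume fills up. This is my own illustration, not something presented at the conference, and the sample data is made up.

```python
# Minimal sketch: fit a least-squares line to disk-usage samples and
# estimate when the disk will be full. This only works because we know
# the context of the data (usage grows roughly linearly); it is not a
# general failure predictor.
from datetime import datetime, timedelta

def predict_disk_full(samples, capacity_gb):
    """samples: list of (datetime, used_gb) pairs, oldest first."""
    t0 = samples[0][0]
    xs = [(t - t0).total_seconds() for t, _ in samples]
    ys = [used for _, used in samples]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    if slope <= 0:
        return None  # usage flat or shrinking: no predicted fill date
    return t0 + timedelta(seconds=(capacity_gb - intercept) / slope)

# Hypothetical usage: three daily samples of a 100 GB volume.
now = datetime(2013, 9, 23)
history = [(now - timedelta(days=2), 70), (now - timedelta(days=1), 75), (now, 80)]
print(predict_disk_full(history, capacity_gb=100))  # roughly four days away
```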
Nobody had any silver bullets or suggested stacks
The guy who ran the Obama for America campaign said that they just implemented everything they could that would give them feedback. They had literally hundreds of monitoring systems. Yet they still put a LOT of faith in feedback from ‘Power Users’. They would give certain users access to personal dashboards so they could monitor graphs. This worked exceptionally well as people would call up when they had a ‘gut feeling’ things were going bad. This typically resulted in the engineers digging into code and finding issues. Their culture of engineers proactively looking for bugs based on power user ‘hunches’ without requiring repeatable test cases was quite refreshing.
Empowering the Power Users
As mentioned above, these people were seen as key. Certain people naturally rise up and effectively become a monitoring system in their own right. These are typically the ones who call up or put messages onto Skype before any alarm bells start ringing. They should be treated as an asset and given dashboards, access to logs and any other info they are willing to consume.
A suggestion was made to add perceptual diffs to monitoring. Google made a P-Diff tool last year that screenshots the UI and compares the pixels between releases. This can detect the edge cases where you accidentally change the CSS on the Buy Now button and turn it white on a white background, and then don't notice until your sales plummet.
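As a rough illustration of the idea (not Google's actual tool), a perceptual diff can be as simple as comparing pixels between two screenshots with Pillow and flagging a release when too many of them changed. The filenames and threshold below are made up.

```python
# Rough sketch of a perceptual diff: compare two UI screenshots pixel by
# pixel and flag the release if more than a small fraction changed.
from PIL import Image, ImageChops

def screenshots_differ(before_path, after_path, max_changed_ratio=0.01):
    before = Image.open(before_path).convert("RGB")
    after = Image.open(after_path).convert("RGB")
    if before.size != after.size:
        return True  # layout change: treat as a diff straight away
    diff = ImageChops.difference(before, after)
    changed = sum(1 for px in diff.getdata() if px != (0, 0, 0))
    return changed / (diff.size[0] * diff.size[1]) > max_changed_ratio

if screenshots_differ("buy_now_before.png", "buy_now_after.png"):
    print("UI changed by more than 1% of pixels - flag for a human to review")
```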
Monitor more things
There were some funny anecdotes about running sentiment analysis on developer commit messages to measure happiness, and about integrating with Fitbit to see how healthy people on the team are.
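Purely as a toy illustration of the commit-message idea (the word lists and messages below are invented; a real setup would run a proper sentiment library over the VCS log):

```python
# Toy sketch: score commit messages against tiny positive/negative word
# lists to get a rough 'developer happiness' signal.
POSITIVE = {"fix", "fixed", "clean", "improve", "finally", "nice"}
NEGATIVE = {"hack", "broken", "ugly", "revert", "workaround", "sorry"}

def commit_sentiment(message):
    words = set(message.lower().split())
    return len(words & POSITIVE) - len(words & NEGATIVE)

for msg in [
    "Finally fix the checkout race condition",
    "Ugly workaround for the broken session cache, sorry",
]:
    print(commit_sentiment(msg), msg)  # prints 2 and -4
```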
Another cool angle was monitoring security by wrapping existing command line tools in Nagios check scripts that can detect things like rootkits (e.g. rkhunter) and open ports (e.g. nmap). AuditD was seen as a way to get very granular with the 'security monitoring'.
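For flavour, here is a sketch of what wrapping a command line tool in a Nagios-style check could look like, using rkhunter as the example. The flags are my assumption and vary by version; the exit codes follow the standard Nagios convention (0 OK, 1 warning, 2 critical, 3 unknown).

```python
#!/usr/bin/env python3
# Sketch of a Nagios-style check that wraps an existing CLI tool.
# Assumes rkhunter is installed and that --check/--skip-keypress/
# --report-warnings-only behave as on recent versions.
import subprocess
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_rkhunter():
    try:
        result = subprocess.run(
            ["rkhunter", "--check", "--skip-keypress", "--report-warnings-only"],
            capture_output=True, text=True, timeout=600,
        )
    except FileNotFoundError:
        print("UNKNOWN - rkhunter is not installed")
        return UNKNOWN
    except subprocess.TimeoutExpired:
        print("UNKNOWN - rkhunter timed out")
        return UNKNOWN
    if result.returncode == 0 and not result.stdout.strip():
        print("OK - no rootkit warnings reported")
        return OK
    print("CRITICAL - rkhunter reported warnings")
    return CRITICAL

if __name__ == "__main__":
    sys.exit(check_rkhunter())
```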
Log monitoring was seen as a massive pain point. One suggestion was to use Fluentd to parse logs into JSON format and use that data as the starting point for analysis. The main problem seems to be that there are too many log formats and parsers. Semantic logging seems like a long way off.
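To illustrate the 'parse logs into JSON' idea, here is a hand-rolled sketch (not a Fluentd config) that only understands the standard combined access-log format, which is exactly the limitation being complained about:

```python
# Sketch: turn an Apache/nginx combined-format access log line into JSON
# so downstream analysis starts from structured fields, not raw text.
import json
import re

COMBINED = re.compile(
    r'(?P<remote>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def log_line_to_json(line):
    match = COMBINED.match(line)
    if not match:
        return None  # unknown format: the real pain point
    return json.dumps(match.groupdict())

line = '10.0.0.1 - bob [23/Sep/2013:10:00:00 +0000] "GET /buy HTTP/1.1" 200 512'
print(log_line_to_json(line))
```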
As well as integrating security into everyday monitoring, some also pitched adding QA scripts into the system (with an oddly named QOS label). Integration testing means: is it good for production, whereas QOS scripts mean: is it good in production. I can see the value of running tests all of the time between environments and then comparing the results from time to time. One thing that is required for this is the ability to replay production traffic (and ideally state).
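A minimal sketch of the 'good in production' angle, assuming a hypothetical /health endpoint in each environment: run the same check everywhere and compare what comes back (in reality you would compare only the fields that should agree).

```python
# Sketch: run the same check against staging and production and compare.
# The URLs and the /health endpoint are hypothetical stand-ins.
import json
import urllib.request

ENVIRONMENTS = {
    "staging": "https://staging.example.com/health",
    "production": "https://www.example.com/health",
}

def run_checks():
    results = {}
    for name, url in ENVIRONMENTS.items():
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                results[name] = {"status": resp.status, "body": json.load(resp)}
        except Exception as exc:  # a failure is itself a result worth recording
            results[name] = {"error": str(exc)}
    return results

results = run_checks()
if results["staging"] != results["production"]:
    print("Environments disagree:", results)
```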
Configuration Management tools need more work
Some pain points were around context, i.e. improved audit trails, tags and annotations for changes on systems; being able to compare states between hosts; putting change activity into a human context (who / what / when); and being able to detect and flag 'drift', which is something we deal with by destroying boxes and building them from scratch each time.
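A tiny sketch of what flagging drift between hosts could look like, using hand-written package lists as stand-ins for data you would really pull from the CM tool or from something like dpkg-query on each host:

```python
# Sketch: compare installed-package versions between two hosts and flag
# anything that differs. The dicts are illustrative, not real host data.
def package_drift(host_a, host_b):
    drift = {}
    for pkg in set(host_a) | set(host_b):
        a, b = host_a.get(pkg), host_b.get(pkg)
        if a != b:
            drift[pkg] = (a, b)
    return drift

web01 = {"nginx": "1.4.1", "openssl": "1.0.1e", "curl": "7.29.0"}
web02 = {"nginx": "1.4.1", "openssl": "1.0.1c"}  # older openssl, curl missing
print(package_drift(web01, web02))
# {'openssl': ('1.0.1e', '1.0.1c'), 'curl': ('7.29.0', None)}
```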
Monitoring as a service
A few people mentioned that Ops should be providing 'monitoring as a service' to the entire business. Many groups like to base decisions off real data (product management, finance etc). Opening up system, application and business metrics in the context of a SaaS product makes sense. It was also noted that if this is done effectively, users will ask questions that Ops had never thought about, which leads to better monitoring overall.
For example, developers need metrics on db queries, page views and js errors, while product owners need data on content creation, content quality etc.
Some other pain points that I jotted down (ping me if any of these don't make sense):
- Good monitoring improves communication between groups
- Make decisions based on data and not gut feeling (although gut feeling is great for digging into an issue)
- Removes blame
- Brains prefer graphs to numbers
- Helps move from monolithic to SOA and continuous deployment
- Complex systems show emergent behaviour
- Fixed thresholds aren't good enough for alerts (see the rolling-window sketch after this list).
- Distinction between attention and alerts.
- Nice to have more data on attention even if not relevant.
- False positives are bugs. They reduce trust in the system.
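On the fixed-thresholds point above, here is a minimal sketch of one alternative: alert when the latest sample sits several standard deviations away from a rolling window of recent values. The window size, sigma and sample data are illustrative.

```python
# Sketch: dynamic threshold based on a rolling window instead of a
# fixed value. Alerts when the newest sample is an outlier relative to
# recent history.
from statistics import mean, stdev

def breaches_dynamic_threshold(values, window=60, sigmas=3.0):
    """values: time-ordered samples of a metric; checks the newest one."""
    history, latest = values[-(window + 1):-1], values[-1]
    if len(history) < 2:
        return False  # not enough history to say anything yet
    centre, spread = mean(history), stdev(history)
    return abs(latest - centre) > sigmas * max(spread, 1e-9)

cpu = [40 + (i % 5) for i in range(60)] + [95]  # steady load, then a spike
print(breaches_dynamic_threshold(cpu))  # True
```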
Summary
Overall I got the impression that we're on the right track. I think the age of Ops monitoring things in a silo is over, and opening up our tooling to other groups could bring some sizeable benefits.