When it comes to service status pages, most of us feel they are more marketing gimmick than fact. With Amazon Web Services, for example, the first you hear of a problem is rarely from the status page; it is when Twitter catches fire with people complaining about poor service. The trend is alarming, and it is not just Amazon: almost all service providers do the same thing. For some reason, special authorisation is required to update the status page. Special people need to confirm that this is the right marketing move for the business. That’s not how we work.
People need to trust a service. They want to feel they are getting information as it happens, not 30 or 40 minutes later, if at all. That is where status.io comes in for us. We needed a way to tell our users how our service was doing, and status.io lets us do that. Using the Outlyer platform, I wrote a statusio plugin that checks HTTP and TCP endpoints and reports back whether they’re working, every 30 seconds.
Let’s take a look at what V1 does. (You can find it in our GitHub plugin repo.)
This is only the first version and it has some flaws. The two main ones are that it does not time the TCP connections yet and it does not update the metrics on the status.io page yet. But it does work, just. All the config for this plugin is at the top:
[sourcecode language="python"]
# Config - change these bits, hope for best.
api_id = "your-api-id"
api_key = "your-api-key"
statuspage_id = "your-status-page-id"
checks = {
    'endpoint 1': {'id': "component-id", 'check_type': 'url', 'target': 'https://agent.outlyer.com'},
    'endpoint 2': {'id': "component-id", 'check_type': 'tcp', 'target': 'graphite.outlyer.com:2003'},
}
[/sourcecode]
After that, the plugin works out which containers each component belongs to and carries out the relevant HTTP or TCP check for each one. If any check fails, it updates that component (and its containers) on status.io, changing the component’s status to show there has been an issue. As a result, what you see on the status page is the current health of the platform. The statusio plugin currently also looks for maintenance mode: while maintenance is in effect, it will not update the status until the maintenance has completed. This means we can add planned maintenance via the API or the status.io website and the plugin will not overwrite it.
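As a rough illustration of the flow described above, here is a minimal sketch of the two check types and the maintenance-mode guard. It is not the plugin’s actual code: the function names (`check_url`, `check_tcp`, `run_checks`, `failed_components`), the 5 second timeout and the `in_maintenance` flag are all assumptions made for the example, and the call that pushes the result to status.io is deliberately left out.

```python
import http.client
import socket
import urllib.error
import urllib.request

TIMEOUT_SECONDS = 5  # assumed value; the post does not state a timeout


def check_url(target, timeout=TIMEOUT_SECONDS):
    """Return True if a GET on target ends with a 2xx or 3xx response."""
    try:
        with urllib.request.urlopen(target, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Covers HTTP error statuses, DNS failures, timeouts and bad URLs.
        return False


def check_tcp(target, timeout=TIMEOUT_SECONDS):
    """Return True if a TCP connection to 'host:port' can be opened."""
    host, _, port = target.rpartition(":")
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except (OSError, ValueError):
        # Refused, unreachable, timed out, or the target was not 'host:port'.
        return False


# Maps the 'check_type' values from the config to a check function.
CHECKERS = {"url": check_url, "tcp": check_tcp}


def run_checks(checks):
    """Run every configured check and return {check name: passed?}."""
    return {
        name: CHECKERS[check["check_type"]](check["target"])
        for name, check in checks.items()
    }


def failed_components(checks, results, in_maintenance=False):
    """Return the component ids to flag, or nothing during maintenance."""
    if in_maintenance:
        return []
    return sorted({checks[name]["id"] for name, ok in results.items() if not ok})
```

Passing the `checks` dictionary from the config to `run_checks` and feeding the result into `failed_components` gives the list of component ids that would need their status changed on status.io.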
One of the reasons we chose status.io to host our status page was that it was quick and easy. We were able to get something like this up and working in hours. Whenever we trigger an incident or planned maintenance, it takes care of notifying our users. The users are happy, and we’re happy, everyone is happy.
Now, to drive these checks we make use of our plugin distribution and scheduling system. After all, they are just scripts that run every 30 seconds. This does mean that some transient issues are currently lost, but you can then produce something like this:
There are a few issues in the current version so the plan is to do the following for the next version:
- Add TCP timer
- Add Metric update via the status.io API
- Add Incident Raise / Resolve
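The first item on that list, timing the TCP connections, could look something like the sketch below. This is only an illustration of the idea, not the planned implementation: the function name `timed_tcp_connect` and the 5 second default timeout are assumptions.

```python
import socket
import time


def timed_tcp_connect(host, port, timeout=5.0):
    """Return the TCP connect time in seconds, or None if the connect fails."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            # Only the connection handshake is timed; no data is exchanged.
            return time.perf_counter() - start
    except OSError:
        # Refused, unreachable or timed out: there is no meaningful timing.
        return None
```

Returning `None` on failure keeps a failed connection from being recorded as a very fast or very slow response time.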
Once that is there and stable we will trial it on our own status page before updating the plugin in the plugin repo. There would be no point in writing all this and then keeping the value locked away, so share and share alike! All you will need to do is drag the statusio plugin onto a tag that is applied to a single server and fill in the config. Then you too will have a cool status page that will magically update.
Once it’s running you can then create pretty dashboards like this to show the response times of the endpoints.