Monitoring sucks, really

March 13, 2014 / Matthew Smith

Have you noticed…

In short, all monitoring out there sucks. I promised a few months back to do a review; I was wrong, it is not possible. Consider reviewing an industry-standard tool like Nagios: after several hours of installation I might have a server installed, but not in config management and with no users or servers to monitor… This is why these types of on-premise apps will die out.

Who wants to spend weeks working out the config and management of a system that is meant to make your life easier? Monitoring tools are, very simply put, meant to let you know if serverX is on… or off. Advanced details, like on for service X or off for service Y, come later.

The basic monitoring life cycle should go like this.

Day 1: is the server on or off?
Day 2: are the services I care about running?
Day 3: in X days, Y may happen.

These three things are important to monitoring; they give you some predictability in your service, so the sooner you have them the better. A good monitoring tool is one that lets you answer these questions as quickly as possible, from the time you purchase or download it to the moment it’s on your server. Quicker is better!
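To make those three days a bit more concrete, here is a rough Python sketch. It is not how any particular tool does it. The function names, the use of a TCP connect as a stand-in for "on or off", and the straight-line trend for "in X days Y may happen" are all my assumptions for illustration.

```python
import socket
from typing import Optional, Sequence, Tuple


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Day 1 / Day 2: True if a TCP connection to host:port succeeds.

    Assumption: a reachable SSH port (22) is a fair proxy for "server is on",
    and a reachable service port (e.g. 80) for "service is running".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def days_until_full(
    samples: Sequence[Tuple[float, float]], capacity: float
) -> Optional[float]:
    """Day 3: estimate days until usage hits capacity.

    samples are (day, amount_used) pairs. A least-squares straight line is
    fitted and extended forward; this assumes growth is roughly linear.
    Returns None when there is too little data or usage is not growing.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples)
    if var_x == 0:
        return None
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / var_x
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    day_full = (capacity - intercept) / slope
    # Measure from the most recent sample, never returning a negative value.
    return max(0.0, day_full - samples[-1][0])
```

With something like this, `port_is_open("serverX", 22)` answers day 1, `port_is_open("serverX", 80)` answers day 2, and `days_until_full` fed with daily disk readings answers day 3. A real tool would want smarter checks than a bare TCP connect, but the point is how little it takes to get a first answer.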

Bang for buck

I am acutely aware that monitoring tools that promise the world cost arms, legs, souls and pride; worse yet, they fail to deliver anything of value that you need. In the past I have seen £100k HP OpenView systems replaced in a couple of weeks by Nagios, and I’ve seen Nagios + Munin replaced by Opsview because it is easier to manage and configure than the two individual tools. For those that don’t know, Opsview is a nice front end and config piece for Nagios.

I have even, unfortunately, seen £2k a month wasted on 10 servers with New Relic. I guess the point is… monitoring costs anything from free to ridiculous; the key is always what does it do for you?
Does it make your life easier?
Can you work quicker with or without your monitoring tool?

On a side note… New Relic’s product is awesome, but if you are not using Java, why bother? If you are, you may find, like me, that your engineers find it useful but not irreplaceable… All I can say is it wasn’t as good as Nagios for the alerting and monitoring of the hosts, but it was definitely better at the application.

Where is the happy ground? You need something as configurable as Nagios, as cheap as Nagios, but most importantly not Nagios, and this leaves you in an awkward position.

Nagios is awesome and has some cool features, good support, many plugins etc. However, the server doesn’t scale easily, configuration is not as simple as it should be and, quite frankly, the web UI looks like a child vomited hatred on it; just plain ugly. So you naturally lean to Opsview, which takes away the config hassle of Nagios by providing Puppet modules and decent web UI config, but now you have to pay. Is it worthwhile? Definitely, it’s better than Nagios, but that isn’t good enough, is it? Certainly it’s a step in the right direction, but it’s not the killer tool.

Likewise, New Relic was meant to be that killer tool, designed for devs by devs. So, in short: complicated, non-standards-compliant and lacking in OS monitoring. So what is a sysadmin to do? Give up? I think not.

It comes down to this: you install tools like Opsview or CheckMK as they at least give you a better interface, but they don’t solve the issues of NRPE or firewall rules having to be opened in all directions. It’s for this reason I think there has to be a better way. I don’t want to think about or spend my time opening up rules; I want something simple and powerful.

There are new tools coming onto the market that, to me, sound better. Imagine being able to leverage the Nagios community while having an easy-to-drive UI on a monitoring tool that gave you the same power as Chef knife or Puppet marionette, while being able to update all of this through simple git commits or the web UI as you see fit. Writing a new monitoring check is done during the analysis process rather than as backlog, or you can simply utilise the RPC nature of the tool to debug issues in prod and write checks on the fly. Did I mention that while doing this it is also able to act like Pingdom and provide dashboards to management?
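For anyone wondering what "leveraging the Nagios community" means in practice, a Nagios-style check is just a script that prints one line of status and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). Here is a minimal sketch of one in Python. The thresholds, function names and message wording are mine, picked for illustration, and are not taken from any particular plugin or from the tools mentioned here.

```python
import shutil
from typing import Tuple

# Exit codes used by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(percent_used: float, warn: float = 80.0, crit: float = 90.0) -> Tuple[int, str]:
    """Map a disk usage percentage to (exit_code, status_line).

    The 80/90 thresholds are illustrative defaults, not a recommendation.
    """
    if percent_used >= crit:
        return CRITICAL, f"DISK CRITICAL - {percent_used:.1f}% used"
    if percent_used >= warn:
        return WARNING, f"DISK WARNING - {percent_used:.1f}% used"
    return OK, f"DISK OK - {percent_used:.1f}% used"


def check_disk(path: str = "/") -> Tuple[int, str]:
    """Check how full the filesystem holding path is."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # Could not read the filesystem, so report UNKNOWN rather than guess.
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    return classify(100.0 * usage.used / usage.total)


def main() -> int:
    """Print the status line and return the exit code for the scheduler."""
    code, message = check_disk()
    print(message)
    return code
```

To use it as an actual plugin you would finish the script with `sys.exit(main())`. Because the contract is that small, a check like this can be written while you are still analysing a problem and dropped into anything that speaks the same convention.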

So where does this leave us? Well, looking to tools like Dataloop.IO for solutions. I have had the privilege of using it while it is in closed beta, and they’ve been really good at taking on feedback to make it the monitoring platform I need it to be. It is getting close to being ready, and I’m genuinely excited about what is going to happen to this platform over the next year or two.