Flexible monitoring, going up and down

The other day…

I wrote a post the other week about how much monitoring sucked, and a number of people on the internet (hello, people) just didn’t get it, so I thought more detail would be good. One point that was raised was the scaling up and down of servers and how that affects the monitoring platform. I want to cover this specifically, as it is important to understanding why I said I think Dataloop.IO is the answer.

Nagios + Puppet

Let’s look at a typical Puppet / Nagios approach. Puppet has the concept of exported resources: an exported resource can be collected by another server and then actioned. A cool thing to do is to have a manifest that describes a webserver and looks like this:

# /etc/puppetlabs/puppet/modules/nagios/manifests/target/apache.pp
class nagios::target::apache {
   @@nagios_host { $fqdn:
        ensure => present,
        alias => $hostname,
        address => $ipaddress,
        use => "generic-host",
   }
   @@nagios_service { "check_ping_${hostname}":
        check_command => "check_ping!100.0,20%!500.0,60%",
        use => "generic-service",
        host_name => "$fqdn",
        notification_period => "24x7",
        service_description => "${hostname}_check_ping"
   }
}

The double @ tells Puppet to send this resource to the Puppet database, where something looking for it can pick it up later. That is all the configuration needed to define a host and add a ping check. Once the resource is exported it waits on the server until it is collected. The collection looks like this:

# /etc/puppetlabs/puppet/modules/nagios/manifests/monitor.pp
class nagios::monitor {
    package { [ nagios, nagios-plugins ]: ensure => installed, }
    service { nagios:
        ensure => running,
        enable => true,
        #subscribe => File[$nagios_cfgdir],
        require => Package[nagios],
    }
    # collect resources and populate /etc/nagios/nagios_*.cfg
    Nagios_host <<||>>
    Nagios_service <<||>>
}

The spaceship (<<||>>) tells Puppet to look for resources of that type in the exported resources database, in this case Nagios_host and Nagios_service resources. This is cool: it means a server that previously had no information about another can now do something useful with the specific information that server provides. This is a good fit for adding new hosts or service checks to Nagios, so let’s look at how you remove them next:

N/A

Seriously… If you want to remove a host you have to do the following: reconfigure the host in Puppet so it no longer exports, purge the DB of its previous exports, then re-run Puppet on the Nagios server to re-add all the resources except the one you removed. Sounds fun. You could probably make it work if you knew the server was going to be shut down. If you don’t believe me, see this. That’s as good as it gets, sorry.
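To make the stale-export problem concrete, here is a toy model in plain Ruby. It is only a sketch of the behaviour described above, not how Puppet is implemented: `ExportStore`, `export`, `purge` and `collect` are made-up names standing in for the exported resources database and the collector.

```ruby
# Toy model only: ExportStore stands in for the exported resources database
# and collect stands in for `Nagios_host <<||>>`. None of these are Puppet APIs.
class ExportStore
  def initialize
    @exports = {} # exporting node name => list of resource titles
  end

  # A node's Puppet run replaces whatever that node exported last time.
  def export(node, resources)
    @exports[node] = resources
  end

  # The manual purge step: forget everything a node has exported.
  def purge(node)
    @exports.delete(node)
  end

  # The collector on the Nagios server sees every stored export.
  def collect
    @exports.values.flatten.sort
  end
end

store = ExportStore.new
store.export("web1", ["nagios_host[web1]"])
store.export("web2", ["nagios_host[web2]"])

# web2 is terminated by the auto scaling group. It never runs Puppet again,
# so nothing tells the store that its export is stale.
after_termination = store.collect

# Only an explicit purge, followed by another run on the Nagios server, drops it.
store.purge("web2")
after_purge = store.collect

puts after_termination.inspect
puts after_purge.inspect
```

The point of the sketch is that a dead server cannot withdraw its own export, so something else has to do the purge on its behalf.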

The real problem

With the uptake of utility-based computing, servers come and go and we should no longer be precious about them. I always give the same answer when someone in the team asks what we should call the new server.

These are farm animals not pets

What do I mean by that? Well, I don’t care what it’s called or even if it exists; if it causes me any problems I will shoot it in the head and get a new one. Let’s look at webservers in an auto scaling group: I sometimes have 3, sometimes 3000. Trying to manage that flexibility in Puppet will work for scaling up, and I’m sure there’s a way to manage the scale down (if anyone has a way I’d be interested in hearing it).

So why is Dataloop.IO better? Well, I think it’s better because I can draw a simple hierarchy in the web UI and take a tag, say ‘web’, and add it to the ‘web servers’ service. When I then install Dataloop.IO using Puppet, Chef or the setup.sh method, I have to provide a few details: an API key and an optional tag or list of tags. Assuming the configuration is done correctly, there will be a ‘web server’ role that all web servers collect from, and I just put the tag in there. Hey presto, the server(s) connect to Dataloop.IO in the right container and then download all of their checks. Let’s cover a few examples:

name "web"
description "Web server Role for configuring servers"
run_list(
  'recipe[apache]',
  'recipe[dataloop]'
)
default_attributes(  { "dataloop" =>
                          { "agent" => {
                              "api_key" => "someapikey",
                              "tags" => "web"
                            }
                          }
                      }
                    )

I made this more verbose on purpose; in reality Dataloop.IO should be included in a base, with a simple override of the tags attribute here. The above is the entire configuration needed to have servers dynamically add all their checks, spin up and down, and de-register themselves from the central service as needed, so you only have servers in Dataloop.IO that are turned on. “So what happens when the power is yanked?” I hear you cry. Well, you get an alert, as you’d expect. It is only when the server is shut down cleanly, rather than having its power cord yanked, that it de-registers.
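The difference between a clean shutdown and a yanked power cord can be sketched as a small state model. This is a guess at the shape of the behaviour rather than the Dataloop.IO API: `Registry`, `register`, `clean_shutdown` and `missed_heartbeat` are hypothetical names, and the heartbeat itself is an assumption about how a missing server is noticed.

```ruby
# Toy model of the behaviour described above. All names are hypothetical
# and the heartbeat mechanism is assumed, not taken from Dataloop.IO.
class Registry
  attr_reader :agents, :alerts

  def initialize
    @agents = {} # agent name => tags it registered with
    @alerts = []
  end

  def register(name, tags)
    @agents[name] = tags
  end

  # A clean shutdown lets the agent de-register, so it disappears quietly.
  def clean_shutdown(name)
    @agents.delete(name)
  end

  # A server that vanishes without de-registering is still expected, so it alerts.
  def missed_heartbeat(name)
    @alerts << "#{name} is not responding" if @agents.key?(name)
  end
end

registry = Registry.new
registry.register("web1", ["web"])
registry.register("web2", ["web"])

registry.clean_shutdown("web1")   # scaled down normally
registry.missed_heartbeat("web1") # no alert, it is no longer registered
registry.missed_heartbeat("web2") # power cord yanked

puts registry.agents.keys.inspect
puts registry.alerts.inspect
```

Scaling down never pages anyone, while an unexpected disappearance still does, which is the behaviour you want from an auto scaling group.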

Let’s look at the bash equivalent. Let’s say you need a server to have monitoring on it in the next 5 seconds!

curl -s https://download.dataloop.io/setup.sh | sudo bash -s <API_KEY> web

That achieves the same as the Chef example above. Because the configuration of the monitoring is done in Dataloop.IO, the agents are all simple: they just need some auth to connect back in (the API key). From there you can drag them into service groups, add tags, or add whatever plugins you need. If you tag the group and apply the plugins to the tag, then any agent that specifies that tag will get all the relevant plugins. You can also layer as many of these tags on top of each other as you like; the agent will just work it out in real time.
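One way to picture the tag layering is as a union of the plugins attached to each tag. The sketch below assumes exactly that; the tag names, plugin names and the `plugins_for` helper are made up for illustration and are not taken from Dataloop.IO.

```ruby
# Hypothetical data: the tag and plugin names are invented for this example.
PLUGINS_BY_TAG = {
  "base" => ["load", "disk"],
  "web"  => ["apache_status", "disk"],
  "db"   => ["mysql_replication"]
}.freeze

# Assumed rule: an agent's checks are the union of the plugins attached to
# each of its tags. Unknown tags contribute nothing, duplicates are dropped.
def plugins_for(tags)
  tags.flat_map { |tag| PLUGINS_BY_TAG.fetch(tag, []) }.uniq.sort
end

puts plugins_for(["base", "web"]).inspect
```

Under this assumption the agent carries no check configuration of its own; adding a plugin to the ‘web’ tag is enough for every agent with that tag to pick it up.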

Summary

Yes, you can scale dynamically up and down with Nagios and Puppet or Chef, but most of these tools rely on servers being on all the time, i.e. they are not cloud centric and are more suited to the enterprise, where they still name their pets… Dataloop.IO doesn’t come with that sort of baggage: no firewall rules, and it is quick and easy to set up and use, as it should be. If you’re still not convinced, I understand; watch this video first: