Sentinel update

Many moons ago…

A while back I started to mention the idea of self healing systems: a dedicated system that makes use of monitoring and real time system information to make intelligent decisions about what to do, i.e. I write a complicated program to gradually replace myself. It was suggested that I could use hooks in Nagios to do the tasks, but that misses the intelligence side of what I’m trying to get to; restarting based on Nagios checks is simply an if statement that does something on a certain condition. Sentinel will be more than that.

Back in April I started Sentinel as an open source project. As expected the uptake has been phenomenal: absolutely no one has even looked at it :) Either way I am not deterred. I have been on and off re-factoring Sentinel into something a bit more logical, and I have gone from 3 files to some 13, from 1411 words to 2906, and I even have one fully working unit test! I don’t think I’ll be writing more tests for now, as at the moment they are not really helping me get to where I want to be quickly, but I know I’ll need them at some point!

So far all I have done is split out some of the code to give it structure and added the odd bit here and there. The next thing I need to start doing is make it better; there are a number of options:

  • Writing more providers for it so it can start to manage disks, memory etc etc so it’s a bit more useful
  • Sorting out the structure of the code, adding in more error handling / logging and resilience
  • Integration with Nagios or some tool that already monitors system health and use that to base actions off of
  • Daemonize Sentinel so it runs like a real program!
  • Configuration file rather than CLI

What to do

I think for me I’d rather sort out the structure of the code and improve what is already there first. I’m in no rush with this, so the least I can do is make what I have less hacky. This also gives me the opportunity to start working out how I’d rather have the whole thing structured.

I did look at writing a plugin framework so it would be possible to just drop in a module or something similar and it would have the correct information about how to manage whatever it was written to do, but I figured that was a bit beyond me at this time and I have better things to do!

After that comes the configuration file and daemonizing the application; the main reason for this is to identify any issues with it running continually, and any issue there would be nice to know about sooner rather than later.

This then leaves more providers and Nagios-type integration, which I’m sure will be fun.

Give it AI!

Once those items are done that leaves Sentinel with one more thing to do: start intelligently working out solutions to problems. Obviously I don’t know the right way to tackle this, but I do have a few ideas.

In my head… I think about how I would solve an issue, and inevitably it starts with gathering information about the system. But how do you know what information is relevant to which problems, and how much weighting should it have? Well, for starters I figure each provider would return a score for how healthy it thinks it is. So for example:

A provider for checking the site is available notices that it’s not available; this produces a score that is very high, say 10000. It then makes sure it’s got the latest information from all providers on the server. One of those providers is disk, which notices one of the volumes is 67% full, but the thresholds have been set to warn at 70 and 95%, so it sets a score of say 250 and is ranked in a list somewhere to come back to if all else fails.

At this point it is unlikely that disk is the culprit; we have to assume that whoever set the thresholds knew something about the system, so more information is needed. It checks the local network and gets back a score of 0: as far as the network provider can tell it’s working fine, it can get to localhost, the gateway and another gateway on the internet. A good test at this point is to try and work out at which layer of the OSI model the issue is, so one of the actions might be to connect to port 80 or 443 or both and see what happens. Is there a web response or not? If there is, does it have any words in it, or a response code that suggests it’s a known web error like a 500, or does it not get connected at all?

And so on and so forth. This would mean that wherever this logic exists it has to make associations between results and the following actions. One of the ways to do this is to “tag” a provider with potential subsystems that could affect it, then based on the score of each of the subsystems produce a vector of potential areas to check; combined with the score it’s possible to travel the vector and work out how likely each is to fix the issue, and as each one produces a result it either dives into a new vector, more detailed or not. It would then, in theory, be possible to start making correlations between these subsystems: say the web one requires disk and networking to be available, and both networking and disk require CPU, then it can assume that the web one needs CPU too, and based on how many of these connections exist it can rank it higher or lower, much in the same way a search engine would work.
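To make that a little more concrete, here is a rough Ruby sketch of the scoring idea. None of this is Sentinel’s actual code; the class names, thresholds and scores are purely illustrative:

# Illustrative sketch only: how providers might report weighted health scores.
# The class names, thresholds and numbers are hypothetical, not Sentinel's real API.
class DiskProvider
  WARN = 70
  CRIT = 95

  def initialize(percent_used)
    @percent_used = percent_used
  end

  def score
    return 10_000 if @percent_used >= CRIT   # critical - shout loudly
    return 2_500  if @percent_used >= WARN   # warning threshold breached
    250                                      # healthy-ish, kept low in the ranking for later
  end
end

class WebProvider
  def initialize(available)
    @available = available
  end

  def score
    @available ? 0 : 10_000
  end
end

# Rank providers by how suspicious they look and investigate in that order
providers = { "web" => WebProvider.new(false), "disk" => DiskProvider.new(67) }
providers.sort_by { |_name, p| -p.score }.each do |name, p|
  puts "#{name}: #{p.score}"
end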

But all of this is for another day, today is just about saying it’s started and I hope to continue on it this year.

Technology archeology

What!?

We’ve all been on an archeological dig from time to time, faced with an environment that no one knows or an installation no one understands, and all we’re left to do is dig our way through the evidence to uncover the truth; and much like real archeological digs we only ever come to a truth that we understand, not the actual truth.

I don’t think there’s really anything that can be done to mitigate these situations. Sure, you could make sure you document the system fully, write every little detail down and have a structured document a child could follow; no one will read it until after the catastrophe it could have saved happens, so you can’t document yourself out of the problem.

Well, there are always handovers and training, and for that to work you just need to get someone to have the same experiences and history as yourself, then spend two or three months going over the same stuff, and then there’s a small chance that they may remember some of that knowledge when they need to.

So you can’t write it down, you can’t hand it over and you can’t be around to deal with the problem yourself, so what can you do?
Not much; it depends on how much time you have, but the only real way to hand things over is to let someone else deal with them when they fail.

Planning for digs

So how can this be mitigated? Well, I believe standardisation, problem solving and starting over are the best things to do. Let’s look at them in a little more detail.

Standards

Everyone has standards, some more than others; whatever the standard is, stick to it, whether you agree with it or not. Have a standard way of implementing cron jobs, i.e. always putting them in /etc/cron.d/ rather than crontab or a mixture of cron.d and cron.daily; put all of your shell scripts into a common directory; always write scripts in a certain language, even if it means it will take longer and you have to learn something new…

This doesn’t sound like much, but it means everyone knows where to look to start dealing with problems; there are no special hidden-away super squirrel scripts somewhere that people won’t find.

Problem Solving

Quite simply: get better. Solving problems quickly and accurately is a difficult skill, but practice makes perfect and the more you do it the easier it gets. This is useful for dealing with problems that are unexpected, and the best way to get better is to make sure the person that set it up isn’t the person debugging it all the time; share the workload around and let people be in the deep end struggling while you’re around to help, rather than struggling on their own.

Starting over

The biggest barrier to “adopting” a legacy system and owning it for problem analysis is that the people trying to support it no longer care: they didn’t put it in, it was put in badly by a bunch of cowboys, with no documentation or poorly written docs and no handover! It doesn’t matter how good the docs are, you can’t document everything so you always assume a basic understanding, and you can’t tell people that don’t want to listen. So the easiest way to get new people to want to look after systems they didn’t set up or had no part in is to let them do it their way.

So, during the first few months introduce them to the system, explain it all and accept the little chirps about there being a better way; at this point they should have a good understanding of the current system, and maybe the chirps will disappear, maybe they won’t. If there’s still a lack of adoption it may be worth pretending there are problems with the system and that it’s worth looking at a “better” way of doing it; let them lead the technology charge and just make sure that the new system provides the same functionality as before. Now all you need to do is learn their system!

Summary

I’m not saying don’t document… I am saying don’t spend a long time on it: bullet points, basic pointers and directions, backup / restore steps; enough that if someone skim reads it looking for information it at least takes them to the next step. I’m also not saying don’t do handovers, do them! Just accept that it’s in one ear and out the other, but at least you tried.
Bear in mind that people that are new to the group have big ideas (like you did) and want the best solution; the thing they’re normally missing is the history, the “why was it done that way?”. They need that history so they can understand why something may be bad and why it was not made better at the time.

Hopefully these little things will help the technological archeological dig not be as deep or take as long.

Simple facts with Puppet

It’s not that hard

When I first started looking at facter it was magic: things just happened, and when I ran facter a list of variables appeared, all of which are available to use within puppet modules / manifests to help make life easier. After approximately 2 years of thinking how good they were and how nice it would be to have my own, I finally took the time to look at it and try to work it out…

For those of you that don’t know, facter is a framework for providing facts about a host system that puppet can use to make intelligent decisions about what to do; it can be used to determine the operating system, its release, local IPs etc. This gives you flexibility in puppet to do things like choose which packages to install based on Linux distribution, or insert the local IP address into a template.

Writing Facts

So, let’s look at a standard fact that ships with facter, so you can see the complexity involved and understand why, after glancing at it, I never went much further.

# Fact: interfaces
#
# Purpose:
#
# Resolution:
#
# Caveats:
#

# interfaces.rb
# Try to get additional Facts about the machine's network interfaces
#
# Original concept Copyright (C) 2007 psychedelys <psychedelys@gmail.com>
# Update and *BSD support (C) 2007 James Turnbull <james@lovedthanlost.net>
#

require 'facter/util/ip'

# Note that most of this only works on a fixed list of platforms; notably, Darwin
# is missing.

Facter.add(:interfaces) do
  confine :kernel => Facter::Util::IP.supported_platforms
  setcode do
    Facter::Util::IP.get_interfaces.collect { |iface| Facter::Util::IP.alphafy(iface) }.join(",")
  end
end

Facter::Util::IP.get_interfaces.each do |interface|

  # Make a fact for each detail of each interface.  Yay.
  #   There's no point in confining these facts, since we wouldn't be able to create
  # them if we weren't running on a supported platform.
  %w{ipaddress ipaddress6 macaddress netmask}.each do |label|
    Facter.add(label + "_" + Facter::Util::IP.alphafy(interface)) do
      setcode do
        Facter::Util::IP.get_interface_value(interface, label)
      end
    end
  end
end

This is lifted straight from facter. The first block simply provides a comma separated list of interfaces (eth0, eth1 and so on); the second block then adds a fact for each detail of each interface.

Now, when I started looking at facter I knew no ruby and it was a bit daunting, but alas I learnt some, and I never bothered looking at facter again until my boss managed to simplify a fact down to its bare essentials, which is the one line…

Facter.add("bob") { setcode { "bob" } }

From this point onwards all you need to do is learn enough ruby to populate that appropriately, or use bash to get the details and populate the fact; in the next example I just grab the pid of apache from ps.

apachepid=`ps -fu apache | grep apache | awk '{ print $2}'`

Facter.add(:apachepid) {
	setcode { apachepid }
}

So if you know bash and you can copy and paste, you can do something like the above. Now, this is ruby, so you can do far more complex things, but that’s not for today.

Okay, so now something more complex is needed… What if you’re in Amazon, use the Tags on your EC2 instances and want to use them in puppet? Well, you can just query Amazon and use the result, although that will take forever and a day to run as AWS is not the quickest. This is an issue we had to overcome, so we decided to run a script that would query Amazon in its own time and write the tags onto the file system, at which point we can read them quickly with facter.

So first, a shell script.

#!/bin/bash
source /path/to/aws/config.inc

# Grab all tags
IFS=$'\n'
for i in $($EC2_HOME/bin/ec2-describe-tags --filter "resource-type=instance" --filter "resource-id=`facter ec2_instance_id`" | grep -v cloudformation | cut -f 4-)
do
        key=$(echo "$i" | cut -f1)
        value=$(echo "$i" | cut -f2-)

        if [ ! -d "/opt/facts/tags/" ]
        then
                mkdir -p /opt/facts/tags
        fi
        if [ -n "$value" ]
        then
                echo "$value" > "/opt/facts/tags/$key"
                /usr/bin/logger "set fact $key to $value"
        fi
done

So this isn’t the best script in the world, but it is simple: it pulls a set of tags out of Amazon and stores them in a directory where the file name is the tag name and the content is the tag value.
So now we have the facts locally, gathered with bash, something we’re all a bit more familiar with; we can then take facter, which is alien ruby, force some bash inside it and still generate facts that provide value:

tags=`ls /opt/facts/tags/`.split("\n")

tags.each do |keys|
        value = `cat /opt/facts/tags/#{keys}`
        fact = "ec2_#{keys.chomp}"
        Facter.add(fact) { setcode { value.chomp } }
end

The first thing we do is produce a list of tags (directory list) and then we use some ruby to loop through it and yet more bash to get the values.
None of this is complicated, and hopefully these few examples are enough to encourage people to start writing facts, even if they are an abomination to the ruby language; at least you get value without needing to spend time understanding or learning ruby first.

Summary

Facts aren’t that hard to write, and thanks to being ruby you can make them as complicated or as simple as you like, and you can even break into bash as needed. Now a caveat: although you can write facts quickly with this half bash / half ruby mix, just learning ruby will make life easier in the long run; you can then start to incorporate more complex logic into the facts to provide more value within puppet.
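As an example, the EC2 tags fact from earlier can be written in plain Ruby without shelling out at all; this is just a sketch of the same idea using Dir and File, assuming the same /opt/facts/tags layout:

# Pure-Ruby sketch of the EC2 tags fact above - same idea, no ls/cat
tag_dir = "/opt/facts/tags"

if File.directory?(tag_dir)
  Dir.glob(File.join(tag_dir, "*")).each do |path|
    key   = File.basename(path)
    value = File.read(path).chomp
    Facter.add("ec2_#{key}") { setcode { value } }
  end
end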

A useful link for facter, in case you feel like reading more.

Puppet without barriers - part three

The examples

Over the last two weeks (part one & part two) I have slowly been going into detail about module set up and some architecture; well, now it’s time for the real world.

To save me writing loads of puppet code I am going to abbreviate and leave some bits out. First things first a simple module.

init.pp

class javaapp ( $conf_dir = $javaapp::params::conf_dir) inherits javaapp::params {

  class {
    "javaapp::install":
    conf_dir => $conf_dir
  }

}

install.pp

class javaapp::install ( $conf_dir = $javaapp::params::conf_dir ) inherits javaapp::params {

 package {
    "javaapp":
    name => "$javaapp::params::package",
    ensure => installed,
    before => Class["javaapp::config"],
  }

  file {
    "/var/lib/tomcat6/shared":
    ensure => directory,
  }

}

config.pp

class javaapp::config ( $app_var1 = $javaapp::params::app_var1,
                        $app_var2 = $javaapp::params::app_var2 ) inherits javaapp::params {

  file {
    "/etc/javaapp/javaapp.conf":
    content => template("javaapp/javaapp.conf"),
    owner   => 'tomcat',
    group   => 'tomcat'
  }
}

params.pp

class javaapp::params ( ) {

$conf_dir = "/etc/javaapp"
$app_var1 = "1.2.3.4/32"
$app_var2 = "host.domain.com"

}

One simple module in the module directory. As you can see I have put all the parameters into one file; it used to be that you’d specify the same defaults in every file, so in init and config you would duplicate the same variables. That is just insane, and if you have complicated modules with several manifests in each one it gets difficult to maintain all the defaults. This way they are all in one file and are easy to identify and maintain. It is by no means perfect, and I’m not even sure it’s an approach puppet officially supports (if not, that is a failing of puppet), but it does work with the latest 2.7.18 and I’m sure I’ve had it on all 2.7 variants at some point.

You should be aiming to set a sensible default for every parameter regardless; if you want to enforce that a variable is always passed in, you can still leave it out of params and specify the parameter without a default.

Now the /etc/puppet directory

Matthew-Smiths-MacBook-Pro-2:puppet soimafreak$ ls
auth.conf	extdata		hiera.yaml	modules
autosign.conf	fileserver.conf	manifests	puppet.conf

The auth, autosign and fileserver configs will depend on your infrastructure, but the two important configurations here are puppet.conf and hiera.yaml.

puppet.conf

[master]
certname=server.domain.com
modulepath = /etc/puppet/modules 
[main]
    # The Puppet log directory.
    # The default value is '$vardir/log'.
    logdir = /var/log/puppet

    # Where Puppet PID files are kept.
    # The default value is '$vardir/run'.
    rundir = /var/run/puppet

    # Where SSL certificates are kept.
    # The default value is '$confdir/ssl'.
    ssldir = $vardir/ssl

    autosign = true
    autosign = /etc/puppet/autosign.conf

[agent]
    # The file in which puppetd stores a list of the classes
    # associated with the retrieved configuration.  Can be loaded in
    # the separate ``puppet`` executable using the ``--loadclasses``
    # option.
    # The default value is '$confdir/classes.txt'.
    classfile = $vardir/classes.txt

    # Where puppetd caches the local configuration.  An
    # extension indicating the cache format is added automatically.
    # The default value is '$confdir/localconfig'.
    localconfig = $vardir/localconfig
    pluginsync = true

The only real change worth making to this is in the agent section: pluginsync ensures that any plugins you install in puppet, like Firewall, VCSRepo, hiera etc, are loaded by the agent. Obviously on the agent you do not want all of the master config at the top.

Now the hiera.yaml file

hiera.yaml

---
:hierarchy:
      - "%{env}"
:backends:
    - yaml
:yaml:
    :datadir: '/etc/puppet/extdata'

Okay, to the trained eye this is sort of pointless: it tells puppet that it should look in a directory called /etc/puppet/extdata for a file called %{env}.yaml, so in this case if env were to equal bob it would look for the file /etc/puppet/extdata/bob.yaml. The advantage of this is that at some point, if needed, the hierarchy could be changed to, for example:

hiera.yaml

---
:hierarchy:
      - common
      - "%{location}"
      - "%{env}"
      - "%{hostname}"
:backends:
    - yaml
:yaml:
    :datadir: '/etc/puppet/extdata'

This basically provides a location for all the variables that you are not able to tie down to a role, roles being defined by the manifests.
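For example, the env-level file mentioned earlier might contain nothing more than a handful of key/value pairs; the file name and values below are made up purely for illustration:

# /etc/puppet/extdata/bob.yaml - hypothetical data for an env called "bob"
app_var1: "10.0.0.1/32"
app_var2: "bobhost.domain.com"

A hiera('app_var1') call in a role manifest would then pick the value up from here.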

Matthew-Smiths-MacBook-Pro-2:puppet soimafreak$ ls manifests/roles/
tomcat.pp	default.pp	app.pp	apache.pp

So we’ll look at the default node and tomcat to get a full picture of the variables being passed around.

default.pp

node default {
	
	#
	# Default node - base packages for all systems
	#

  # Define stages
  class {
    "sshd":     stage =>  first;
    "ntp":      stage =>  first;
  }
	
  # Needed for Facter to generate OS related information
  package {
    "redhat-lsb":
    ensure => "installed"
  }

  # mcollective
  class {
    "mcollective":
    mc_password   => "bobbypassword6",
    puppet_server => "puppet.domain.com",
    activemq_host => "mq.domain.com",
  }

  # Manage puppet
  include puppet
}

As you can see, this default node sets up some classes that must be on every box and ensures that vital packages are also installed. If you feel the need to take this further you could have the default node inherit another node; for example you may have a company manifest as follows:

company.pp

node company {
$mc_password = "bobbypassword6"
$activemq_host = "mq.domain.com"

$puppet_server = $env ? {
    "bob" => 'bobpuppet.domain.com',
    default => 'puppet.domain.com',
  }
}

This company node manifest could be inherited by the default node, and then instead of having puppet_server => “puppet.domain.com” you could have puppet_server => $puppet_server, which I think is nice and clear. The only recommendation is to keep your default and your role manifests as simple as possible and try to keep if statements out of them. Can you push the decision into hiera? Do you have a company.pp where it would be sensible to put some env logic? Are you able to take some existing logic and turn it into a fact?
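As an example of that last point, the $env selector above could be fed by a custom fact rather than manifest logic; this is only a sketch and assumes your hostnames encode the environment as a prefix:

# Hypothetical custom fact: derive an "env" fact from the hostname,
# assuming hostnames look like bobweb01, prodapp02 and so on
Facter.add(:env) do
  setcode do
    case Facter.value(:hostname)
    when /^bob/  then "bob"
    when /^prod/ then "prod"
    else "dev"
    end
  end
end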

Be ruthless and push as much logic out as possible, use the tools to do the leg work and keep puppet manifests and modules simple to maintain.

Now finally the role,

tomcat.pp

node /^tomcat.*/ inherits default {

  include tomcat6

  # Installs java app using the init/install classes and default params, 
  include javaapp

  class {
    "javaapp::config":
      app_var1 => hiera('app_var1'),
      app_var2 => $fqdn,
  }
}

The role should be “simple”, but it also needs to make it clear what it’s setting. If you notice that several roles use the same class and in most cases the params are the same, change the params file and remove the options from the roles; try to keep what is in the roles down to overrides only, as minimal as possible. The hiera vars, and any set in the default / other inherited nodes, can all be referenced here.

Summary

Hopefully that helps some of you understand the options for placing variables in different locations within puppet. As I mentioned in the other posts, this method has three files: the params, the role and the hiera file, that’s it. All variables are in one of those three, so there’s no need to hunt through all of the manifests in a module to identify where a variable may or may not be set; it is either defaulted or overridden, and if it’s overridden it will be in the role manifest, from there you can work out if it’s in your default or hiera and so on.

GNU Parallel – When bash just isn’t quick enough

Bash not cutting it for you?

We were looking at using nrpe to do smoke tests of our product after a deployment, to give us a feel for how well the system fared with the deployment. Well, we wanted it to run quicker, and anyone that has performance tuned bash scripts knows that you strip out as much as possible and use the built-in functions, as they are quicker.

Well, we had done that and still it wasn’t quick enough. My boss came across GNU parallel; luckily the documentation was pretty good, but we still had fun playing with it.

Imagine a situation where you want to run the following commands

echo 1 1
echo 1 2
echo 1 3
echo 2 1
echo 2 2
echo 2 3

Typically to do this you may write some bash loops to achieve this as follows

for i in {1..2}
do
    for x in {1..3}
    do
        echo "echo $i $x"
    done
done

This is quite a simple example, so let’s look at something more useful; how would you run the following commands?

service tomcat6 stop
service tomcat6 start
service httpd stop
service httpd start
service nscd stop
service nscd start

Well, if you didn’t realise, it’s the same loop as above just with slightly different values… but what if I said you could do all of that on one line with no complicated syntax? This is where parallel fits in; it’s quite useful.

parallel service {1} {2} ::: tomcat6 httpd nscd ::: stop start

Now it’s worth mentioning: never actually do the above, because the stops and starts will run in parallel in no guaranteed order; read on…

So, having done parallel a great disservice by reducing it to argument replacement, it is probably worth mentioning you can do some other funky stuff, which you can find on the documentation page and which may be more applicable to your needs.

The other useful feature of parallel, which is sort of what I started this post about, is the fact that it runs things in parallel… so if you consider the 6 service commands above, executing all 6 could take 45 seconds, but by using parallel you can significantly reduce the time it takes.

A real world example

Here is a sample of the smoke test script we’re running, which makes use of MCollective with the nrpe plugin. This is a quick script my boss knocked up, and I decided to steal some of it as an example; thanks Steve.

#!/bin/bash

env=$1

function check
{
  echo -e "\nRunning $1 nagios check\n" 
  mco nrpe $1 -I /^$2/ 
}

# Arrays!

Common[0]='check_disk_all'
Common[1]='check_uptime'
Common[2]='check_yum'
Common[3]='check_ssh_processes'
Common[4]='check_load_avg'
Common[5]='check_unix_memory'
Common[6]='check_unix_swap'
Common[7]='check_zombie_processes'

Web[0]='check_apache_processes'

for i in "${Common[@]}"
do
  check $i $1server1
  check $i $1server2
  check $i $1server3
  check $i $1server4
done

# Check the Web nodes

for i in "${Web[@]}"
do
  check $i $1server1
done

Now, the script above is abridged but the time is from the full script, and it takes 2 min 36 seconds. Well, the parallel script is a little faster at 1 min 25 seconds, not bad.

Here is a sample of the same code as above, but using parallel:

#!/bin/bash

env=$1

# Common checks
parallel mco nrpe {1} --np -q -I /^${env}{2}/ ::: check_disk_all check_uptime check_yum check_ssh_processes check_load_avg check_unix_memory check_unix_swap check_zombie_processes ::: server1 server2 server3 server4

parallel mco nrpe {1} --np -q -I /^${env}{2}/ ::: check_apache_processes ::: server1

It is not really pretty but it is quick.

parallel has all of the features of xargs too, but with the added bonus of running in parallel, so you can get these massive time savings if needed. The only thing to bear in mind is that it really is parallel; that sounds like a silly thing to mention, but it does mean the service example above would have to be done in two parts to ensure the stop was done before the start.
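Something like this, splitting the earlier one-liner into two runs so every stop has finished before any start begins:

# Two separate runs: parallel waits for all jobs in the first run to finish
parallel service {1} stop ::: tomcat6 httpd nscd
parallel service {1} start ::: tomcat6 httpd nscd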

Hopefully that’s proved a little interesting, enough to play with and make something truly amazing happen, Enjoy.

Apache URL encoding

This was a little annoying…

I came across an interesting Apache quirk the week before last. It totally makes sense why it happens, but I was at first a little surprised: firstly because no one had noticed it previously, and secondly because it was happening at all.

We noticed that if a URL like http://bob/file.php?id=$frank went to Apache, then the dollar symbol got encoded, which is perfectly normal behaviour: it sees a special character, it deals with it. In our case this was being triggered by a URL redirect from http to https. Something I thought was odd, and never got to the bottom of, is why it did the re-write of the character at all; if the http to https rewrite rule was not there it just passed the URL through, so it is a by-product of the rewrite.

This in itself is fine, other than that manipulation of the URL should probably be an option to turn on rather than off, but I guess that depends on how popular it is. Either way, it can be stopped by simply telling Apache not to encode the URL, with the [NE] flag on the end of the rule.
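For reference, a typical http to https redirect with the NE flag added looks something like the following; treat it as a sketch to adapt to your own vhost rather than our exact rule:

# Redirect http to https; [NE] (noescape) stops the rewritten URL being re-encoded
RewriteEngine On
RewriteCond %{HTTPS} off
RewriteRule ^/(.*) https://%{HTTP_HOST}/$1 [R=301,NE,L]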

The annoying element of all this is that no one noticed an issue: the application is able to un-encode a URL and to work with the non-encoded URL, and yet still things were not quite right.

It turned out with a bit of digging that if you sent in a URL of http://bob/file.php?id=%24frank apache ended up encoding the encoded URL resulting in a URL that looked like this – http://bob/file.php?id=%2524frank

I can understand that Apache doesn’t know the URL is encoded already, but considering we only send out URLs with $ in them, what on earth was causing it to go horribly wrong?

A bit of digging

It turned out that a certain web-based email service thought the best thing it could do to all URLs was re-encode them for you.

For example, Hotmail re-encoded the %24 in the link to %2524, whereas Gmail left the URL untouched.

Not sure why our good friends at Microsoft decided it was a good idea to change people’s URLs; there probably is a reason, but I’d like to think that Gmail is as complicated as Hotmail and they seem to have found a solution.

A lot of people spent a lot of time working out how this issue occurred, but nonetheless it is resolved. I do feel a bit silly for not spotting the double encoding myself, but at least now I know, and you know, that Hotmail re-encodes URLs and Gmail does not.

Open files – why limits are set

Everything has a tolerance…

At some point everything has a boundary to its optimal operation, be it memory allocation, someone’s temper or the speed of a car. As with all of these things the limits are set for a reason: you can drive a car at 150+ mph all the time, that’s fine, but the brakes will wear quicker and the engine will wear quicker. The same can be applied to someone’s temper: you can push and prod them so much, and eventually they will snap and bite your head off.

Sometimes it can be useful to understand what these tolerances are, and once they are understood you can exploit them; but if you don’t understand them you should not be changing the details. One of my pet peeves is people that say “Ergh, why is it limited to 1024 files, it needs more than that, set it to unlimited!” Sigh, why do these people exist in this world?

Some background

So, web apps such as Tomcat require the use of file handles; in fact everything that runs under Linux does, and without file handles life becomes difficult. To get a better picture of this, rather than searching for file handles, try searching for file types; you’ll soon see there’s a lot that file handles have to do. As a result, whenever any application gets turned on, in this case Tomcat, it needs to consume a number of these file handles to be able to operate.

When a connection comes into Tomcat, that consumes a file handle; that handle will remain open until it is closed, the application that initiated it is killed, or some timeout occurs (I think it’s 2 hours but I’m not sure…). While this connection is active, any subsequent work, for example reading a config file, would trigger another file handle, a network poll would trigger another, and so on. So at any point when Tomcat is doing something it is consuming file handles. This is fine, it is normal use; we’ll come back to that later…

So with all this in mind, when the application does stuff… it needs resources… it’s kind of simple in the way it goes upwards. What seems to escape people is that these resources, much like memory, need to be freed again. You ask for a resource, you use it and you give it back; what happens when you don’t give it back? Well, it waits until a timeout or the application is killed. This is a resource leak, which can lead to interesting things from a security point of view, and from an operational point of view it can stop the server from responding, or at least it would if the kernel didn’t have a hard limit in it. Either way, your box could go horribly wrong if this isn’t controlled.

So, by this point it should be making sense why you would put limits on how many file handles you want each application to use. Which brings us onto the default of 1024. Why 1024 and not 1000? Every file descriptor takes some memory, and 1024 is a power of two, which sits more naturally with how the computer stores and allocates things than a round decimal number like 1000. Moving along, I’d like to say how many k each file descriptor takes, but I don’t know the answer (my assumption is 4k, but that’s an assumption); either way it is a resource, and you can think of it as having physical mass.

Let’s increase the limits

Cool, by this point you have understood a bit of background on file descriptors. Now someone says “Let’s increase the file limit to 2048.” That’s twice what it was, but not unreasonable. However, you should still have an understanding as to why it was 1024 and why it now needs to be 2048; if you don’t, you are just throwing file descriptors at it because no one knows… this is bad.

Potentially, the application could be leaking file descriptors because someone forgot to close a connection or the application doesn’t handle the close connection in the expected way.

So a sensible thing to do is ask how many are needed; in most cases someone can come back with “in our tests we saw it get to X”, which is a good point to start from.

But setting it to unlimited is bad. If the argument comes back that it needs lots, or that it’s dynamic, etc… rubbish; it’s a computer program, it does exactly what it’s told to do, it will only ever open a certain maximum number of descriptors for one triggered event, and it can only handle so many events.

Imagine a file descriptor as a brick. If you asked a builder to build you a house, and you asked how many bricks he would need, I’m going to assume he would come back with a number based on experience and some maths to calculate how many bricks were needed.

I certainly wouldn’t expect him to say he needs all the bricks in the world, there’s not enough space on the building site to store all the bricks. Sure you can stack them up to a point, then it will fall over; the same happens with the OS when you set it to unlimited. Luckily the OS takes some precautions on this by hard limiting the number of open files to a number that is less than the total memory available on the computer. In some cases it restricts it even further.

Summary

So in short, if someone says set it to unlimited they are probably trying to avoid doing work, either in working out what it is using or in fixing a leaking file descriptor problem; these people need to wake up. It takes the stability of a system from something measurable to an unknown, which is not good.

If you find yourself in a situation where it’s out of your control, try to get to the point where the file descriptors are monitored; you can use that to work out an average and some tolerances for what is considered normal usage, and then wait for the application to crash…
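If you want a quick way to keep an eye on it, the numbers are easy enough to pull out of /proc on a Linux box; $PID below is a placeholder for whichever process you care about:

# System-wide limit and current allocation
cat /proc/sys/fs/file-max
cat /proc/sys/fs/file-nr

# Open descriptors for one process (replace $PID with your tomcat pid)
ls /proc/$PID/fd | wc -l

# The limits that process is actually running with
grep 'open files' /proc/$PID/limits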

DNS results in AWS aren’t always right

A bit of background

For various reasons that are not too interesting, we have a requirement to run our own local DNS servers that simply hold the forward and reverse DNS zones for a number of instances. I should point out that the nature of AWS means this approach is not really ideal, specifically if you are not using EIPs, and there are better ways; however, thanks to various technologies it is possible to get this solution to work, but don’t overlook the elephant in the room.

What elephant?

A few months ago, while doing some proof of concept work, I hit a specific issue relating to RDS security groups, specifically where I had added the security group that my instance was in to the RDS security group to grant it access to the DB. One day, after the proof of concept had been running for a number of weeks, access to the DB suddenly disappeared for no good reason, and it was noticed that adding the public IP of the instance to the RDS security group restored access; odd. The issue happened once and was not seen again for several months, then it came back; odd again. Luckily the original ticket was still there, and another ticket with AWS was raised, to no avail.

So, a bit of a diversion here: if you are using Multi-AZ RDS instances you can’t afford to cache the DNS record, as at some random moment it may flip over to a new instance (I have no evidence to support this, but also can’t find any to disprove it), so the safest way to get the correct IP address for the DB instance is to ask Amazon for it every time. You can’t simply take whatever the last IP returned was and set up a local hosts file or a private DNS record for it; that’s kind of asking for trouble.

So we had a DNS configuration that worked flawlessly 99.995% of the time, and at some random, unpredictable time it would flake out; just a matter of time. As everyone should, we run multiple DNS servers, which made tracking down the issue a little harder… however, eventually I did. The results we got back depended on which of our name servers the instance went to, and how busy AWS’s name server was when whichever of our name servers queried it. Occasionally one of the name servers would return the public IP address for the RDS instance, causing the instance to hit the DB on the wrong interface, so the mechanism that does the security group lookup within the RDS security group was failing; it was expecting the private IP address.

The fix

It took a few minutes of looking at the DNS server configuration, and it all looked fine; if it had been running in a corporate network it would have been fine, but it is not, it’s effectively running in a private network which already has a DNS server running split views. The very simple mistake that was made was the way the forwarders had been set up in the config.

See the following excerpt from the BIND documentation:

forward
This option is only meaningful if the forwarders list is not empty. A value of first, the default, causes the server to query the forwarders first, and if that doesn’t answer the question the server will then look for the answer itself. If only is specified, the server will only query the forwarders.

The forward option had been set to first, which for a DNS server in an enterprise is fine: it will use its forwarders first, and if they don’t respond quickly enough it will look up the record on the root name servers. This is typically fine, as when you’re looking up a public IP address it doesn’t matter; however, when you’re looking up a private IP address against a name server that uses split views it makes a big difference in terms of routing.

What we were seeing was that when the AWS name servers were under load or not able to respond quickly enough, our name server got a reply from the root name servers, which were only able to give the public IP address. Therefore our instance routed out to the internet, hit Amazon’s internet router, turned around and hit the public interface of the RDS instance on its NAT’d public IP, and thus was not seen as being within the security group. Doh!

Luckily the fix is easy: set it to “forward only” and ta-daa. It may mean that you have to wait a few milliseconds longer now and then, but you will get the right result 100% of the time. I think this is a relatively easy mistake for people to make, but it can be annoying to track down if you don’t have an understanding of the wider environment.
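In BIND terms the change is a one-liner in the options (or forward zone) block; the forwarder address below is just a placeholder for whatever resolver your VPC provides:

options {
    // placeholder forwarder - the AWS-provided resolver for your VPC
    forwarders { 10.0.0.2; };
    forward only;
};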

Summary

Be careful, if you’re running a DNS server in AWS right now I suggest you double check your config.

It is probably also worth learning to use “nslookup <domain> <name server ip>” to help debug any potential issues with your name servers. But be aware that, because of the nature of the problem, you are not likely to see this for a long, long time; seriously, we went months without noticing any issue, and then it just happens, and if you’re not monitoring the solution it could go unnoticed for a very long time.

Sentinel – An open source start

An open source start

Last week I introduced the concept of self healing systems, which then led me on to have a tiny bit of a think, and I decided that I would write one. The decision took all of 5 minutes, but it gives me an excuse to do something a bit more complex than your everyday script.

I created a very simple website here which outlines my goals. As of writing I have got most of the features coded up for the MVP, and I do need to finish it off, which will hopefully be by the time this is published, but let’s see.

I decided to take this on as a project for a number of reasons:

  1. More ruby programming experience
  2. Other than Monit there don’t seem to be any other tools, and I had to be told about this one…
  3. It’s a project with just the right amount of programming challenge for me
  4. I like making things work
  5. It may stop me getting called out as often at work if it does what it’s meant to

So, a number of reasons, and I’ve already come across a number of things that I don’t know how to solve or what the right way of doing them is. Which is good: I get to do a bit of googling and work out a reasonable way, but to be honest that is not going to be enough in the long run. Hopefully as time goes on my programming experience will increase sufficiently that I can make improvements to the code as I go.

Why continue if there’s products out there that do the same thing?

Why not? Quite often there’s someone doing the same thing even if you can’t find evidence of it, competition should not be a barrier to entry, especially as people like choice.

I guess the important thing is that it becomes usefully different. Take a look at systems management tools, a personal favourite of mine: you have things like RHN Satellite, Puppet and Chef, three tools, one very different from the other two and those two only slightly different from each other. People like choice, and different tools work differently for different people.

I guess what I mean by that is that some people strike a chord with one application or another and as a result become FanBoys, normally for no good reason.

There’s also the other side of it: I’ve not used Monit, I probably should, I probably won’t; but it doesn’t sound like where I want to go with Sentinel. Quite simply, I want to replace junior systems administrators. I don’t want another tool to be used, I want a tool that can provide real benefit by doing the checks on the system, by making logical, deterministic decisions based on logic and raw data, and not just by looking at the system it’s on but by considering the whole environment of which it is part. I think that is a relatively ambitious goal, but I also think it is a useful one, and hopefully it will get to a point where the application is more useful than the MVP product and can do more than just look after one system.

Like any good open source product it will probably stay at version 0.X for a long time, until it has a reasonable set of features in it that make it more than just a simple ruby program.

A call for help

So I’ve started on this path, and I intend to continue regardless at the moment; one thing that will help keep me focused is user participation, either through using the script or logging bugs at the github site it’s hosted on.

I think at the moment what I need is some guidance on the programming of the project. It’s clear to me that in a matter of months, if not weeks, this single file application will become overly complicated to maintain and would benefit from being split out into classes. Although I know that, I do not know the right way of doing it; I don’t have any experience of larger applications, so if anyone knows the right way to do it that would be good!

In addition to the architecture of the application there are just some programming issues which I’m sure I can overcome at some point, but I will probably achieve the solution by having a punt and seeing what sticks. There’s a wonderful switch in the code for processor states which needs to change: I need to iterate through each character of the state and report back on its status, whereas at the moment it is just looking for a combination. To start with I took the pragmatic option: add all of the processor states my system has to the switch and hope that’s enough.
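For what it’s worth, the sort of thing I mean looks roughly like the sketch below: iterate over each character of a ps state (e.g. “Sl+”) rather than switching on the whole combination. The mapping here is abridged and purely illustrative, not what’s in the code today:

# Sketch: describe each character of a ps state string individually
STATE_MEANINGS = {
  "R" => "running",
  "S" => "interruptible sleep",
  "D" => "uninterruptible sleep",
  "Z" => "zombie",
  "T" => "stopped",
  "s" => "session leader",
  "l" => "multi-threaded",
  "+" => "foreground process group",
}

def describe_state(state)
  state.each_char.map { |c| STATE_MEANINGS.fetch(c, "unknown flag #{c}") }
end

puts describe_state("Sl+").join(", ")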

So if anyone feels like contributing, or can even see a simple way of fixing some dodgy coding, I’d appreciate it. I guess the only thing I ask is that if you are making changes, see the README, log a ticket in github and commit the changes with a reference to the ticket, so I know what’s happened and why.

So please, please, please get involved with Sentinel

Self healing systems

An odd beginning

So I’m writing this having just spent the last 10 days on my back in pain, finally starting to feel better. It’s not come at a good time, as another member of the same team as me decided they had a “better opportunity”. This is the second person to have left this organisation without as much as a passing comment to myself that they were even leaving; how strange. But I digress.

Either way it opens up a void: a team of 2 and a manager, now down to a team of one, with the one having back pain that could at any moment take me out of action. Unfortunately, right up to the day before I was unable to make it to work, the system we look after had been surprisingly stable, rock like in fact; as soon as I said “I’m not going to make it in” the system started having application issues (JVM crashes).

Obviously the cause needs a bit of looking into and a proper fix etc, but in the meantime what do we do? I had an idea, a crazy idea, which I don’t think is a fix to any problem but is at least a starting point.

Sentinel

I had spent a bit of time exploring Ruby a few weeks back, so I started to look at ways of writing something that would do a simple check: is process X running? The simple version I wrote just checked that tomcat was running the right number of instances (our application always runs 2): if it was 2, do nothing; if it was more than 2, do nothing (something else crazy has happened, so it just logs to that effect); but if it was less than 2 it would try a graceful-ish restart of the service.
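Stripped right down, the logic was not much more than the sketch below; the process match, expected count and restart command are placeholders rather than the real script:

# Minimal sketch of that first check: count tomcat processes and react
expected = 2
running  = `ps -C java -o cmd= | grep -c tomcat`.to_i

if running == expected
  # all good, nothing to do
elsif running > expected
  puts "More tomcats than expected (#{running}) - logging and leaving it alone"
else
  puts "Only #{running} tomcat(s) running - attempting a restart"
  system("service tomcat6 restart")
end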

So this obviously works in the one specific case that we have, but it isn’t extensible and it doesn’t do any better checks, which all got me thinking: why isn’t there something to do this for us? I don’t know of anything that does, and if anyone does I’d appreciate knowing; there are a number of tools that could be muddled together to do the same sort of function.

Nagios would monitor the system, Cucumber could monitor the application interactions, Swatch could monitor the logs; but in most cases these just do monitoring. I’m sure there are ways to get them to carry out actions based on certain outcomes, but why use so many tools?

Yes, the individual tools probably do the job better than a single tool, but as a sysadmin I’d rather have one tool to do everything, though that isn’t practical either. So can we somehow get the benefits of monitoring with Nagios, but have a tool that is specifically monitoring the application performance Nagios is gathering information about and then making decisions based on that?

The big Idea

So I wonder if it’d be possible to write a simple ruby application that every now and then did a number of actions:

  1. Check the service on the box, right number of processes, not zombied etc, etc
  2. Check the disk capacities
  3. Check the CPU utilisation
  4. Check the memory utilisation
  5. Probe the application from the local box, a loopback test of sorts
  6. Integrate with nagios or another monitoring tool to validate the state it thinks the box is in compared with the locally gathered stats
  7. Depending on the outcome of all the checks carry out a number of actions
  8. Hooks into ticketing systems

When I was thinking this through the other day it seemed like a good idea. The biggest issue I have is not being a programmer, so I have a steep learning curve, and it’s a complicated application, so it requires some thought. I would also probably have to ignore everyone that thinks it is a waste of time, which isn’t too hard to do.

I guess what I’m thinking of is something like FBAR. As a system scales up to hundreds of servers, uptime and reliability become more important, and it is sometimes necessary to take a short term view to keep a system working. The most important thing is that those short term fixes are totalled up and then logged as tickets: 1% of your servers crashing and needing a restart isn’t an issue, but if that 1% becomes 5% and then 10% it’s panic stations!

Summary

I think my mind is made up, a sentinel is needed to keep watch over a solution, and what’s crazy is that the more I think of it the more useful it seems and the more complicated it seems to become. As such I think I’m going to need help!