Automate to survive

Everyone has a choice

Automate or die. That is pretty much it: you can automate everything, or you can keep working with manual processes that slow you down. If you don’t think you have the time to automate, you’re wrong; you need to automate, and do it quickly, before you get even busier and fall even further behind. Maybe you think you can’t automate because you don’t have the time to do it justice? Maybe you can’t automate because the task is too big? Too complicated? Well, it’s all rubbish.

Start small

This is a bit like eating an elephant: you have to start somewhere, and you have to start small. By all means try to start big if you want, but smaller is better. Maybe you have a task to check a site for new packages once a month; that is a good place to start, pulling third-party packages from vendors’ sites into your yum repo, or maybe every time you build a server you need to do x, y and z. These sorts of tasks are achievable for everyone, even those without a strong programming background, which leads on to language choice. Not all languages are equal, but knowing two or three is better than knowing just one. At a minimum you want some sort of shell language, so Bash, ash, sh or ksh, and a ‘proper’ object-oriented language like Ruby, Python or Perl. The shell languages are good for capturing what you do on a terminal in a reproducible and consistent format, but they are terrible for manipulating multiple data sources or mangling data, although with that said you can still do some complicated things with them.

Once you start building up many smaller components of automation start looking at ways of linking it all together so that a series of tasks becomes one. It is this constant cycle of simplifying the process to automate the small chunks and then linking the small chunks together that make an automated system.

Grow large

Over a year ago we used to deploy our environments with Puppet and CloudFormation, and it used to take about 2.5 days to complete and get working; that whole process is now down to 10 minutes thanks to automation. It required many leaps of faith, many poor decisions and a lot of bug fixing, but it got there through simplifying everything down and then automating each component. Other than building the servers and tagging them with the appropriate keys in Amazon, the whole process is controlled by bash right up to the point of a working system, and it is typically very robust. That is a massive time saving, but to get there we had to fail, we had to try and we had to persist.

As a result we now automate large portions of the architecture, to the point where our time is split between incidents and project work to implement new features, with hardly any daily grind. Recently I have been working on our DR strategy, taking it to the point of clicking a button to deploy a clean environment built from the ground up which automatically pulls the latest backups and restores them into the environment. It is now done, and it saves hours of time building out a DR environment, which makes the recovery time shorter and the process easier. So larger projects are perfectly achievable with the right attitude!

Summary

Give it a go: start small and work up to it, but be unrelenting and do whatever it takes, no matter how much you disagree with it; just do it to get it automated. Once more is automated you’ll have some time to fix it up properly, or you’ll need to extend it and can make a small part better then.

Oh no, not java

How strange

Over the past few months I’ve been writing more and more applications to help maintain and deliver the services we run, from metric gathering to regional DR and everything in between. For a while now one of the developers at Alfresco has been writing a framework that makes it easier to write Selenium tests for Alfresco Share, which takes a lot of the hassle out of looking for certain elements or class IDs, or updating everything if the UI changes. We have been talking about it for a couple of months, and today I decided to set aside some time to look at it and ask loads of silly questions about Eclipse and Maven and so on and so forth.

It took about 3 hours to get everything set up and working; most of the time was spent learning to use Eclipse and Maven, with a walkthrough of what the framework can do, how to extend it and how to do stuff with it. Considering I hadn’t done any Java for 6 years it wasn’t that bad, and within 15 minutes of being left to it I had made a class that logged in and searched for content inside the repo.

One of the reasons we’re so interested in it is that, as DevOps, we like simple things and it takes a lot of the hassle out; it means we also get to do some complicated things with Share while only having to worry about what we want to test or measure. All of this got me thinking about the languages we use and the problems they solve.

Right tool, Right job

Currently in our team we are using Bash, Ruby, Python and Java. Bash is simple and can achieve some good results, although it is typically quite slow; if it is a short script it will usually end up in Bash. We also do our orchestration in Bash, and it manages everything from bare metal to a working OS by triggering whatever apps we need or setting config.
Ruby is my language of choice when I need to manipulate data, retry actions or do anything that is more than procedural, and you can rely on it to do a good job in a reasonable time.
Python is new to the team; it fills a gap in that it is as easy to write as Ruby but scales better as things grow. I haven’t done any Python yet so I can’t really comment, but the web app that has been built with it in a couple of weeks is quite impressive. Java is more complicated and harder to write, but it can support more complex apps; typically, though, I’m not sure you need to make apps that complex.

So I’m not a fan of Java, mainly because I think it takes a long time to get anything of value out of it, especially on a small task. If I had to write an application to manage backups I would not go to Java, as it’s like using a bazooka to hit a fly; likewise using Bash is like using a feather duster, whereas Ruby and Python fit nicely in the middle. Well, after today’s experience I’m glad I’m doing it in Java. I would have spent weeks making something half as good in Ruby just to avoid using Java, and I guess it’s not really that bad.

I could have wasted time doing it all from scratch or just taken what’s already written, so I stole, like any good DevOps guy would.

Summary

I’m probably going to spend some more time in Java over the next week writing something a bit more useful than today’s experiment, so hopefully I will still be optimistic about it all. Maybe I’ll remember why I don’t like Java, or maybe I’ll change my opinion, who knows!

You have to stick with these things

Hang in there

Just over a year ago I decided that I had, at best, a mediocre online presence. Sure, if you search for my handle all sorts of stuff turns up and most of it is me, but in this age where the internet is everlasting I didn’t want my previous 10 years of internet to be the defining mark I left on the world. So with that I decided to start annoying people. Originally I planned to do two blog posts a week, one techy and one non-techy; well, quite frankly that’s hard work, but it lasted for several months. Now I have a much more leisurely one-post-a-week schedule, and that seems to be working out better.

When I first started I remember being relatively happy with a small number of visitors; well, over the last year I have grown my blogging empire quite significantly, and the best part of all this is the statistics: it is nicely measurable. Having that feedback on which articles work and which ones don’t is handy. It doesn’t stop me writing the ones no one likes, but at least I’m aware they will be less liked before I write them.

Statistics

Here are some numbers to make things more interesting.

Visits per month

Month     Visitors
Feb 2012 132
Mar 2012 167
Apr 2012 167
May 2012 284
Jun 2012 387
Jul 2012 407
Aug 2012 460
Sep 2012 491
Oct 2012 728
Nov 2012 1323
Dec 2012 1115
Jan 2013 1572

Up until September I was thinking how the progress was slow but steady, and I was a little disappointed, and then bang, much better! I was talking to my boss a couple of months back and he mentioned how Google seem to sit you in a sandbox for a few months before they really let things go, and that’s sort of what it seems like here. I didn’t write any killer articles that suddenly caused a spurt; I may have got a link back from Alfresco.com, but I didn’t know about that until November, when my boss decided to tell everyone. Oh well, better than employing a marketing person :)

I now hover around 300 visitors a week to the site, which is still not a large number, but if only 1 percent of those actually read an article it makes it worthwhile!

One of my articles, rhn satellite vs puppet, a clear victory?, has almost managed 200 views by itself, but there are a few others in there doing okay, and some I thought would do better, so here’s a chance to show some of those.

The 5 posts that should have done better!

  1. Self healing systems
  2. What university forgot to mention about programming
  3. Cloud deployment 101 – Part3
  4. Cloud deployment 101 – Part1
  5. Who burnt down the building?

There are quite a few more stats that could be shared but they are not that interesting really… So, back to the point of why I started the blog: to leave a bit more of an impression. Well, I think I have done that. I have some good referrals that make it back to me; here are a couple that always surprised me.

And one that came up today is me being quoted! Here. So when I look back over the year and see what has been achieved, despite a slow start I’m glad I stuck it out; hopefully when I’m doing this next year I’ll be talking about the news article I had to write or the TV show I appeared on… Well, we still need to have dreams, else what’s the point!

So I wonder what next year will bring if I keep on blogging and you people keep on reading, because let’s face it, if it wasn’t for you I would just be watching TV or something, so thanks!

Distributed Puppet

Some might say…

Some might say that running Puppet with a server is the right way to go; it certainly provides some advantages like PuppetDB, stored configs, exported resources and so on, but is that really what you want?

If you have a centralised Puppet server with 200 or so clients, there are some fancy things that can be done to ensure that not all nodes hit the server at the same time, but that requires setting up and configuring additional tools and so on…

What if you just didn’t need any of that? What if you just needed a git repo with your manifests and modules in it, and Puppet installed?
Have a boot script download and install Puppet, pull down the git repo and then run Puppet locally. This method puts a little more overhead on each node, but not much; it had to run Puppet anyway, and in all cases this can still provide the same level of configuration as the server/client method. You just need to think outside of the server.
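
In essence, a masterless run is little more than a clone and a local apply. Something like this rough sketch, where the repo URL and paths are made up for illustration:

#!/bin/bash
# Masterless Puppet sketch -- repo URL and layout are hypothetical
yum install -y puppet git

git clone https://github.com/example/puppet-config.git /opt/puppet-config

# Apply the manifests locally using the modules from the repo
puppet apply --modulepath=/opt/puppet-config/modules /opt/puppet-config/manifests/site.pp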

Don’t do it, it’s stupid!

That was my response to my boss some 10 months ago when he said we should ditch Puppet servers, ditch manifests per server and outlaw all variables. Our mission was to be able to scale our service to over 1 million users, and we realised that manually adding extra node manifests to Puppet was not sustainable, so we started on a journey to get rid of the Puppet server and redo our entire Puppet infrastructure.

Step 1 Set up a git repo – You should already be using one; if you aren’t, shame on you! We chose GitHub: why do something yourself when there are people out there dedicated to doing just that one thing and doing it better? Spend your time looking after your service, not your infrastructure!

Step 2 Remove all node-based manifests – Replace them with a manifest per tier / role. For us this meant consolidating our prod web role with our QA, test and dev roles so there was just one role file regardless of environment. This forces the environment-specific bits to be managed as variables.

Step 3 Implement Hiera – Hiera gives Puppet the ability to externalise variables into configuration files, so we now end up with a configuration file per environment and only one role manifest. This, as my boss would say, “removes the madness”. Now if someone asks “what’s the difference between prod and test?” you just diff two files, regardless of how complicated you want to make your manifests, inherited or not. It’s probably worth noting you can set default values for Hiera lookups… hiera(“my_var”,”default value”)
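
To make that concrete, with a made-up hieradata layout the prod/test question really is just a file diff:

# Hypothetical layout: one YAML file of variables per environment
ls hieradata/
# dev.yaml  prod.yaml  qa.yaml  test.yaml

# "What's different between prod and test?"
diff hieradata/prod.yaml hieradata/test.yaml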

Step 4 Parameterise everything – We had lengthy talks about parameterising modules vs just using Hiera, but to help keep the modules transparent to whatever is calling into them (and because I was writing them), we kept parameters. I did, however, move all parameters for every manifest in a module into a “params.pp” file and inherit that everywhere to re-use the variables; within each manifest a parameter always defaults to the params.pp value or is left blank (to make it mandatory). This means that if you have sensible defaults you can set them there and reduce the size of your Hiera files, which in turn makes it easier to see what is happening. Remember, most people don’t care about the underlying technology, just the top-level settings, and trust that the rest is magic… For the lower-level bits see these: Puppet with out barriers part one for a generic overview, Puppet with out barriers part two for manifest consolidation and Puppet with out barriers part three for params & Hiera.

This is all good, but what if you are in Amazon and you don’t know what your box is? Well, it’s in a security group, but that is not enough information, especially if your security groups are dynamic. You can also tag your boxes, and you should make use, where possible, of the AWS CLI tools to do this. We decided a long time ago to set a few details on a per-node basis: Env, Role & Name. From these we know what to set the hostname to, which Puppet manifests to apply and which set of Hiera variables to apply, as follows…
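
(As an aside, tagging a box with those three keys using the same EC2 API tools as the fact script below looks roughly like this; the instance ID and values are made up.)

# Hypothetical example: tag a freshly built instance with the keys we rely on
$EC2_HOME/bin/ec2-create-tags i-0123abcd \
        --tag Env=prod \
        --tag Role=web \
        --tag Name=prod-web-01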

Step 5 Facts are cool – Write your own custom facts for Facter. We did this in two ways. The first was to just pull the tags down from Amazon (where we host) and return them as ec2_<tag>; this works, but AWS has issues so it fails occasionally. Version 2 was to get the tags, cache them locally in files, and have Facter pull them from those local files… something like this…

#!/bin/bash
# Load the AWS credentials / environment for the EC2 API tools
source /tmp/awsconfig.inc

# Cache this instance's tags locally so Facter can read them even when AWS is flaky
mkdir -p /opt/facts/tags

# Iterate over the tags for this instance, one per line (key<TAB>value)
IFS=$'\n'
for i in $("$EC2_HOME/bin/ec2-describe-tags" --filter "resource-type=instance" --filter "resource-id=`facter ec2_instance_id`" | grep -v cloudformation | cut -f 4-)
do
        key=$(echo "$i" | cut -f1)
        value=$(echo "$i" | cut -f2-)

        # Only write the file if the tag actually has a value
        if [ -n "$value" ]
        then
                echo "$value" > "/opt/facts/tags/$key"
                /usr/bin/logger "set fact $key to $value"
        fi
done

The AWS config file just contains the same info you would use to set up any of the CLI tools on Linux, and you can turn the cached tag files into facts with this:

# Custom Facter facts: turn each cached tag file into an ec2_<tag> fact
Dir.glob('/opt/facts/tags/*').each do |file|
        key   = File.basename(file)
        value = File.read(file).chomp
        Facter.add("ec2_#{key}") { setcode { value } }
end

Also see: Simple facts with puppet

Step 6 Write your own boot scripts – This is a good one; scripts make the world go round. Make a script that installs Puppet, make a script that pulls down your git repo, then run Puppet at the end.
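
A minimal sketch of such a boot script, assuming the made-up repo from earlier and the ec2_ facts from Step 5 (the paths, repo URL and exact fact used are assumptions, not our real script):

#!/bin/bash
# Boot script sketch -- paths, repo URL and fact names are assumptions

# Install Puppet and git, then pull down the manifests and modules
yum install -y puppet git
git clone https://github.com/example/puppet-config.git /opt/puppet-config

# Cache the EC2 tags locally so the custom facts from Step 5 can read them
/opt/facts/get_tags.sh

# Use one of the tag facts as the node name so the right role manifest and Hiera data apply
puppet apply --node_name_fact=ec2_Name \
        --modulepath=/opt/puppet-config/modules \
        /opt/puppet-config/manifests/site.pp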

The node_name_fact setting is awesome, as it kicks everything into gear: any box deployed into a security group with the appropriate tags hooks itself up and becomes a fully built server.

Summary

So now Puppet is on each box, and every box, from the time it’s built, knows what it is (thanks to the tags) and bootstraps itself into a fully working box thanks to your own boot script and Puppet. With some well-written scripts you can cron the git pull and a re-run of Puppet if so desired. The main advantage of this method is the distribution: as long as a box manages to pull that git repo it will build itself, and if something changes on the box Puppet will put it back, because everything is held locally so there are no network issues to worry about.
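
If you do want the periodic git pull and re-run, a single cron entry is enough; the schedule and paths here are just placeholders:

# /etc/cron.d/puppet-local -- hypothetical schedule and paths
# Every 30 minutes: refresh the repo and re-apply the local manifests
*/30 * * * * root cd /opt/puppet-config && git pull -q && puppet apply --modulepath=modules manifests/site.pp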