Flexible monitoring, going up and down

The other day…

I wrote a post the other week about how much monitoring sucks, and a number of people on the internet (hello, people) just didn't get it, so I thought more detail would be good. One point that was raised was about servers scaling up and down and how that affects the monitoring platform. I want to cover this specifically because it is an important part of why I said I think Dataloop.IO is the answer.

Nagios + Puppet

Let's look at a typical Puppet / Nagios approach. Puppet has the concept of exported resources: an exported resource can be collected by another server and then actioned. A neat trick is to have a manifest that describes a webserver and looks like this:

# /etc/puppetlabs/puppet/modules/nagios/manifests/target/apache.pp
class nagios::target::apache {
   @@nagios_host { $fqdn:
        ensure => present,
        alias => $hostname,
        address => $ipaddress,
        use => "generic-host",
   }
   @@nagios_service { "check_ping_${hostname}":
        check_command => "check_ping!100.0,20%!500.0,60%",
        use => "generic-service",
        host_name => "$fqdn",
        notification_period => "24x7",
        service_description => "${hostname}_check_ping"
   }
}

The double @ tells Puppet to send the resource to the Puppet database, where something looking for it can pick it up later; the manifest above is the entire configuration needed to define a host and add a ping check. Once the resource is exported it waits in the Puppet database until it is collected. The collection looks like this:

# /etc/puppetlabs/puppet/modules/nagios/manifests/monitor.pp
class nagios::monitor {
    package { [ nagios, nagios-plugins ]: ensure => installed, }
    service { nagios:
        ensure => running,
        enable => true,
        #subscribe => File[$nagios_cfgdir],
        require => Package[nagios],
    }
    # collect resources and populate /etc/nagios/nagios_*.cfg
    Nagios_host <<||>>
    Nagios_service <<||>>
}

The spaceship operator (<<||>>) tells Puppet to look for that resource type among the exported resources in the Puppet database, in this case Nagios_host or Nagios_service. This is cool: it means a server that previously had no information about another can now do something useful with the specific information that server provides. This is a good fit for adding new hosts or service checks to Nagios, so let's look at how you remove them next:

N/A

Seriously… If you want to remove a host you have to reconfigure it in Puppet so it no longer exports the resources, then purge the previous exports from the database, then re-run Puppet on the Nagios server so it re-adds every resource again except the one you removed… sounds fun. You could probably make it work if you knew in advance that the server was going to be shut down. If you don't believe me, see this. That's as good as it gets, sorry.

The real problem

With the uptake of utility-based computing, servers come and go and we should no longer be precious about them. I always give the same answer when someone in the team asks what we should call a new server.

These are farm animals, not pets.

What do I mean by that? Well, I don't care what it's called or even whether it exists; if it causes me any problems I will shoot it in the head and get a new one. Let's look at webservers in an auto scaling group: sometimes I have 3, sometimes 3000. Trying to manage that flexibility in Puppet will work for scaling up, and I'm sure there's a way to manage the scale down (if anyone has one I'd be interested in hearing it).

So why is Dataloop.IO better? Well, I think it's better because I can draw a simple hierarchy in the web UI, take a tag, say 'web', and add it to the 'web servers' service. When I install Dataloop.IO using Puppet or Chef or the setup.sh method I only have to provide a few details: an API key and an optional tag or list of tags. Assuming the configuration is done correctly there will be a 'web server' role that all web servers collect from; I just put the tag in there and, hey presto, the servers connect to Dataloop.IO in the right container and then download all of their checks. Let's cover a few examples:

name "web"
description "Web server Role for configuring servers"
run_list(
  'recipe[apache]',
  'recipe[dataloop]'
)
default_attributes({
  "dataloop" => {
    "agent" => {
      "api_key" => "someapikey",
      "tags" => "web"
    }
  }
})

I made this more verbose on purpose; in reality Dataloop.IO should be included in a base role with a simple override of the tags attribute here. The above is the entire configuration needed to have servers dynamically pick up all of their checks, spin up and down, and de-register themselves from the central service as needed, so you only ever have servers in Dataloop.IO that are actually turned on. "So what happens when the power is yanked?" I hear you cry. Well, you get an alert, as you'd expect; it only de-registers when the server is shut down cleanly, not when the power cord is yanked out.

Let's look at the bash equivalent. Say you need a server to have monitoring on it in the next 5 seconds:

sudo curl -s https://download.dataloop.io/setup.sh | bash -s <API_KEY> web

That achieves the same as the Chef example above. Because the configuration of the monitoring is done in Dataloop, the agents are all simple; they just need some auth to connect back in (the API key). From there you can drag them into service groups, add tags or apply whatever plugins you need. If you tag the group and apply the plugins to the tag, then as long as that tag is specified the server will get all the relevant plugins. You can also layer as many of these tags on top of each other as you like; the agent will just work it out in real time.

Summary

Yes, you can scale dynamically up and down with Nagios and Puppet or Chef, but most of these tools rely on everything being on all the time; they are not cloud centric, more enterprise, where they still name their pets… Dataloop.IO doesn't come with that sort of baggage: no firewall rules, quick and easy to set up and use, as it should be. If you're still not convinced, I understand; watch this video first:

What challenges you?

Over the last few weeks

I have been wondering what most people find challenging in the "modern" IT world. There's been a recent upsurge in tools and technology that address most problems, which leaves me wondering what gap is left. What is the current big annoying problem? Maybe it's not being able to push your architecture into multiple clouds, or having to live with the constraints of small root disk volumes; who knows? Hence the poll :)

Configuration management alone is not the answer

Everything in one place

Normally when businesses start out building a product, especially those without pre-existing knowledge of configuration management, they tend to just throw the config on the server and then forget what it is. That is fine, it's a way of life and progression, and sometimes just bashing it out can prove very valuable indeed, but typically it becomes a nightmare to manage. Very quickly there are 100 manually built servers and it's a pain in the arse, so everyone jumps into configuration management.

This is sort of phase 1: everything has become too complicated to manage, no one knows what settings are on which boxes, and more time is spent working out whether box 1 is the same as box 2. This leads to the need for some consistency, which leads to configuration management. The sensible approach is to move one application at a time fully into configuration management, not just its configuration files.

During this phase it is critical to be pedantic and get as much as possible into configuration management. If you only do certain components there will always be the question of "does X affect Y, which isn't in configuration management?" and, quite frankly, every time you have that conversation a sysadmin dies of embarrassment.

Reduce & Reuse

After getting to phase 1, probably in a hack-and-slash way, the same problems that caused the need for phase 1 happen again. With 100 servers in configuration management there are lots of environments with variables set in them, and on the servers, and in the manifests themselves, and the questions start: is that variable overriding that one? Why are there settings for variable X in 5 places, and which one wins? Granted, configuration management systems have hierarchies that determine what takes precedence, but that requires someone to keep looking through multiple definitions. On top of having variables set in multiple locations, it is probably becoming clear that more variables and more logic are needed; what was once a sensible default is now crazy.

This is where phase 2 comes in: aim to move 80%+ of each configuration into variables, have chunks of configuration turned on or off through key variables, and set sensible defaults inside a module/cookbook. That is half of phase 2; the second half, and probably the more important side, is to reduce the definitions of the systems down to as few as possible. Back in the day we used to have a server manifest, an environment manifest and a role manifest, each setting different variables in different places. How do you make sure that your 5 web servers in prod have the same config as the 5 in staging? That's 14 manifests! Why not have 1? Just define a role and set the variables appropriately; the role can contain the sensible defaults for that role, and all other variables are externalised in something like hiera, or pushed into Facter / Ohai.

By minimising the definitions of what a server should be and reducing them down to one, you can reuse the same configuration, so all of your roleX servers are now identical except for whatever variables are set in your external data store, which can now easily be diff'd.
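To make that last point concrete, here's a tiny Ruby sketch (the file paths are made-up examples, not from the original post) that compares two hiera-style YAML files and prints any keys that differ between prod and staging:

#!/usr/bin/ruby
# Minimal sketch: report keys whose values differ between two environments.
# The hieradata paths below are hypothetical examples.
require 'yaml'

prod    = YAML.load_file('hieradata/prod/web.yaml')
staging = YAML.load_file('hieradata/staging/web.yaml')

(prod.keys | staging.keys).sort.each do |key|
  next if prod[key] == staging[key]
  puts "#{key}: prod=#{prod[key].inspect} staging=#{staging[key].inspect}"
end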

build, don’t configure

By this point phases 1 & 2 are done and all is well with the world, but there are still some oddities: box X has patch level Y and box A has patch level Z, or there's some left-over hack to solve a prod issue which causes a problem on one of the servers. So treat your servers as configurable and throw-away-able. There are many technologies to help with this, be it cloud-based with Amazon and OpenStack, or VMware, or even physical servers with Cobbler. This is phase 3: build everything from scratch every time. At this point the consistency of the environment is pretty good, leaving only the data in each environment to contend with.

Summary

Try and treat configuration management as something more than just config files on servers, and be persistent about making everything as simple as possible while getting everything into it. If you're only going to manage the files you might as well use tars, and if that sounds crazy it's the same level as phase 1, which is why you have to get everything in. I realise it can seem a massive task, but start with the application stack you're running and then cherry-pick the modules/cookbooks that already exist for the main OS components like ntp, ssh etc.

Automate to survive

Everyone has a choice

Automate or die; that is pretty much it. You can automate everything or you can keep working with manual processes that slow you down. If you don't think you have the time to automate, you're wrong; you need to automate, and do it quickly, before you get even busier and fall even further behind. Maybe you think you can't automate because you don't have the time to do it justice? Maybe you can't automate because the task is too big? Too complicated? Well, it's all rubbish.

Start small

This is a bit like eating an elephant: you have to start somewhere, and you have to start small. By all means try and start big if you want, but smaller is better. Maybe you have a task to check a site for new packages once a month; that is a good place to start, pulling third-party packages from vendors' sites into your yum repo. Or maybe every time you build a server you need to do x, y & z. These sorts of tasks are achievable for everyone, even those without a good programming background, which leads on to language choice. Not all languages are equal, but knowing two or three is better than just one: at a minimum some sort of terminal language, so Bash, ash, sh or ksh, and a 'proper' object-oriented language like Ruby, Python or Perl. The terminal languages are good for turning what you do on a terminal into a reproducible and consistent format, but are terrible for manipulating multiple data sources and mangling data, although with that said you can still do some complicated things with them.

Once you start building up many smaller pieces of automation, start looking at ways of linking them together so that a series of tasks becomes one, as in the sketch below. It is this constant cycle of simplifying the process, automating the small chunks and then linking the small chunks together that makes an automated system.
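As one example of gluing small chunks together (the tool choice here is mine, not something the post prescribes), a Rakefile can turn a handful of standalone scripts into a single task with dependencies; the two shell scripts it calls are hypothetical placeholders for whatever small pieces you already have:

# Rakefile - a sketch of linking small automation chunks into one task
task :sync_packages do
  sh "/usr/local/bin/pull_vendor_packages.sh"   # placeholder small script
end

task :base_config do
  sh "/usr/local/bin/apply_base_config.sh"      # placeholder small script
end

desc "Run the whole build as one step: rake build_server"
task :build_server => [:sync_packages, :base_config] do
  puts "all build steps completed"
end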

Grow large

Over a year ago we used to deploy our environments with Puppet and CloudFormation, and it used to take about 2.5 days to complete and get working; that whole process is now down to 10 minutes thanks to automation. It required many leaps of faith, many poor decisions and a lot of bug fixing, but it got there through simplifying everything down and then automating each component. Other than building the servers and tagging them with the appropriate keys in Amazon, the whole process is controlled by bash right up to a working system, and it is typically very robust. That is a massive time saving, but to get there we had to fail, we had to try and we had to persist.

As a result we now automate large portions of the architecture, to the point where all of our time is split between incidents and project work to implement new features, with hardly any daily grind. Recently I have been working on our DR strategy, taking it to the point of clicking a button to deploy a clean environment built from the ground up that automatically pulls the latest backups and restores them into the environment. It is now done and saves hours of time building out a DR environment, which makes the recovery time shorter and the process easier. So larger projects are perfectly achievable with the right attitude!

Summary

Give it a go: start small and work up to it, but be unrelenting and do whatever it takes, no matter how much you disagree with it; just get it automated. Once more is automated you'll have some time to fix it up properly, or you'll need to extend it and you can make a small part better then.

Distributed Puppet

Some might say…

Some might say that running Puppet with a central server is the right way to go. It certainly provides some advantages, like PuppetDB, stored configs, exported resources and so on, but is that really what you want?

If you have a centralised Puppet server with 200 or so clients, there are some fancy things that can be done to ensure that not all nodes hit the server at the same time, but that requires setting up and configuring additional tools, etc. etc.

What if you just didn't need any of that? What if you just needed a git repo with your manifests and modules in it, and Puppet installed?
Have a script download and install Puppet, pull down the git repo and then run Puppet locally. This method puts a bit more overhead on each node, but not much (it had to run Puppet anyway), and it can still provide the same level of configuration as the server/client method; you just need to think outside of the server.

Don't do it, it's stupid!

That was my response to my boss some 10 months ago when he said we should ditch Puppet servers and per-server manifests, and outlaw all variables. Our mission was to scale our service to over 1 million users, and we realised that manually adding extra node manifests to Puppet was not sustainable, so we started on a journey to get rid of the Puppet server and redo our entire Puppet infrastructure.

Step 1 Set up a git repo. You should already be using one; if you aren't, shame on you! We chose GitHub: why do something yourself when there are people out there dedicated to doing just that one thing and doing it better? Spend your time looking after your service, not your infrastructure!

Step 2 Remove all manifests based on nodes and replace them with a manifest per tier / role. For us this meant consolidating our prod web role with our qa, test and dev roles, so there was just one role file regardless of environment. This forces the environment-specific bits to be managed as variables.

Step 3 Implement hiera. Hiera gives Puppet the ability to externalise variables into configuration files, so we end up with one configuration file per environment and only one role manifest. This, as my boss would say, "removes the madness". Now if someone asks "what's the difference between prod and test?" you diff two files, regardless of how complicated you want to make your manifests, inherited or not. It's probably worth noting you can set default values with Hiera: hiera("my_var", "default value").

Step 4 Parameterise everything. We had lengthy talks about parameterising modules vs just using hiera, but to keep the modules transparent to whatever is calling them (and because I was writing them) we kept parameters. I did, however, move all parameters for all manifests in a module into a params.pp file and inherit that everywhere to re-use the variables; within each manifest every parameter either defaults to the params.pp value or is left blank (to make it mandatory). This means that if you have sensible defaults you can set them here and reduce the size of your hiera files, which in turn makes it easier to see what is happening. Remember, most people don't care about the underlying technology, just the top-level settings, and trust that the rest is magic… for the lower-level bits see these: Puppet without barriers part one for a generic overview, Puppet without barriers part two for manifest consolidation and Puppet without barriers part three for params & hiera.

This is all good, but what if you are in Amazon and you don't know what your box is? Well, it's in a security group, but that is not enough information, especially if your security groups are dynamic. You can also tag your boxes, and you should make use, where possible, of the AWS CLI tools to do this. We decided a long time ago to set a few details on a per-node basis: Env, Role & Name. From these we know what to set the hostname to, which Puppet manifests to apply and which set of hiera variables to apply, as follows…

Step 5 Facts are cool. Write your own custom facts for Facter. We did this in two ways: the first was to just pull down the tags from Amazon (where we host) and return them as ec2_<tag>. This works, but AWS has issues so it fails occasionally. Version 2 was to get the tags, cache them locally in files, and then have Facter pull them from the local files… something like this…

#!/bin/bash
# Load the AWS config
source /tmp/awsconfig.inc

# Grab all tags locally
IFS=$'\n'
for i in $($EC2_HOME/bin/ec2-describe-tags --filter "resource-type=instance" --filter "resource-id=`facter ec2_instance_id`" | grep -v cloudformation | cut -f 4-)
do
        key=$(echo $i | cut -f1)
        value=$(echo $i | cut -f2-)

        if [ ! -d "/opt/facts/tags/" ]
        then
                mkdir -p /opt/facts/tags
        fi
        if [ -n "$value" ]
        then
                echo "$value" > "/opt/facts/tags/$key"
                /usr/bin/logger "set fact $key to $value"
        fi
done

The AWS config file just contains the same info you would use to set up any of the CLI tools on Linux, and you can turn the cached files into facts with this:

# Turn each cached tag file into an ec2_<tag> fact
tags = `ls /opt/facts/tags/`.split("\n")

tags.each do |keys|
        value = `cat /opt/facts/tags/#{keys}`
        fact = "ec2_#{keys.chomp}"
        Facter.add(fact) { setcode { value.chomp } }
end

Also see: Simple facts with puppet

Step 6 Write your own boot scripts. This is a good one: scripts make the world go round. Make a script that installs Puppet and pulls down your git repo, then runs Puppet at the end, along the lines of the sketch below.
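The original script isn't included in the post, so here is a rough Ruby sketch of the idea; the repo URL, package names, paths and the ec2_role fact name are placeholders rather than the real values we used:

#!/usr/bin/ruby
# Rough first-boot sketch: install puppet, pull the manifests from git and
# apply them locally. Repo URL, package manager and paths are hypothetical.

REPO    = "git@github.com:example/puppet.git"   # placeholder repo
WORKDIR = "/opt/puppet-repo"

def run(cmd)
  puts "running: #{cmd}"
  system(cmd) or abort("command failed: #{cmd}")
end

# Install puppet and git if they aren't already on the box
run "yum install -y puppet git" unless system("which puppet > /dev/null 2>&1")

# Pull down (or update) the repo of manifests and modules
if File.directory?("#{WORKDIR}/.git")
  run "cd #{WORKDIR} && git pull"
else
  run "git clone #{REPO} #{WORKDIR}"
end

# node_name_fact lets puppet pick its node/role definition from the ec2_role
# fact provided by the tag-caching fact above (the fact name is an assumption)
run "puppet apply --node_name_fact=ec2_role --modulepath=#{WORKDIR}/modules #{WORKDIR}/manifests/site.pp"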

The node_name_fact setting is awesome: it kicks everything into gear and turns boxes deployed into a security group with the appropriate tags into fully built servers.

Summary

So now Puppet is on each box; every box, from the time it's built, knows what it is (thanks to tags) and bootstraps itself into a fully working server thanks to your own boot script and Puppet. With some well-written scripts you can cron the git pull and a re-run of Puppet if so desired. The main advantage of this method is the distribution: as long as a box manages to pull that git repo it will build itself, and if something changes on the box Puppet will put it back, because everything is local so there are no network issues to worry about.

Openstack

I played with hardware!

It has been over a year since I had to play with hardware properly to achieve something practical, but that is part of the joy of being in the world of cloud computing: that world where you don't own anything, you pay by the hour and occasionally things go horribly wrong, but you delete it and start again; the throwaway society of cloud computing.

Every now and then I get frustrated with AWS, normally because something is wrong; let's say a box that is meant to have unrivalled resources starts going slow. You end up doing some investigation, but the answer is simply that the underlying hypervisor is busy, probably due to other people hammering the server for some reason… Either way, in a cloudy world your choices are thus:

  1. Wait it out
  2. Throw it away

You can hope the problem gets better, or you can delete the server, build it somewhere else and hope that one is better; rinse and repeat until stable service is resumed.

Being throwaway is really useful; it enables you to rebuild quickly and not suffer too much if something major happens, so I think people (you…) should make sure that no matter where your server is you can rebuild it from scratch in less than 10 minutes. Now imagine you could still be throwaway and request servers as and when you wanted via a web UI, a CLI or some API calls, but in addition you had control of the physical hardware, so you could optimise what was running on each hypervisor to offer the best performance. This is all very good, but it is not without its drawbacks: someone has to physically rack and cable all of the servers that run the infrastructure, someone has to patch their firmware, replace dead hard drives and do all of that boring stuff that cloud folk have forgotten about.

So what about Openstack

For those that don't know, OpenStack is a private cloud: it lets you run services in your data centre that mimic AWS. You get block storage (EBS) in the form of Cinder, object storage (S3), instance storage (EC2) and a host of other things that I won't go into. The API may not be 100% the same as AWS, and the features you have in AWS may not be available in OpenStack yet, but it's catching up and doing so rapidly. I would predict that over the next 2-3 years we see OpenStack compete with AWS on features, and even start seeing AWS take features that OpenStack has and port them across. So definitely one to watch.

Over the last 3-6 months OpenStack had come up a few times and I put it on my todo list to have a play, but quite frankly I had other things to be doing. Well, last week I was asked to help set up the SAN and network for an OpenStack PoC for internal IT. Falling back on my not-as-legacy-as-I'd-like Cisco skills, and having used the same SAN tech before, it didn't take long to get that set up, and I thought it would take ages to get the various components of OpenStack up and working. It could have, if it wasn't for one saving grace: the PoC-on-a-disk that Rackspace provide here. It may not be the latest or the most perfect, but it saved a lot of time in getting something up and working. If you aren't sure what OpenStack is, I would suggest getting a few bits of legacy kit and having a play like we did; just set aside two or three days to play with the technology and set up its various elements. It's worth a play.

There are already a few advantages of OpenStack vs AWS; a silly one for me is a console. OpenStack gives you VNC access to your servers, so you can now survive any minor iptables glitch or networking mishap by yourself. Yes, I know it should all be throwaway, but sometimes the box has data on it that is important, or you want to know what went wrong, and having a console is good. Let's also not overlook the fact that you're calling the shots: if it doesn't do what you want it to, you could, if you wanted, commit code back to make it better, change the hardware spec, the distribution of VMs, or any other element in a thousand that you may need to control. With this you can.

But it's not all good. It still comes back to managing your own data centre, and there are very few companies or services that get to a size where they have to move off AWS for performance reasons; typically you'd move off AWS to save a few dollars. By the time you factor in additional head count for maintaining the physical boxes, power, cooling, rack locations, geographically diverse locations and the infrastructure services, plus the platform and its skill set, you may not be saving as much money as you want. You'll probably break even, though, with the advantage of controlling the whole underlying infrastructure on top of still having the throwaway nature of a cloud service.

I'm not saying you should or could build it to support everything all the time, even high bursts of traffic, but at least you could use public cloud for what it's good for: bursting onto when times get hard and more processing power is needed. Granted, to be able to do that all systems would need to be automated and able to migrate at the push of a button. By the time you've gone through that whole process with all of your applications, either in a private cloud or in public cloud, it wouldn't matter if you had to u-turn tomorrow; you could do it, as long as you're smart enough to only use services that are available in multiple places, i.e. in both OpenStack and AWS.

Interesting times ahead I think.

AWS CopySnapshot – A regional DR backup

Finally!

After many months of talking with Amazon about better ways of getting backups from one region to another, they sneak a sneaky little update onto their blog. I will say it here: world-changing update! The ability to easily and readily sync your EBS data between regions is game changing, I kid you not. In my tests I synced 100GB from us-east-1 to us-west-1 so quickly it was done before I'd switched to that region to see it! However… sometimes it is a little slower… Thinking about it, it could have been a blank volume, I don't really know :/

At Alfresco we do not heavily use EBS, as explained here, from when we survived a major Amazon issue that affected many websites larger than our own. We do still have EBS volumes, as it is almost impossible to get rid of them entirely, and by its very nature the data on these EBS volumes is precious, so obviously we want it backed up. A few weeks ago I started writing a backup script for EBS volumes; the script wasn't well tested, it was quickly written, but it worked. I decided that I would very quickly (well, to be fair, I spent ages on it) update the script with the new CopySnapshot feature.

At the time of writing, CopySnapshot exists in one place: the deepest, darkest place known to man, the Amazon Query API. This basically means that rather than simply making a method call, you have to throw all the data at it and back again to make it hang together. For the real programmers out there this is just an inconvenience; for me it was a nightmare, an epic battle between my motivation, my understanding and my Google prowess. In short, I won.

It was never going to be easy…

In the past I have done some basic stuff with REST-type APIs: set some headers, put some variables in the params of the URL and just let it go, all very simple. Amazon's was slightly more advanced, to say the least.

So I had to use the complicated encode, parse, encrypt and back-to-back handshake described here; with that and the CopySnapshot docs I was ready to rock!

After failing for an hour to even get my head around the authentication I decided to cheat and use Google. The biggest breakthrough was thanks to James Murty: the AWS class he has written is perfect. The only issue was my understanding of how to use modules in Ruby, which were very new to me. On a side note, I thought modules were meant to fix issues with namespaces, but for some reason, even though I included the class in the script, it seemed to conflict with the ruby aws-sdk I already had, so I just had to rename the class / file from AWS to AWSAPI and all was then fine. I also had to add a parameter to pass in the AWS_ACCESS_KEY, which was a little annoying as I thought the class would have taken care of that, but to be fair it wasn't hard to work out in the end.

So first things first, have a look at the AWS.rb file on his site; it does the whole signature-signing bit well and saves me the hassle of doing or thinking about it. On a side note, this all uses version 2 of the signing, which I imagine will be deprecated at some point as version 4 is out and about here.

If you were proactive you've already read the CopySnapshot docs and noticed that, in plain English or complicated, that page does not tell you how to copy between regions. I imagine it's because I don't know how to really use the tools, but it's not clear to me… I had noticed that the wording they used was identical to the params being passed in the example, so I tried using Region, DestinationRegion and DestRegion, all failing, kind of as expected seeing as I was left to guess my way through. I was there, at that point where you've had enough and it doesn't look like it is ever going to work, so I started writing a support ticket for Amazon so they could point out whatever it was I was missing. At the moment of being just about to hit submit I had a brainwave: if the only option is to specify the source, then how do you know the destination? Well, I realised that each region has its own API URL, so would that work as the destination? YES!

The journey was challenging, epic even, for this sysadmin to tackle, and yet here we are: a blog post about regional DR backups of EBS snapshots. So without further ado, and no more gilding the lily, I present some install notes and code…

Make it work

The first thing you will need to do is get the appropriate files, i.e. the AWS.rb from James Murty. Once you have it you will need to make the following change:

21c21
< module AWS
---
> module AWSAPI

Next you will need to steal the code for the backup script:

#!/usr/bin/ruby

require 'rubygems'
require 'aws-sdk'
require 'uri'
require 'time'
require 'crack'

#Get options
ENV['AWS_ACCESS_KEY']=ARGV[0]
ENV['AWS_SECRET_KEY']=ARGV[1]
volumes_file=ARGV[2]
source_region=ARGV[3]
source_region ||= "us-east-1"

#Create a class for the aws module
class CopySnapshot
  #This allows me to initialize the module without re-writing it
  require 'awsapi'
  include AWSAPI

end

def get_dest_url (region)
  case region
  when "us-east-1"
    url = "ec2.us-east-1.amazonaws.com"
  when "us-west-2"
    url = "ec2.us-west-2.amazonaws.com"
  when "us-west-1"
    url = "ec2.us-west-1.amazonaws.com"
  when "eu-west-1"
    url = "ec2.eu-west-1.amazonaws.com"
  when "ap-southeast-1"
    url = "ec2.ap-southeast-1.amazonaws.com"
  when "ap-southeast-2"
    url = "ec2.ap-southeast-2.amazonaws.com"
  when "ap-northeast-1"
    url = "ec2.ap-northeast-1.amazonaws.com"
  when "sa-east-1"
    url = "ec2.sa-east-1.amazonaws.com"
  end
  return url
end
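
# Note: for all of the regions above the endpoint follows the pattern
# "ec2.<region>.amazonaws.com", so if you prefer you could collapse this
# helper to a one-liner (a sketch, not part of the original script):
#   def get_dest_url(region)
#     "ec2.#{region}.amazonaws.com"
#   end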

def copy_to_region(description,dest_region,snapshotid, src_region)

  cs = CopySnapshot.new

  #Gen URL
  
  url= get_dest_url(dest_region)
  uri="https://#{url}"

  #Set up Params
  params = Hash.new
  params["Action"] = "CopySnapshot"
  params["Version"] = "2012-12-01"
  params["SignatureVersion"] = "2"
  params["Description"] = description
  params["SourceRegion"] = src_region
  params["SourceSnapshotId"] = snapshotid
  params["Timestamp"] = Time.now.iso8601(10)
  params["AWSAccessKeyId"] = ENV['AWS_ACCESS_KEY']

  resp = begin
    cs.do_query("POST",URI(uri),params)
  rescue Exception => e
    puts e.message
  end

  if resp.is_a?(Net::HTTPSuccess)
    response = Crack::XML.parse(resp.body)
    if response["CopySnapshotResponse"].has_key?('snapshotId')
      puts "Snapshot ID in #{dest_region} is #{response["CopySnapshotResponse"]["snapshotId"]}" 
    end
  else
    puts "Something went wrong: #{resp.class}"
  end
  
end

if File.exist?(volumes_file)
  puts "File found, loading content"
  #Fix contributed by Justin Smith: http://soimasysadmin.com/2013/01/09/aws-copysnapshot-a-regional-dr-backup/#comment-379
  ec2 = AWS::EC2.new(:access_key_id => ENV['AWS_ACCESS_KEY'], :secret_access_key=> ENV['AWS_SECRET_KEY']).regions[source_region]
  File.open(volumes_file, "r") do |fh|
    fh.each do |line|
      volume_id=line.split(',')[0].chomp
      volume_desc=line.split(',')[1].chomp
      # Reset each time so a previous line's destination region doesn't leak through
      volume_dest_region=nil
      if line.split(',').size >2
        volume_dest_region=line.split(',')[2].to_s.chomp
      end
      puts "Volume ID = #{volume_id} Volume Description = #{volume_desc}"
      v = ec2.volumes["#{volume_id}"]
      if v.exists? 
        puts "creating snapshot"
        date = Time.now
        backup_string="Backup of #{volume_id} - #{date.day}-#{date.month}-#{date.year}"
        puts "#{backup_string}" 
        snapshot = v.create_snapshot(backup_string)
        sleep 1 until [:completed, :error].include?(snapshot.status)
        snapshot.tag("Name", :value =>"#{volume_desc} #{volume_id}")
        # if it should be backed up to another region do so now
        if !volume_dest_region.nil? 
          if !volume_dest_region.match(/\s/) ? true : false
            puts "Backing up to #{volume_dest_region}"
            puts "Snapshot ID = #{snapshot.id}"
            copy_to_region(volume_desc,volume_dest_region,snapshot.id,source_region)
          end
        end
      else
        puts "Volume #{volume_id} no longer exists"
      end
    end
  end
else
  puts "no file #{volumes_file}"
end

Once you have that you will need to create a file listing the volumes to back up, in the following format:

vol-127facd,My vol,us-west-1
vol-1ac123d,My vol2
vol-cd1245f,My vol3,us-west-2

The format is "volume id,description,region", where the region is the one you want to back up to. Once you have these details you just call the script as follows:

ruby ebs_snapshot.rb <Access key> <secret key> <volumes file>

I don't recommend putting your keys on the CLI or even in a cron job, but it wouldn't take much to refactor this if that bothers you, along the lines of the sketch below.
It should work quite well; if anyone has any problems let me know and I'll see what I can do :)
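For example, a minimal sketch (with a hypothetical path and key names) would be to read the keys from a root-only YAML file at the top of the script instead of ARGV, which keeps them out of ps output and your crontab:

#!/usr/bin/ruby
# Sketch: load the AWS keys from a root-readable file rather than the CLI.
# The path and the key names in the YAML file are made up for this example.
require 'rubygems'
require 'yaml'

creds = YAML.load_file("/etc/ebs_backup/credentials.yml")
ENV['AWS_ACCESS_KEY'] = creds["access_key"]
ENV['AWS_SECRET_KEY'] = creds["secret_key"]
# ...the rest of the script then only needs the volumes file and source region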

EBS Snapshot script

Like it says

Over the last 18 months, the one key thing I have learnt about Amazon is don't use EBS, in any way, shape or form. In most cases it will be okay, but if you start relying on it, it can ruin even the best-architected service and reduce it to rubble. So you can imagine how pleased I was to find I'd need to write something to make EBS snapshots.

For those of you that don't know, Alfresco Enterprise has an AMP to connect to S3, which is fantastic; it makes use of a local cache while it's waiting for the S3 buckets to actually write the data, and if you're hosting in Amazon this is the way to go. It means you can separate the application from the OS & data, which is important for the following reasons:

1. EBS volumes suck, so where possible don't use them for storing data, or for your OS
2. Having data elsewhere means you can, without prejudice, delete nodes and your data is safe
3. It forces you to build an environment that can be rapidly re-built

So in short, keeping data off the server means you can scale up and down easily and rebuild easily; the key is always to keep the distinctly different areas separate and not merge them together.

So, facing this need to back up EBS volumes, I thought I'd start with snapshots. I did a bit of googling and came across a few EBS snapshot programs that seem to do the job, but I wanted one in Ruby, and I've used Amazon's SDKs before, so why not write my own?

The script

#!/usr/bin/ruby

require 'rubygems'
require 'aws-sdk'

#Get options
access_key_id=ARGV[0]
secret_access_key=ARGV[1]



if File.exist?("/usr/local/bin/backup_volumes.txt")
  puts "File found, loading content"
  ec2 = AWS::EC2.new(:access_key_id => access_key_id, :secret_access_key=> secret_access_key)
  File.open("/usr/local/bin/backup_volumes.txt", "r") do |fh|
    fh.each do |line|
      volume_id=line.split(',')[0].chomp
      volume_desc=line.split(',')[1].chomp
      puts "Volume ID = #{volume_id} Volume Description = #{volume_desc}"
      v = ec2.volumes["#{volume_id}"]
      if v.exists? 
        puts "creating snapshot"
        date = Time.now
        backup_string="Backup of #{volume_id} - #{date.day}-#{date.month}-#{date.year}"
        puts "#{backup_string}" 
        snapshot = v.create_snapshot(backup_string)
        sleep 1 until [:completed, :error].include?(snapshot.status)
        snapshot.tag("Name", :value =>"#{volume_desc} #{volume_id}")
      else
        puts "Volume #{volume_id} no longer exists"
      end
    end
  end
else
  puts "no file backup_volumes.txt"
end

I started writing it with the idea of having it just back up every EBS volume that ever existed, but I thought better of it. So I added a file, "backup_volumes.txt"; the script reads this and looks for a volume id and a name for it, i.e.

vol-1264asde,Data Volume

If you wanted to back up everything it wouldn't take much to extend this, i.e. change the following:

v = ec2.volumes["#{volume_id}"]

To

ec2.volumes.each do |v|

or at least something like that…
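For what it's worth, a rough sketch of that "back up everything" variant, reusing the same calls already in the script above, might look like this (untested, just to show the shape of the loop):

# Sketch: snapshot every volume in the account instead of reading a file
ec2.volumes.each do |v|
  next unless v.exists?
  date = Time.now
  backup_string = "Backup of #{v.id} - #{date.day}-#{date.month}-#{date.year}"
  puts backup_string
  snapshot = v.create_snapshot(backup_string)
  sleep 1 until [:completed, :error].include?(snapshot.status)
  snapshot.tag("Name", :value => "Auto backup #{v.id}")
end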

Anyway, the script takes the keys as CLI params, so it is quite easy to run it on one server in several cron jobs with different keys if needed.

It's worth mentioning at this point that within AWS you should be using IAM to restrict the EBS policy down to the bare minimum; something like this is a good start:

{
  "Statement": [
    {
      "Sid": "Stmt1353967821525",
      "Action": [
        "ec2:CreateSnapshot",
        "ec2:CreateTags",
        "ec2:DescribeSnapshots",
        "ec2:DescribeTags",
        "ec2:DescribeVolumeAttribute",
        "ec2:DescribeVolumeStatus",
        "ec2:DescribeVolumes"
      ],
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}

For what it's worth, you can spend a while reading loads of stuff to work out how to set up the policy, or just use the policy generator.

Right, slightly back on topic: it also tags the volume's name for you, because we all know the description field isn't good enough.

Well, that is it. Short and something, something… Now the disclaimer: I only ran the script a handful of times and it seemed good, so please test it :)

Monitoring Alfresco Solr

It’s not just about numbers

Up until recently, if you wanted to monitor Alfresco's Solr usage you would have had to either make a costly call to the stats page or use the summary report, which only really gave you a lag number. Luckily, because Alfresco have extended Solr, they have changed the summary report to provide some really useful information which can then be tracked via Nagios or whatever your favourite monitoring tool is.

Firstly, it's worth reading the wiki as it explains the variables better than I would. It's also worth mentioning that my preferred way of programmatically accessing this page is via JSON, like so:

 
http://localhost:8080/solr/admin/cores?action=SUMMARY&wt=json

It's worth mentioning that, depending on the JSON parsing library you are using, you can get fatal parsing errors caused by the hit ratio. For what it's worth I found Crack to be good; it doesn't validate the JSON as heavily as the raw JSON library does, which means you can pull back all the data even if there is a problem with the hit ratios.

On that subject, before the relevant cache is hit the hit ratio will display "NaN" (Not a Number); once it has been hit it will display the appropriate number, which I'll dive into a bit more later.

So before getting into the nitty-gritty service checks, it's important to have a good understanding of the numbers. Most of them are straightforward; the only one that confused me was the hit ratio.

The hit ratio is a number between 0 and 1: when the number is greater than, say, 0.3 all is well; less than 0.3 and things could be bad. However, when the lookup count is less than, say, 100, you would expect the hit ratio to be low, as the cache is not being hit enough to give a meaningful figure. Other than the hit ratio, the rest are pretty straightforward.

Some code

It's probably worth sharing the class I'm using to access and return Solr information; that way, if you want to write your own Nagios checks, you can just copy / paste.

Firstly, the class that gets all the Solr information:

#
# Solr Metric gatherer

require 'rubygems'
require "crack"
require 'open-uri'

class SolrDAO

  def initialize (url)
    @solr_hash = get_metrics(url)
  end

  def get_lag(index)
    lag = @solr_hash["Summary"][index]["TX Lag"]
    regex= Regexp.new(/\d*/)
    lag_number = regex.match(lag)
    return lag_number
  end

  def get_alfresco_node_in_index(index)
    return @solr_hash["Summary"][index]["Alfresco Nodes in Index"]
  end

  def get_num_docs(index)
    return @solr_hash["Summary"][index]["Searcher"]["numDocs"]
  end

  def get_alfresco_avgTimePerRequest(index)
    return @solr_hash["Summary"][index]["/alfresco"]["avgTimePerRequest"]
  end

  def get_afts_avgTimePerRequest(index)
    return @solr_hash["Summary"][index]["/afts"]["avgTimePerRequest"]
  end

  def get_cmis_avgTimePerRequest(index)
    return @solr_hash["Summary"][index]["/cmis"]["avgTimePerRequest"]
  end

  def get_mean_doc_transformation_time(index)
    return @solr_hash["Summary"][index]["Doc Transformation time (ms)"]["Mean"]
  end

  def get_queryResultCache_lookups(index)
    return @solr_hash["Summary"][index]["/queryResultCache"]["lookups"]
  end

  def get_queryResultCache_hitratio(index)
    return @solr_hash["Summary"][index]["/queryResultCache"]["hitratio"]
  end

  def get_filterCache_lookups(index)
    return @solr_hash["Summary"][index]["/filterCache"]["lookups"]
  end

  def get_filterCache_hitratio(index)
    return @solr_hash["Summary"][index]["/filterCache"]["hitratio"]
  end

  def get_alfrescoPathCache_lookups(index)
    return @solr_hash["Summary"][index]["/alfrescoPathCache"]["lookups"]
  end

  def get_alfrescoPathCache_hitratio(index)
    return @solr_hash["Summary"][index]["/alfrescoPathCache"]["hitratio"]
  end

  def get_alfrescoAuthorityCache_lookups(index)
    return @solr_hash["Summary"][index]["/alfrescoAuthorityCache"]["lookups"]
  end

  def get_alfrescoAuthorityCache_hitratio(index)
    return @solr_hash["Summary"][index]["/alfrescoAuthorityCache"]["hitratio"]
  end

  def get_queryResultCache_warmupTime(index)
    return @solr_hash["Summary"][index]["/queryResultCache"]["warmupTime"]
  end

  def get_filterCache_warmupTime(index)
    return @solr_hash["Summary"][index]["/filterCache"]["warmupTime"]
  end

  def get_alfrescoPathCache_warmupTime(index)
    return @solr_hash["Summary"][index]["/alfrescoPathCache"]["warmupTime"]
  end

  def get_alfrescoAuthorityCache_warmupTime(index)
    return @solr_hash["Summary"][index]["/alfrescoAuthorityCache"]["warmupTime"]
  end

  private
  def get_metrics(url)
    url += "&wt=json"
    response = open(url).read
    # Convert to hash
    result_hash = {}
    result_hash = Crack::JSON.parse(response)
    # if the hash has 'Error' as a key, we raise an error
    if result_hash.has_key? 'Error'
      raise "web service error"
    end
    return result_hash
  end

end # End of class

As you can see, it is quite straightforward to extend this if you want to pull back different metrics. At some point I will hook this into a GitHub repo or use it in another metrics-based project, but for now just use this.
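For example, rather than adding a named getter for every metric, you could drop in a generic accessor for any top-level key in the summary; a small sketch, using the same hash structure as above:

  # Generic accessor, e.g. solr_dao.get_summary_value("alfresco", "TX Lag")
  def get_summary_value(index, key)
    return @solr_hash["Summary"][index][key]
  end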

Now, some of you may not be used to using Ruby, so here is a check that checks the filterCache hit ratio:

#!/usr/bin/ruby
$:.unshift File.expand_path("../", __FILE__)
require 'lib/solr_dao.rb'
solr_results=SolrDAO.new("http://localhost:8080/solr/admin/cores?action=SUMMARY")
hitratio=solr_results.get_filterCache_hitratio("alfresco").to_f
lookups=solr_results.get_filterCache_lookups("alfresco").to_i
#Hit ratio is an inverse, 1.0 is perfect 0.1 is crap, and can be ignored if there is less than 100 lookups
inverse=(1.0-hitratio)
critical=0.8
warning=0.7
if (inverse.is_a? Float)
  if ( lookups >= 100 )
    if ( inverse >= warning )
      if (inverse >= critical )
        puts "CRITICAL :: FilterCache hitratio is #{hitratio}|'hitratio'=#{hitratio};#{warning};#{critical};;"
        exit 2
      else
        puts "WARNING :: FilterCache hitratio is #{hitratio}|'hitratio'=#{hitratio};#{warning};#{critical};;"
        exit 1
      end
    else
      puts "OK :: FilterCache hitratio is #{hitratio}|'hitratio'=#{hitratio};#{warning};#{critical};;"
      exit 0
    end
  else
    puts "OK :: FilterCache hitratio is #{hitratio}|'hitratio'=#{hitratio};#{warning};#{critical};;"
    exit 0
  end
else
  puts "UNKNOWN :: FilterCache hitratio is #{hitratio}"
  exit 3
end

To get this to work you'll just need to put it with your other Nagios checks, and in the same directory add a lib directory containing the SolrDAO class from further up. If you need to change its location you will only need to adjust the following:


$:.unshift File.expand_path("../", __FILE__)
require 'lib/solr_dao.rb'

Also, if you wanted to, you could modify the script to take the critical and warning thresholds as params so you can easily change them from within Nagios, as in the sketch below.
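A minimal way to do that (the argument order here is just an example) is to read the thresholds from ARGV at the top of the check and fall back to the defaults used above:

# Sketch: take warning and critical from the command line, e.g.
#   check_solr_filtercache.rb 0.7 0.8
warning  = (ARGV[0] || 0.7).to_f
critical = (ARGV[1] || 0.8).to_f

You can then set the thresholds per service from the Nagios command definition rather than editing the script.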

DevOps team DNA

Hi, this is my first post on Matt’s blog. I’ve been an avid supporter of his blogging for a while and today got an invite to contribute. So here’s my post (created very quickly before he changes his mind).

My job has always been within an operations department of software product companies. I started at a small company as 'everything' support and slowly drifted towards a specialisation in the more recently branded DevOps-y areas as I made my way through various acquisitions and mergers. Over the past couple of years I've found myself building DevOps teams. During that time I've discovered some of the things that work and almost everything that doesn't (or it feels like that :) ).

Some of the things that have worked… (for me anyway)

Obviously these are going to be quite subjective and I doubt they will work for everyone. I'll focus mostly on what I think are the key ingredients of a successful team. Maybe some people will find it interesting. Bear in mind that this only really applies to an operations team that supports a cloud service.

I'm not a big football fan, but I can draw some parallels between football managers and DevOps teams. You don't see Arsenal winning and losing games based on their process redesigns. I may be simplifying, and I'm sure tactics play a large part, but I believe you get quite a bit more out of a team when you have excellent players, players who excel in different areas. My teams tend to be 5 - 7 players nowadays, and between us we need to cover a few areas.

The first is product knowledge. If you have a product guru in your team then you've got a productivity catalyst. So many aspects of our work involve investigating whether issues are product vs config, and whether we can improve things from an operational perspective that will result in the product running better. The most recent team has a Product Architect and he's awesome: he's on the cutting edge of ideas for the product, for Amazon AWS and for all of the supporting technologies. Having a dedicated resource doing all of this in the background is great; it means that when we automate his prototypes and release them we get the maximum benefit. Recent examples include our public API work and the work being done on our Amazon architecture to improve speed (CDNs etc).

The second role I've always tried to fill is an engineer (at least one person, preferably two). Get the most senior developer(s) that you can who know the language and build system of the product you are supporting. You can now write the high-level instrumentation that every DevOps team needs, as is true with any automation project. There is only so far you can go with Bash (I tend to take things beyond where they are supposed to go with Bash as it is). Ultimately, having a senior developer or two buys you a massive amount of flexibility. Need a web service for something like externalised Puppet variables? You can write your own. Backup scripts not fast enough? A senior developer will make those scripts look very feeble when rewritten in their preferred language and multithreaded. I'm careful about not reinventing the wheel and will usually go off and clone something from GitHub before starting from scratch myself, but having some people who can write stuff from scratch is a major advantage. One caveat I would add for this role: hire from outside. Developers usually end up getting pulled back to work on stuff they did at the company at some stage. If you can, hire a new person and liven things up. Obviously tell the engineering teams that the hire(s) are for instrumentation, in case they get worried that you want to start adding buttons to the product :)

Lastly, the sysadmins. I'd actually consider myself one of these at heart. Getting a good sysadmin can be tricky; it's not uncommon to read 100 CVs before finding someone even remotely eligible. For a DevOps team you need a reasonably rare mix of skills: people who know Linux inside out, who can script, who get excited by the latest batch of tools, and nowadays you need to throw Puppet / Chef into the mix. I have a couple of these currently and consider myself extremely blessed. Everything that we do is checked into source control (we use AWS as our data centre) and this buys us a lot of things, like the ability to automate everything, reduce costs by deleting and recreating at whim, and disaster recovery. However, you pay for those things by hiring really good people, which is a cost saving in the long run once the benefits of the team start to show.

Now, if you add in all of those types of role, what I've found works quite nicely is running the team without being too focussed on the separation of responsibility. Everyone is on call 24/7. Everyone is expected to know the product inside out (although nobody will get near the level of expertise of the Product Architect), everyone scripts (even me) and ultimately everyone ends up doing some programming tasks. You can probably see from Matt's previous blog posts about the metrics project that he got the chance to learn some Ruby. I think it's important that everyone knows a bit of everyone else's job, although when under pressure everyone naturally drops back into doing what they are good at to speed things along.

This probably looks a little odd from the outside, but it makes things fun, everyone stays engaged and ultimately we all share the same goal: scale to 1 million users :D