Openstack

I played with hardware!

It has been over a year since I had to play with hardware properly to achieve something practical, but that is part of the joy of being in the world of cloud computing. That world where you don’t own anything, you pay by the hour and occasionally things go horribly wrong but you delete it an start again; the throw away society of cloud computing.

Every now and then I get frustrated with AWS, normally because there is a something wrong, lets say a box that is meant to have unrivalled resources starts going slow, you end up doing some investigation but the answer is simply that the underlying hypervisor is busy, probably due to other people hammering the server for some reason… Either way in a cloudy world your choices are thus:

  1. Wait it out
  2. throw it away

You could hope the problem gets better or you could delete the server and build it somewhere else and hope that one is better, rinse and repeat the above two until a stable service is resumed.

Being throw away is really useful, it enables you to re-build quickly and not suffer to much if something major happens so I think people (you…) should make sure that no matter where your server is you can rebuild it from scratch in less than 10 mins. If you had the ability to still be throw away and request servers as and when you wanted via a WebUI or a CLi or some API calls but in addition to all of that you had the control of the physical hardware you could optimise what was running on the hypervisor to offer the best performance, this is all very good but is not with out its draw backs; someone has to physical rack / cable in all of the servers that are running the infrastructure, someone has to firmware patch them and replace dead hard drives and do all of that Boring stuff that cloud folk have forgotten about.

So what about Openstack

So for those that don’t know OpenStack is a private cloud, this means you can run services in your data centre that mimic AWS, You get the Block storage (EBS) in the form of Cinder, you get Object storage (S3) Instance storage (EC2) and a host of other things that I won’t go into. So the API may not be 100% the same as AWS and the features that you have in AWS may not be available in Openstack yet, but it’s catching up and it’s doing so rapidly. I would predict that over the next 2-3 years we see openstack compete with AWS for features and even start seeing AWS taking features that openstack has and porting them to AWS. So definitely one to watch.

Over the last 3-6 months it had come up a few times about openstack and I put it on my todo list to have a play but quite frankly I had other things to be doing. Well last week I was asked to help set up the SAN and network for a openstack PoC for the internal IT, falling back on my not as legacy as I’d like Cisco skills and having used the same SAN tech before it wasn’t long to get that set up and I thought it would take ages to get the various components of openstack up and working. Well it could have if it wasn’t for one saving grace, the PoC on a disk that Rackspace provide Here it may not be the latest or the most perfect but it saved a lot of time in getting something up and working and if you aren’t sure what it is I would suggest getting a few bits of legacy kit and having a play like we did, just set aside two or three days to play with the technology and to set up the various elements of it, it’s worth a play.

There’s already a few advantages of openstack vs aws, a silly one for me is a console. Openstack gives you VNC access to your servers, you can now survive any minor iptables glitch or networking mishap by your self, yes I know it should all be throw away, but sometimes the box has some data on it that is important or you want to know what went wrong and having a console is good. Lets not overlook the fact that you’re calling the shots so if it doesn’t do what you want it too you could if you wanted commit code back to make it better, change the hardware spec, distribution of VM’s or any other element in a thousand that you may need to control, with this you can.

But it’s not all good, it still comes back to managing your own data centre and there’s very few companies or services that get to a size where they have to move off of AWS for performance reasons, typically you’d move off of AWS to save a few dollars, but by the time you fator in additional head count for maintaining the physical boxes, power, cooling, rack locations, geographically diverse locations and the infrastructure services, the platform and it’s skill set you may not be saving as much money as you want, but you’ll probably break even with the advantage of controlling the whole underlying infrastructure on top of still having the throwaway nature a cloud services.

I’m not saying you should and could make it so you support everything all the time even high bursts of traffic, but at least you could use public cloud for what it’s good for, bursting onto when times get hard and more processing power is needed. Granted to be able to do that all systems would need to be automated and be able to migrate at the push of a button. By the time you’ve gone through that whole process with all of your applications either in a private cloud or in public cloud it wouldn’t matte rif you had to u-turn tomorrow you could do that. As long as you’re smart enough to oly use services that are available in multiple places i.e. in openstack and AWS.

Interesting times ahead I think.

Applying AMPs to Alfresco

A bit of background

Alfresco comes with the ability to be extended in a nice easy way, that is through the use of Alfresco Module Packages. In essence it is the delta of changes you would have made to the raw source code if you wanted to make some sort of customisation or apply one of the ones alfresco supplies like the S3-connector.

Over the last 3 years I’ve seen it done a number of ways, using the mmt tool to apply them manually, a shell script to do it and now I decided that wasn’t good enough.

Using the mmt tool manually is obviously not brilliant, some poor person has to sit there and run it to apply the amps. So you may have guessed this is not a good idea.

What about wrapping the mmt tool in a shell script that can be triggered by say a sysadmin to apply all the amps or just have it executed once per amp using some configuration managent tool like puppet. This is good. You put the amp into the configuration management tool push the right buttons and it magically get’s applied to the war files and all is well. Well sort of, what happens if someone just throws an amp on the server? who puts it in configuration management? who’s made a backup? Well I decided that I’d write a new script for applying amps so that it can be used both with a CM and as a ad-hock script.

What does it do?

I’ve written it so it will trawl through a directory and pull up every amp in the directory and it will apply the amps to alfresco or share as needed. What’s quite handy is that it will take several versions of an amp and work out what the latest version is, it will check the latest version against what is already installed in the war and then if the amp is a newer version it will apply it after making a backup.

For some odd reason I also made it cope with a variety of amp naming schemes, so you could upload alfresco-bob-123.amp or you could upload frank-share-super-1.2.3.4-5.amp it’s your amp, call it what you want. All the script cares about is the correlation of terms between the file name and the amp info when it’s installed. So as long as you use 2 words from the file name that also appear in the amp description it will work it out for you. The higher the correlation the more accurate it would be, it is configurable too but I set it to 2 occurrence of at least 2 words to match, so far… it’s working.

I also forgot to mention that the script will stop alfresco clear the caches and restart it for you in a pretty safe way.

A Script

Firstly I realise this is a bad format to get the script I’ve in the past put them in a git repo and shared it that way, I have put this one in a git repo and I hope to share that repo with some of the things we have done at alfresco that are useful for either running servers in general or for running alfresco either way I hope to shortly get that out on a public repo but for now here it is:

#!/usr/bin/ruby
# Require libs
$:.unshift File.expand_path("../", __FILE__)
require 'lib/logging'
require 'fileutils'
require 'timeout'

# Set up logging provider
Logging::new('alfresco_apply_amps')
Logging.log_level("INFO",false)
$log.info "Starting"

#CONSTANTS
BKUP_LOCATION="/var/lib/alfresco/alf_data/backups"
ALF_MMT="/var/lib/tomcat6/bin/alfresco-mmt.jar"
WEBAPPS_DIR="/var/lib/tomcat6/webapps"
AMP_LOCATIONS=["/var/lib/alfresco/alf_data/amps/"]

#Defaults
@restart=false

#Methods
def available_amps(amp_dir)
  #Get a list of Amps
  amps_list = `ls #{amp_dir}*`
  amps_array = amps_list.split("\n")
end

def backup(war)
  version=`/usr/bin/java -jar #{ALF_MMT} list #{WEBAPPS_DIR}/#{war}.war | grep Version | awk '{print $3}'`

  #Date stamp the war
  $log.info "Backing up #{WEBAPPS_DIR}/#{war}.war to #{BKUP_LOCATION}/#{war}-#{current_date}.war"
  `cp -a #{WEBAPPS_DIR}/#{war}.war #{BKUP_LOCATION}/#{war}-#{current_date}.war`
end

def clear_caches()
  $log.debug "Cleaning caches"  
  delete_dir("#{WEBAPPS_DIR}/alfresco/")
  delete_dir("#{WEBAPPS_DIR}/share/")
  delete_dir('/var/cache/tomcat6/work/',true)
  delete_dir('/var/cache/tomcat6/temp/',true)
  $log.info "Caches cleaned"  
end

def compare_strings(str1,str2,options={})
  matches = options[:matches] || 2
  frequency = options[:frequency] || 2

  #Make one array of words
  words=Array.new
  words << str1.split(' ') << str2.split(' ')
  words.flatten!
  #Hash to store each unique key in and number of occurances
  keys = Hash.new
  words.each do |key|
    if keys.has_key?(key)
      keys[key] +=1
    else
      keys.merge!({key =>1})
    end
  end

  #Now we have a Hash of keys with counts how many matches and what frequency
  #where a match is a unique key >1 and frequency si the count of each key i.e. 
  #matches=7 will mean 7 keys must be >1 frequency=3 means 7 matches must be > 3
  
  act_matches=0
  keys.each_pair do |key,value|
    if value >= frequency
      act_matches +=1
    end
  end
  if act_matches >= matches
    true
  else
    false
  end
end

def compare_versions(ver1,ver2)
  #return largest
  v2_maj=0
  v2_min=0
  v2_tiny=0
  v2_release=0
  v1_maj=0
  v1_min=0
  v1_tiny=0
  v1_release=0
  if ver1 =~ /\./ && ver2 =~ /\./
    #both are dotted notation
    #Compare maj -> release

    #Conver '-' to '.'
    ver1.gsub!(/-/,'.')
    ver2.gsub!(/-/,'.')

    v1_maj = ver1.split('.')[0]
    v1_min = ver1.split('.')[1] || 0
    v1_tiny = ver1.split('.')[2] || 0
    v1_release = ver1.split('.')[3] || 0

    v2_maj = ver2.split('.')[0]
    v2_min = ver2.split('.')[1] || 0
    v2_tiny = ver2.split('.')[2] || 0
    v2_release = ver2.split('.')[3] || 0

    if v1_maj > v2_maj
      return ver1
    elsif v1_min > v2_min
      return ver1
    elsif v1_tiny > v2_tiny
      return ver1
    #Don't compare release for now as some amps don't put the release in the amp when installed so you end up re-installing
    #elsif v1_release > v2_release
    #  return ver1
    else
      return ver2
    end
  else
    #Validate both are not-dotted
    if ver1 =~ /\./ || ver2 =~ /\./
      $log.debug "Eiher both types aren't the same or there's only one amp"
      return ver2
    else
      result = ver1<=>ver2
      if result.to_i > 0 && !result.nil?
        return ver1
      else
        return ver2
      end
    end
  end
end

def current_date()
  year=Time.now.year
  month=Time.now.month
  day=Time.now.day
  if month < 10
    month = "0"+month.to_s
  end
  if day < 10
    day = "0"+day.to_s
  end
  "#{year.to_s+month.to_s+day.to_s}"
end

def current_version(app, amp_name)

#
# THIS needs to cope with multiple amps being installed, produce a array hash [{:amp=>"ampname",:version => ver},etc]
#

  if app == "alfresco" || app == "share"
    amp_info = `/usr/bin/java -jar #{ALF_MMT} list #{WEBAPPS_DIR}/#{app}.war`
    amp_title=""
    amp_ver=0
    #$log.debug "Amp info: #{amp_info}"
    amp_info.each_line do |line|
      if line =~ /Title/
        amp_title=line.split("Title:").last.strip.gsub(%r/(-|_|\.)/,' ')
      elsif line =~ /Version/
        # strip/replace ampname, downcase etc
        if compare_strings(amp_name.gsub(%r/(-|_|\.)/,' ').downcase,amp_title.downcase)
          amp_ver=line.split("Version:").last.strip
          $log.info "Installed Amp found for #{amp_name}"
          $log.debug "Installed version: #{amp_ver}"
        else
          $log.debug "No installed amp for #{amp_name} for #{app}"
        end
      end
    end
  else
    $log.warn "The application #{app} can not be found in #{WEBAPPS_DIR}/"
  end
  return amp_ver
end

def delete_dir (path,contents_only=false)
  begin
    if (contents_only)
      $log.debug "Removing #{path}*"
      FileUtils.rm_rf Dir.glob(path+"*")
    else
      $log.debug "Removing #{path}"
      FileUtils.rm_rf path
    end
  rescue Errno::ENOENT
    $log.warn "#{path} Does not exist"
  rescue Erro::EACCES
    $log.warn "No permissions to delete #{path}"
  rescue
    $log.warn "Something went wrong"
  end
end

def firewall(block=false)
  if block
    `/sbin/iptables -I INPUT -m state --state NEW -m tcp -p tcp  --dport 8080 -j DROP`
  else
    `/sbin/iptables -D INPUT -m state --state NEW -m tcp -p tcp  --dport 8080 -j DROP`
  end
end

def get_amp_details(amps)
  amps_hash = Hash.new
  amps.each do |amp|
    amp_hash = Hash.new
    #Return hash with unique amps with just the latest version
    amp_filename = amp.split("/").last
    amp_path = amp
    amp_name=""
    amp_version=""
    first_name=true
    first_ver=true
    #Remove the ".amp" extension and loop through
    amp_filename[0..-5].split("-").each do |comp|
      pos = comp =~ /\d/
      if pos == 0
        if first_ver
          amp_version << comp
          first_ver=false
        else
          #By commenting this out the release will get ignored which because some amps to put it in their version is probably safest
          #amp_version << "-" << comp
        end
      else
        if first_name
          amp_name << comp.downcase
          first_name=false
        else
          amp_name << "_" << comp.downcase
        end
      end
    end

    #If a key of amp name exists, merge the version down hash else merge the lot
    if amps_hash.has_key?(amp_name)
      amp_hash={amp_version => {:path => amp_path, :filename => amp_filename}}
      amps_hash[amp_name].merge!(amp_hash)
    else 
      amp_hash={amp_name =>{amp_version => {:path => amp_path, :filename => amp_filename}}}
      amps_hash.merge!(amp_hash)
    end
  end
  return amps_hash
end

def install_amp(app, amp)
  $log.info "applying amp to #{app}"
  $log.warn "amp path must be passed!" unless !amp.nil?

  $log.debug "Command to install = /usr/bin/java -jar #{ALF_MMT} install #{amp} #{WEBAPPS_DIR}/#{app}.war -nobackup -force"
  `/usr/bin/java -jar #{ALF_MMT} install #{amp} #{WEBAPPS_DIR}/#{app}.war -nobackup -force`
  restart_tomcat?(true)
  $log.debug "Setting flag to restart tomcat"
end

def latest_amps(amp_hash)
  amp_hash.each_pair do |amp,amp_vers|
    latest_amp_ver=0
    $log.debug "Comparing versions for #{amp}"
    amp_vers.each_key do |version|
      $log.debug "Comparing #{latest_amp_ver} with #{version}"
      latest_amp_ver = compare_versions(latest_amp_ver,version)
      $log.info "Latest version for #{amp}: #{latest_amp_ver}"
      if latest_amp_ver != version
        amp_vers.delete(version)
      end
    end
  end
  return amp_hash
end

def next_version?(ver, current_ver, app)
  #Loop through amp versions to work out which is newer than the installed
  #Turn list into array
  next_amp=false
  $log.debug "if #{ver} > #{current_ver}"
  if ( ver.to_i > current_ver.to_i)
    $log.debug "Next #{app} amp version to be applied:  #{ver}"
    next_amp=true
  end
end

def restart_tomcat()
  #If an amp was applied restart
  if (restart_tomcat?)
    $log.info "Restarting Tomcat.... this may take some time"
    $log.debug"Getting pid"
    if (File.exists?('/var/run/tomcat6.pid') )
      pid=File.read('/var/run/tomcat6.pid').to_i
      $log.debug "Killing Tomcat PID= #{pid}"
      begin
        Process.kill("KILL",pid)
        Timeout::timeout(30) do
          begin
            sleep 5
            $log.debug "Sleeping for 5 seconds..."
          end while !!(`ps -p #{pid}`.match pid.to_s)
        end
      rescue Timeout::Error
        $log.debug "didn't kill process in 30 seconds"
      end
    end
    $log.debug "Killed tomcat"

    #Clear caches
    clear_caches
    $log.info "blocking firewall access"
    firewall(true)
    $log.debug "starting tomcat"
    `/sbin/service tomcat6 start`
    if ($?.exitstatus != 0)
      $log.debug "Tomcat6 service failed to start, exitstatus = #{$?.exitstatus}"
    else
      #Tomcat is starting sleep until it has started
      #For now sleep for 180 seconds
      $log.info "Sleeping for 180 seconds"
      sleep 180
      $log.info "un-blocking firewall access"
      firewall(false)
    end
  else
    $log.info "No new amps to be installed"
  end
end

def restart_tomcat?(bool=nil)
  @restart = bool unless bool.nil?
  #$log.debug "Restart tomcat = #{@restart}"
  return @restart
end

# - Methods End

#
# doGreatWork()
#

#Store an Hash of amps
amps=Hash.new

#For each AMP_LOCATIONS find the latest Amps
AMP_LOCATIONS.each do |amp_loc|
  $log.debug "Looking in #{amp_loc} for amps"
  amps.merge!(get_amp_details(available_amps(amp_loc)))
end

#Sort through the array and return only the latest versions of each amp
latest_amps(amps)

amps.each do |amp, details|
  #The Amps in here are the latest of their kind available so check with what is installed
  details.each_pair do |version,value|
    if amp =~ /share/
      if next_version?(version,current_version("share",amp),"share")
        $log.debug "Backing up share war"
        backup("share")
        $log.info "Installing #{amp} (#{version}): #{value[:path]}"
        install_amp("share",value[:path])
      else
        $log.info "No update needed"
      end
    else
      if next_version?(version,current_version("alfresco",amp),"alfresco")
        $log.debug "Backing up alfresco war"
        backup("alfresco")
        $log.info "Installing #{amp} (#{version}): #{value[:path]}"
        install_amp("alfresco",value[:path])
      else
        $log.info "No update needed"
      end
    end
  end
end

$log.debug "Restart tomcat?: #{restart_tomcat?}"
restart_tomcat

$log.info "All done for now"

Okay 2 things, it’s a long script all in one file to make it easy to transport, I’ve also used a logging class that enables logging to screen / file that is …below :) you could also just remove the require at the top and replace “$log.debug” with “puts” up to you.

#
#   Set up Logging
#

require 'rubygems'
require 'log4r'

class Logging

  def initialize(log_name,log_location="/var/log/")
    # Create a logger named 'log' that logs to stdout
    $log = Log4r::Logger.new log_name

    # Open a new file logger and ask him not to truncate the file before opening.
    # FileOutputter.new(nameofoutputter, Hash containing(filename, trunc))
    file = Log4r::FileOutputter.new('fileOutputter', :filename => "#{log_location}#{log_name}.log",:trunc => false)

    # You can add as many outputters you want. You can add them using reference
    # or by name specified while creating
    $log.add(file)
    # or mylog.add(fileOutputter) : name we have given.

    # As I have set my logging level to ERROR. only messages greater than or 
    # equal to this level will show. Order is
    # DEBUG < INFO < WARN < ERROR < FATAL

    # specify the format for the message.
    format = Log4r::PatternFormatter.new(:pattern => "[%l] %d: %m")

    # Add formatter to outputter not to logger. 
    # So its like this : you add outputter to logger, and add formattters to outputters.
    # As we haven't added this formatter to outputter we created to log messages at 
    # STDOUT. Log messages at stdout will be simple
    # but the log messages in file will be formatted
    file.formatter = format
    
  end

  def self.log_level(lvl,verbose=false)
    # You can use any Outputter here.
    $log.outputters = Log4r::Outputter.stdout if verbose

    # Log level order is DEBUG < INFO < WARN < ERROR < FATAL
    case lvl
        when    "DEBUG"
            $log.level = Log4r::DEBUG
        when    "INFO"
            $log.level = Log4r::INFO
        when    "WARN"
            $log.level = Log4r::WARN
        when    "ERROR"
            $log.level = Log4r::ERROR
        when    "FATAL"
            $log.level = Log4r::FATAL
        else
             print "You provided an invalid option: #{lvl}"
    end
  end

end

I hope this helps people out, if there’s any issues just leave comments and i’ll help :)

I did it, Plugins

I said I couldn’t do it

It was not long ago I said in the Sentinel Update I didn’t know how to do plugins. Well less than a week after writing it I was reading a few articles by Gregory Brown on modules and Mixins. These are the first time I’ve read something that explains them in a way I actually understand.

I was doing research into modules and mixins as they seemed a bit pointless but thanks to the articles by Gregory I was able to understand them and right in the middle of reading some of the examples and having a play a lightning bolt struck me, it all became clear on how to implement the a plugin manager.

Some Bad Code

Based on some of the stuff I saw I came up with the following, ignore most of it I was just hacking around to see if I could get it to work the names meant more in a previous iteration.

module PluginManager
#Just seeing if this works like magic...
    class LoadPlugin
        def initialize
            #The key is a plugin_name the &block is the code so in theory when initialzed it can be run
            @plugins={} unless !@plugins.nil?
        end

        def add_plugin (key,&block)
            @plugins.merge!({key=>block})
        end

        def run_plugin (key)
            puts "Plugin to Run = #{key}"
            puts "Plugin does:\n"
            @plugins[key].call

        end

        def list_plugins
            @plugins.each_key {|key| puts key}
        end
    end

end

plugins = PluginManager::LoadPlugin.new

plugins.add_plugin (:say_hello) do
    puts "Hello"
end

plugins.add_plugin (:count_to_10) do
    for i in 0..10
        puts "#{i}"
    end
end

plugins.add_plugin (:woop) do
    puts "Wooop!"
end

plugins.add_plugin (:say_goodbye) do
    puts "Good Bye :D"
end

puts "in theory... Multiple plugins have been loaded"
puts "listing plugins:"
plugins.list_plugins
puts "running plugins:"
plugins.run_plugin (:say_hello)
plugins.run_plugin (:woop)
plugins.run_plugin (:count_to_10)
plugins.run_plugin (:say_goodbye)

And when it runs:

in theory... Multiple plugins have been loaded
listing plugins:
say_hello
count_to_10
woop
say_goodbye
running plugins:
Plugin to Run = say_hello
Plugin does:
Hello
Plugin to Run = woop
Plugin does:
Wooop!
Plugin to Run = count_to_10
Plugin does:
0
1
2
3
4
5
6
7
8
9
10
Plugin to Run = say_goodbye
Plugin does:
Good Bye :D

This is good news I and I really like the site, I will be using it a lot more as I learn more about ruby it explains things really well, and it looks like if you can afford the $8/month you can get a lot more articles by the same guy at practicingruby.com

Summary

So in short… Sentinel will have plugins, I like the blogs at ruby best practices and This blog will also be short :D

soimasysadmin.com

What’s in a name!

I took the decision today to set up soimasysadmin.com to point to this blog there’s a number of reasons for this:

  1. Looks cooler!
  2. Blog could be transferred at a later date
  3. I needed something to do

It came about a couple of days ago when I was looking at another wordpress based site that was being hosted else where. I do run my own servers at home but I have home broadband and it probably isn’t as good as what wordpress can supply so I made the decision to let it be hosted else where. However, having seen some nicely themed wordpress sites and the versatility of it as a platform I’ve been quite impressed.

One of the other reasons for looking at this as an option is the ability to manage at a greater level the content of the site, such as hooking in google analytics or putting my own adwords in place all things to be considered for the future and as I see this as a long term game i’m better off making the change now.

I’m going to leave it a few weeks before I actually flip over but I wanted to get the domain out and about and make sure that I can get a couple of referrers updated to help with the transition, hopefully it won’t have a major impact, but who knows!

Originally when I started out I wasn’t sure how long I would keep this going but it’s become a bit of a dumping ground for good information that I’ve learnt over the years and hopefully it’s been useful to more than just myself so over the next year I’ll be thinking about and maybe playing with a few other ideas all of which are helped by having the domain in place.

Oh… I also updated my About page to now have a feedback form so you can use this to contact me rather than commenting if you so wish :)

AWS CopySnapshot – A regional DR backup

Finally!

After many months of talking with Amazon about better ways of getting backups from one region to another they sneak in a sneaky little update on their blog I will say it here, World changing update! The ability to easily and readily sync your EBS data between regions is game changing, I kid you not, in my tests I synced 100GB from us-east-1 to us-west-1 in such a quick time it was done before I’d switched to that region to see it! However… sometimes it is a little slower… Thinking about it, it could have been a blank volume I don’t really know :/

So at Alfresco we do not heavily use EBS as explained Here when we survived a major amazon issue that affected many larger websites than our own. We do still have EBS volumes as it is almost impossible to get rid of them, and by the very nature the data that is on these EBS volumes is very precious so obviously we want it backed up. A few weeks ago I started writing a backup script for EBS volumes, well the script wasn’t well tested it was quickly written but it worked. I decided that I would very quickly, well to be fair I spent ages on it, update the script with the new CopySnapshot feature.

At the time of writing, the CopySnapshot exists in one place, the deepest, darkest place known to man, the Amazon Query API interface; this basically means that rather than simply doing some method call you have to throw all the data to it and back again to make it hang together, for the real programmers out there this is just an inconvenience for me it is a nightmare, it was an epic battle between my motivation, my understanding and my google prowess, in short I won.

It was never going to be easy…

In the past I have done some basic stuff with REST type API’s, set some header, put some variable on the params of the url and just let it go, all very simple, Amazon’s was slightly more advanced to say the least.

So I had to use this complicated encoded, parsed encrypted and back to back handshake as described here with that and the CopySnapshot docs I was ready to rock!

So after failing for an hour to even get my head around the authentication I decided to cheat, and use google. The biggest break through was thanks to James Murty the AWS class he has written is perfect, the only issue was my understanding on how to use modules in ruby which were very new to me. On a side note i thought Modules were meant to fix issues with name space but for some reason even though I included the class in script it seemed to conflict with the ruby aws-sdk I already had so I just had to rename the class / file from AWS to AWSAPI and all was then fine. That and I also had to add a parameter to pass in the AWS_ACCESS_KEY which was a little annoying as I thought the class would have taken care of that, but to be fair it wasn’t hard to work out in the end.

So first things first, have a look at the AWS.rb file on the site it does the whole signature signing bit well and saves me the hassle of doing or thinking about it. On a side note, this all uses version 2 of the signing which I imagine will be deprecated at some point as version 4 is out and about Here

If you were proactive you’ve already read the CopySnapshot docs and noticed that in plain english or complicated that page does not tell you how to copy between regions. I imagine it’s because I don’t know how to really use the tools but it’s not clear to me… I had noticed that th wording they used was identical to the params being passed in the example so I tried using Region, DestinationRegion, DestRegion all failing, kind of as expected seeing as I was left to guess my way through; I was there, that point where you’ve had enough and it doesn’t look like it is ever going to work so I started writing a support ticket for Amazon so they could point out what ever it was I was missing at the moment of just about to hit submit I had a brainwave. If the only option is to specify the source then how do you know the destination? well, I realised that each region has its own API url, so would that work as the destination? YES!

The journey was challenging, epic even for this sysadmin to tackle and yet here we are, a blog post about regional DR backups of EBS snapshots so without further ado, and no more gilding the lily I present some install notes and code…

Make it work

The first thing you will need to do is get the appropriate files, the AWS.rb from James Murty. Once you have this You will need to make the following changes:

21c21
< module AWS
---
> module AWSAPI

Next you will need to steal the code for the backup script:

#!/usr/bin/ruby

require 'rubygems'
require 'aws-sdk'
require 'uri'
require 'crack'

#Get options
ENV['AWS_ACCESS_KEY']=ARGV[0]
ENV['AWS_SECRET_KEY']=ARGV[1]
volumes_file=ARGV[2]
source_region=ARGV[3]
source_region ||= "us-east-1"

#Create a class for the aws module
class CopySnapshot
  #This allows me to initalize the module with out re-writing it
  require 'awsapi'
  include AWSAPI

end

def get_dest_url (region)
  case region
  when "us-east-1"
    url = "ec2.us-east-1.amazonaws.com"
  when "us-west-2"
    url = "ec2.us-west-2.amazonaws.com"
  when "us-west-1"
    url = "ec2.us-west-1.amazonaws.com"
  when "eu-west-1"
    url = "ec2.eu-west-1.amazonaws.com"
  when "ap-southeast-1"
    url = "ec2.ap-southeast-1.amazonaws.com"
  when "ap-southeast-2"
    url = "ec2.ap-southeast-2.amazonaws.com"
  when "ap-northeast-1"
    url = "ec2.ap-northeast-1.amazonaws.com"
  when "sa-east-1"
    url = "ec2.sa-east-1.amazonaws.com"
  end
  return url
end

def copy_to_region(description,dest_region,snapshotid, src_region)

  cs = CopySnapshot.new

  #Gen URL
  
  url= get_dest_url(dest_region)
  uri="https://#{url}"

  #Set up Params
  params = Hash.new
  params["Action"] = "CopySnapshot"
  params["Version"] = "2012-12-01"
  params["SignatureVersion"] = "2"
  params["Description"] = description
  params["SourceRegion"] = src_region
  params["SourceSnapshotId"] = snapshotid
  params["Timestamp"] = Time.now.iso8601(10)
  params["AWSAccessKeyId"] = ENV['AWS_ACCESS_KEY']

  resp = begin
    cs.do_query("POST",URI(uri),params)
  rescue Exception => e
    puts e.message
  end

  if resp.is_a?(Net::HTTPSuccess)
    response = Crack::XML.parse(resp.body)
    if response["CopySnapshotResponse"].has_key?('snapshotId')
      puts "Snapshot ID in #{dest_region} is #{response["CopySnapshotResponse"]["snapshotId"]}" 
    end
  else
    puts "Something went wrong: #{resp.class}"
  end
  
end

if File.exist?(volumes_file)
  puts "File found, loading content"
  #Fix contributed by Justin Smith: http://soimasysadmin.com/2013/01/09/aws-copysnapshot-a-regional-dr-backup/#comment-379
  ec2 = AWS::EC2.new(:access_key_id => ENV['AWS_ACCESS_KEY'], :secret_access_key=> ENV['AWS_SECRET_KEY']).regions[source_region]
  File.open(volumes_file, "r") do |fh|
    fh.each do |line|
      volume_id=line.split(',')[0].chomp
      volume_desc=line.split(',')[1].chomp
      if line.split(',').size >2
        volume_dest_region=line.split(',')[2].to_s.chomp
      end
      puts "Volume ID = #{volume_id} Volume Description = #{volume_desc}"
      v = ec2.volumes["#{volume_id}"]
      if v.exists? 
        puts "creating snapshot"
        date = Time.now
        backup_string="Backup of #{volume_id} - #{date.day}-#{date.month}-#{date.year}"
        puts "#{backup_string}" 
        snapshot = v.create_snapshot(backup_string)
        sleep 1 until [:completed, :error].include?(snapshot.status)
        snapshot.tag("Name", :value =>"#{volume_desc} #{volume_id}")
        # if it should be backed up to another region do so now
        if !volume_dest_region.nil? 
          if !volume_dest_region.match(/\s/) ? true : false
            puts "Backing up to #{volume_dest_region}"
            puts "Snapshot ID = #{snapshot.id}"
            copy_to_region(volume_desc,volume_dest_region,snapshot.id,source_region)
          end
        end
      else
        puts "Volume #{volume_id} no longer exists"
      end
    end
  end
else
  puts "no file #{volumes_file}"
end

Once you have that you will need to create a file with the volume sin to backup, in the following format:

vol-127facd,My vol,us-west-1
vol-1ac123d,My vol2
vol-cd1245f,My vol3,us-west-2

The format is “volume id, description,region” the region is where you want to backup to. once you have these details you just call the file as follows:

ruby ebs_snapshot.rb <Access key> <secret key> <volumes file>

I don’t recommend putting your key’s on the CLI or even in a cron job but it wouldn’t take much to re-facter this into a class if needed and if you were bothered about that.
It should work quite well if anyone has any problems let me know and I’ll see what I can do :)

Sentinel update

Many moons ago…

A while back I started to mention the idea of Self healing systems a dedicated system that makes use of monitoring and real time system information to make intelligent decisions about what to do, i.e. I write a complicated program to gradually replace my self. It was suggested about using hooks in Nagios to do the tasks but that misses the intelligence side of what I’m trying to get to, restarting based on Nagios checks is simply an if statement that on a certain condition does something, Sentinel will be more that that.

Back in April I started Sentinel as an open source project As expected the uptake has been phenomenal! absolutely no one has even looked at it :) Either way I am not deterred. I have been on and off re-factoring Sentinel into something a bit more logical Here and I have gone from 3 files to some 13! from 1411 words to 2906 and I even have one fully working unit test! I don’t think I’ll be writing more as at the moment it is not really helping me get to where I want to be quickly but I know I’ll need them at some point!

So far all I have done is split out some of the code to give it structure and added the odd bit here and there. The next thing I need to start doing is to make it better, there’s a number of options:

  • Writing more providers for it so it can start to manage disks, memory etc etc so it’s a bit more useful
  • Sorting out the structure of the code adding in more error handling / logging and resilience
  • Integration with Nagios or some tool that already monitors system health and use that to base actions off of
  • Daemonize Sentinel so it runs like a real program!
  • Configuration file rather than CLI

What to do

I think for me I’d rather sort out the structure of the code and improve what is already there first, I’m in no rush with this so the least I could do is make what I have less hacky. This also gives me the opportunity to start working out how I’d rather have the whole thing structured.

I did look at writing a plugin framework so it would be possible to just drop in a module or something similar and it would have the correct information about how to manage what ever it was written to do, but I figured that was a bit beyond me at this time and I have better things to do!

After that I think the configuration file and daemonizing the application, the main reason for this will be to identify any issues with it running continually any issue here would be nice to know sooner rather than later.

This then leaves more providers and nagios type integration which i’m sure will be fun.

Give it AI!

Once those items are done this leaves sentinel with one more thing to do, start intelligently working out solutions to problems, obviously I don’t know the right way to tackle this however I do have a few ideas though.

In my head… I think how I would solve an issue and inevitably it starts with gathering information about the system, but how do you know what information is relavent to which problems and how much weighting should it have? well for starters I figure each provider would return a score about how healthy it thinks it is. So for example:

A provider for checking the site is available notices that it’s not available; this produces a score that is very high say 10000. It then makes sure it’s got the latest information from all providers on the server. One of those providers is disk which notices one of the volumes is 67% full but the thresholds have been set to warn at 70 and 95 % so it sets a score of say 250 and is ranked in a list somewhere to come back to if all else fails.

At this point it is unlikely that disk is the culprit, we have to assume that whomever set the thresholds knew something about the system, so more information is needed, it checks the local network and gets back a score of 0 as far as the network provider can tell it’s working fine it can get to localhost, the gateway another gateway on the internet. A good test at this point is to try and work out which layer of the OSI model the issue is, so one of the actions might be to connect to port 80 or 443 or both and see what happens, is there a web response? or not, if there is does it have any words in it or a response code that suggests it’s a known web error like a 500 or does it not get connected.

And so on and so forth, this would mean that where ever this logic exists it has to make associations betten results and the following actions. one of the ways to do this is to “tag” a provider with potential subsystems that could affect it then based on the score of each of the subsystems produce a vector of potential areas to check, combined with the score it’s possible to travel the vector and work out how likely each is to fix the issue, as and when each one produces a result it either dives in a new vector either more detailed or not. It would then, in theory be possible to start making correlations between these subsystems, so say the web one requires disk and networking to be available and both the networking and disk require CPU then it can assume that web one needs that and base don how many of these connections exist it can rank it higher or lower much in the same way a search engine would work.

But all of this is for another day, today is just about saying it’s started and I hope to continue on it this year.