This time, we survived the AWS outage

Another minor bump

Anyone based in the US East region of AWS knows that yet again there were issues with EBS volumes, although you wouldn’t know it if you looked at their website. It’s a bit of a joke when you see headlines like “Amazon outage takes down Reddit, Foursquare, others”, yet on their status page a tiny little note icon appears stating that there’s a slight issue, extremely minor, don’t worry about it. Yeah right.

The main culprits were EC2 and the API, both of which were EBS related.

“Degraded EBS performance in a single Availability Zone
10:38 AM PDT We are currently investigating degraded performance for a small number of EBS volumes in a single Availability Zone in the US-EAST-1 Region.
11:11 AM PDT We can confirm degraded performance for a small number of EBS volumes in a single Availability Zone in the US-EAST-1 Region. Instances using affected EBS volumes will also experience degraded performance.
11:26 AM PDT We are currently experiencing degraded performance for EBS volumes in a single Availability Zone in the US-EAST-1 Region. New launches for EBS backed instances are failing and instances using affected EBS volumes will experience degraded performance.
12:32 PM PDT We are working on recovering the impacted EBS volumes in a single Availability Zone in the US-EAST-1 Region.
1:02 PM PDT We continue to work to resolve the issue affecting EBS volumes in a single availability zone in the US-EAST-1 region. The AWS Management Console for EC2 indicates which availability zone is impaired. “

The actual message is much, much longer, but you get the gist: a small number of people were affected. Yet most of the major websites that use Amazon were affected; how can that be considered small?

Either way, this time we survived, and we survived because we learnt. Back in June and July we experienced these issues with EBS, so we did something about it. Now why didn’t everyone else?

How Alfresco Cloud Survived

So back in June and July we were heavily reliant on EBS just like everyone else: we had an EBS backed AMI which we then used Puppet on to build out the OS. This is pretty much what everyone does, and this is why everyone was affected. Back then we probably had 100 – 150 EBS volumes, so the likelihood of one of them going funny was quite high; now we have about 18, and as soon as we can we will ditch those as well.

After being hit twice in relatively quick succession we realised we had a choice: be lazy or be crazy. We could have been lazy and said that Amazon had issues, that they weren’t that frequent and weren’t likely to happen again; or we could be crazy and reduce our EBS usage as much as possible. We went for crazy, and now it has paid off.

Over the last few months I’ve added a number of posts about The Cloud, Amazon and Architecting for the cloud, along with a few funky Abnormal puppet set ups and oddities in the middle. All of this was spawned from the EBS outages; we had to be crazy. Amazon tell us all the time: don’t have state, don’t rely on anything other than failure, use multiple AZs, etc. All of the big players that were affected would have been told that they should use multiple Availability Zones, but as I pointed out Here their AZs can’t be fully independent, and yet again this outage proves it.

Now, up until those outages we had done all of that, but we still trusted Amazon to remain operational. Since July we have made a concerted effort to move our infrastructure onto elements within Amazon that are more stable, hence the removal of EBS. We now only deploy instance-store backed EC2 nodes, which means we have no ability to stop and restart a server, but it also means that we can build them quickly and consistently.

We possibly took it to the extreme: our base AMI, now instance backed, consists of a single file that does a git checkout; once it has done that it simply builds itself to the point that Chef and Puppet can take over and run. The tools used to do this are many, but needless to say it is many hundreds of lines of bash, supported by Ruby, Java, Go and any number of other languages or tools.

We combined this with fully distributing Puppet so it runs locally; in theory once a box is built it is there for the long run. We externalised all of the configuration so Puppet was simpler and easier to maintain. Puppet, its config, the base OS, the tools to manage and maintain the systems are all pulled from remote services, including our DNS, which automatically updates itself based on a set of tags.
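To give a flavour of the tag-driven DNS idea, here is a simplified sketch, not our actual tooling; the zone name is made up and the tag names mirror the ones used in the cost class later in this post. The principle is no more complicated than reading the instance tags back from the API and emitting records from them:

require 'rubygems'
require 'aws-sdk'

ec2 = AWS::EC2.new(:access_key_id => ENV['AWS_ACCESS_KEY_ID'],
                   :secret_access_key => ENV['AWS_SECRET_ACCESS_KEY'])

# Emit a simple A record for every running instance that carries a Name tag
AWS.memoize do
  ec2.instances.each do |instance|
    next unless instance.status == :running
    name = instance.tags['Name']
    next if name.nil? || name.empty?
    puts "#{name}.example.internal. 300 IN A #{instance.private_ip_address}"
  end
end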

Summary

So, how did we survive? We decided every box was not important: if some crazy person could not randomly delete a box or service and have the system keep working, then we had failed. I can only imagine that the bigger companies, with a lot more money and people and time looking at this, are still treating Amazon as a datacentre rather than a collection of web services that may or may not work. With the distributed Puppet and config, once our servers are built they run happily on a local copy of the data, no network, and that is important because AWS’s network is not always reliable and nor is their data access. If a box no longer works, delete it; if an environment stops working, rebuild it; if Amazon has a glitch, keep working. Simple.

Metrics from Amazon

Amazon have an API for that

It was words to that effect I heard; I just wish it was what I wanted. After watching a colleague get the billing metrics it was clear after a short look that they were very good for estimating your total expense, but they were never going to be an accurate figure.

So in short, if you want to be disappointed by Amazon guessing your monthly cost, you can find some information Here.

If you wish to learn more, read on…

What do you want?

We were looking at the metrics because we wanted to report on our running costs on a granular basis, so we could see the hourly cost of our account as and when new services were turned on or off, or when we added / subtracted a node from an existing system. In addition to that snapshot we wanted to be able to compare one week to the next, and against some other operational metrics such as users online.

So after discussing it for a little while it was clear the Amazon metrics are good for predicting, not good for historical data, and ultimately not very accurate, as they were always a potential and never an actual. I made the decision that we were better off getting the information ourselves, which at first sounded crazy, and the more I think about it the more I agree; really Amazon should offer this in a better way through their current billing.

Knowing what we wanted meant I didn’t have to bother tracking the things we don’t care about. What is really important to us? Is it the disk space being used? Is it the bandwidth? The cost of ELBs? Nope, we just really care about how much the instances we are running cost.

In the end that’s all that mattered: are we spending more money or less money, and with this money are we providing more or less value? Ideally we will reduce cost and increase performance, but until we start tracking the figures we have no idea what is actually going on, unless we spend hours looking at a summarised cost and guessing backwards… well, until now anyway.

Over the last 5 days I’ve spent some time knocking together some ruby scripts that will poll Amazon and get back the cost of an account based on the current running instances across all regions. For us that is good enough, but I decided to add extra functionality by getting all of the nodes as well; it sort of acts like an audit trail and will allow us to do more in-depth monitoring if we so wish. For example… if we switch instance type does that save us money and give more performance? Well, we wouldn’t know, especially if we didn’t track what was running.

We also wanted to track the number of objects in a bucket within S3. Now, our S3 buckets have millions of objects in each of them; if you use the aws-sdk to get this it will take forever +1 day, and if you use right_aws it will still take a long time, just not as much (over 30 mins for us). This isn’t acceptable, so we’re looking at alternative ways to generate this number quicker, but the short answer is it’ll be slow. If I come up with a fancy S3 object count alternative I’ll let you know, but for now I have had to abandon that. Unless Amazon want to add two simple options like s3.totalObjects and s3.totalSize…
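For the curious, the slow version is trivial to write, which is part of the frustration. A minimal sketch with the aws-sdk gem follows (the bucket name is illustrative); under the hood it pages through the bucket 1000 keys at a time, hence the run time on buckets with millions of objects:

require 'rubygems'
require 'aws-sdk'

s3 = AWS::S3.new(:access_key_id => ENV['AWS_ACCESS_KEY_ID'],
                 :secret_access_key => ENV['AWS_SECRET_ACCESS_KEY'])

# Each page of results is a separate LIST request, so this is one API call per 1000 keys
count = 0
s3.buckets['my-example-bucket'].objects.each { |obj| count += 1 }
puts "Total objects: #{count}"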

It’s just data

So, we gathered our Amazon data and we gathered some data from our product; this was about a day into the project. All of this was done on the fly, and it used to take 20 seconds or so to get the information. We had a quick review and it was decided we should track now vs the last week, which made a slight difference as it meant we now needed to store the data.

This is a good thing: by storing the data we care about we no longer have to make lots of long-winded calls that hang for 20 seconds, it’s all local, speed++

Needless to say the joys of storing the data and searching back through it are all interesting, but I’m not going to go into them.

To take the data and turn it into something useful requires reports to be written; as long as it’s raw data no one will particularly care, but once a figure is associated with a cost per user, or this account costs X per hour, people care more. One of the decisions made was to abstract the data separately from the formatting of the output, mainly as there are multiple places to send the data: we might want to graph some in graphite, email some of the other data and squirrel away the rest in a CSV output somewhere.

An advantage of this is that now the data has been generated there’s one file to modify to choose how and what data should get returned, which gives us the ability to essentially write bespoke reports for whatever is floating our boat this week.
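As a trivial illustration of that split (the figures and metric names below are placeholders), the gathering code hands over a plain hash and a separate, easily edited chunk decides where it goes:

#!/usr/bin/env ruby
# The data abstraction hands us a plain hash; the formatting below is the only
# part that changes when we want a new style of report.
report = { :hourly_cost => 12.34, :instance_count => 42 }

case ARGV[0]
when 'csv'
  puts report.keys.join(',')
  puts report.values.join(',')
when 'graphite'
  # Graphite's plaintext protocol is simply "metric value timestamp"
  report.each { |key, value| puts "aws.#{key} #{value} #{Time.now.to_i}" }
else
  report.each { |key, value| puts "#{key}: #{value}" }
end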

A Freebie

I thought this might be useful: it’s the class we are using to get our instance data from Amazon. It’s missing a lot of error checking but so far it is functional, and as everyone knows, before it can be useful it must first work.

#
# Get instance size cost
#

require 'net/http'
require 'rubygems'
require 'json'
require 'aws-sdk'

class AWSInstance
 
  def initialize (access_key, secret_key)
      @access_key_id=access_key
      @secret_access_key=secret_key
  end

  def cost 
    instances = get_running_instances
    cost = 0.00
    #Calc Cost
    instances.each_pair do |region, instance_types|
      instance_types.each_pair do |instance_type, nodes|
        #nodes is a hash of instance id => details, so its size is the running count
        cost += (nodes.size.to_f * price(instance_type,region).to_f)
      end
    end
    return cost
  end

  def get_instances
    #Return all running instances as that hash includes the cost per instance
    return get_running_instances
  end

  private
  def price (api_size, region)
    #Amazon publish the on-demand price lists as JSON; the RHEL list is used here,
    #the standard Linux list lives at /ec2/pricing/pricing-on-demand-instances.json
    rhel_url = "/rhel/pricing-on-demand-instances.json"
    price = 0
    size=""
    instance_type=""
    response = Net::HTTP.get_response("aws.amazon.com",rhel_url)
    #puts response.body 
    # Convert to hash
    json_hash = JSON.parse(response.body)

    # api_size i.e. m1.xlarge
    # Get some info to help looking up the json
    case api_size
      when "m1.small"
        size = "sm"
        instance_type = "stdODI"
      when "m1.medium"
        size = "med"
        instance_type = "stdODI"
      when "m1.large"
        size = "lg"
        instance_type = "stdODI"
      when "m1.xlarge"
        size = "xl"
        instance_type = "stdODI"
      when "t1.micro"
        size = "u"
        instance_type = "uODI"
      when "m2.xlarge"
        size = "xl"
        instance_type = "hiMemODI"
      when "m2.2xlarge"
        size = "xxl"
        instance_type = "hiMemODI"
      when "m2.4xlarge"
        size = "xxxxl"
        instance_type = "hiMemODI"
      when "c1.medium"
        size = "med"
        instance_type = "hiCPUODI"
      when "c1.xlarge"
        size = "xl"
        instance_type = "hiCPUODI"
      when "cc1.4xlarge"
        size = "xxxxl"
        instance_type = "clusterComputeI"
      when "cc2.8xlarge"
        size = "xxxxxxxxl"
        instance_type = "clusterComputeI"
      when "cg1.4xlarge"
        size = "xxxxl"
        instance_type = "clusterGPUI"
      when "hi1.4xlarge"
        size = "xxxxl"
        instance_type = "hiIoODI"
    end
  # json_hash["config"]["regions"][0]["instanceTypes"][0]["sizes"][3]["valueColumns"][0]["prices"]["USD"]
    json_hash["config"]["regions"].each do |r|    
      if (r["region"] == region.sub(/-1$/,''))
        r["instanceTypes"].each do |it|
          if (it["type"] == instance_type)
            it["sizes"].each do |sz|
              if (sz["size"] == size)
                price=sz["valueColumns"][0]["prices"]["USD"]
              end
            end
          end 
        end
      end
    end
  
    return price
  end

  def get_running_instances
    #Set up EC2 connection
    ec2 = AWS::EC2.new(:access_key_id => @access_key_id, :secret_access_key=> @secret_access_key)
    instance_hash = Hash.new
    
    #Get a list of instances
    #AWS.memoize caches repeated API calls within the block and cuts down on chatter
    AWS.memoize do
      ec2.regions.each do |region|
        instance_hash.merge!({region.name => {}})
        region.instances.each do |instance|
          if (instance.status == :running)
            #Need to create a blank hash of instance_type else merge fails
            if (!instance_hash[region.name].has_key?(instance.instance_type) )
              instance_hash[region.name].merge!({instance.instance_type => {}})
            end
            #For all running instances 
            tmp_hash={instance.id => {:env => instance.tags.Env, :role => instance.tags.Role, :name => instance.tags.Name, :cost => price(instance.instance_type,region.name) }}
            instance_hash[region.name][instance.instance_type].merge!(tmp_hash)
          end
        end
      end
    end
    return instance_hash
  end
end #End class
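Usage is as simple as it looks; something along these lines (pulling the keys from environment variables is just my preference here):

# Example usage of the class above
aws = AWSInstance.new(ENV['AWS_ACCESS_KEY_ID'], ENV['AWS_SECRET_ACCESS_KEY'])
printf("Current hourly instance cost: $%.2f\n", aws.cost)
puts "Running instances per region:"
aws.get_instances.each_pair do |region, types|
  total = types.values.map { |nodes| nodes.size }.inject(0) { |sum, n| sum + n }
  puts "  #{region}: #{total}"
end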

AWS best practice – Architecting the cloud

Architecting the Cloud

In this post I will go over some best practices to help you architect a solution that will hopefully survive most Amazon incidents. To start with, let’s look at a single region and how to make the best use of a region.

Instances

Starting with the most basic steps first, you want each instance you create to be as stateless and as lightweight as possible. Ideally you would use instance-store backed instances, as these do not rely on EBS to be working; you are reducing your dependency on the Amazon infrastructure, and one less dependency is one less thing to go wrong. If you can not avoid the use of an EBS backed instance then you will want to ensure that you have multiple instances providing the same service.

Also consider how your service is used. S3 is slow if you download the data and then share it out again, but you could push the handling of the access off to Amazon, helping make your environment a bit more stateless. It is also worth noting that there have been far fewer issues with S3 than EBS. Obviously if you need the capacity of EBS (S3 has a single file size limit of 5TB) then RAID the drives together for data storage. You can not do this for your instance storage, but at least your data will be okay.

On a side note, out of 200+ volumes during a recent outage we only had one with issues, so they are quite reliable, although sometimes slow. However, if your aim is ultimate uptime you should not rely on them.

Storage

As I pointed out before, your main storage types are EBS and S3. EBS is block device storage and as a result is just another hard drive for you to manage; you can set up RAID on them or leave them as single disks. Then there is S3, which is a key value store accessed via a REST API.

With EBS and S3 it is never stated anywhere that your data is backed up. Your data is your responsibility: if you need a backup you need to take snapshots of the data, and if you want an “off site” equivalent you would need to make sure you have the EBS snapshot replicated to another region; the same applies for S3.

A big advantage of EBS is the speed to write and read from it; if you have an application that requires large amounts of disk space then this is your only real option without re-architecting.

S3 is Simple, hence the name, and as a result it very rarely goes wrong, but it does have a lot more limitations around it compared to EBS. One of them is down to the reliability: it won’t send a confirmation that the data has been written until it has been written to two AZs, so for light usage and non time-dependent work it is probably your best choice. S3 is ideal, however, for data that is going to be read a lot; one reason is it can easily be pushed into CloudFront (a CDN) and as a result you can start offloading the work from your node.

In short, where possible don’t store anything. If you do have to store it, try S3 so you can offload the access; if that is not adequate then fall back to EBS, make sure you have decent snapshots and be prepared for it to fail miserably.

Database Storage

RDS is a nice database service that will take care of a lot of hassle for you, and I’d recommend using it, or DynamoDB. RDS can be split across multiple AZs and the patch management is taken care of for you, which leaves you to just configure any parameters you want and point your data at it. There is a limitation with RDS of 1TB of database storage, but in most cases I’d hope the application could deal with this somehow; otherwise you are left trying to run a highly performant database in Amazon, at which point you are on your own.

Unless of course you can make use of a non-relational database such as DynamoDB, which is infinitely scalable, performant and totally managed for you. Sounds too good to be true? Well, of course: it is a non-relational database and the scaling speed is limited. At the present moment in time you can only double the size and speed of your DynamoDB once per day, so if you are doing a marketing campaign you may have to take this into account days in advance.

Availability Zones

Hopefully by this point you have chosen the right type of instance and storage locations for any data, leaving you the joys of thinking about resilience. At a bare minimum you will want to have servers that provide the same functionality spread out across multiple AZs and some sort of balancing mechanism, be it an ELB, round robin DNS or latency based DNS.
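For the round robin DNS option, it can be as simple as publishing one record per AZ for the same name; something like the following (the addresses are examples from the documentation ranges):

; simple round robin: two web servers, one per Availability Zone, behind one name
www.example.com.    60    IN    A    203.0.113.10
www.example.com.    60    IN    A    203.0.113.20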

The more availability zones your systems are in, the more likely you are to be able to cope with any incidents. Ideally you would take care of this through the use of auto scaling; that way if there is a major incident it will bring nodes back for you.

Having instances in multiple AZs will protect you in about 80% of cases. However, EBS and S3, although spread across multiple AZs, are a single point of failure, and I have seen issues where access to EBS backed instances is incredibly slow across a number of servers; in my case 50% of servers across multiple availability zones were all affected by accessibility of the data. So your service can not rely on a single region for reasons like this. One of the reasons, I believe, is that when EBS fails there is some sort of auto recovery which can flood the network and cause some disruption to other instances.

A little known fact about AZs is that every client’s AZ mapping is different. If you have 2 accounts with Amazon you may well get presented different AZs, and even those with the same name may in fact be different AZs, and vice versa.

Regions

With all of the above you can run a quite successful service all in one region with a reasonable SLA, if you are lucky enough not to have any incidents. At the very least you should consider making your backups into another region. Multiple regions, much like multiple data centres, are difficult, especially when you have no control over the networking, and this leaves you in a bit of a predicament. You can do latency based routing or weighted round robin within Route53; in this case, if a region goes offline your traffic should be re-routed to the alternative address.

Things to watch out for

Over the months we’ve been hosting on AWS there have been a number of occasions where things don’t work the way you expect them to, and the aim of this section is to give you some pointers to save you the sorrow.

Instance updates
There have been a number of occasions where an instance has stopped working for no good reason: all of a sudden the network may drop a few packets, the IO wait may go high, or in general it is just not behaving the way it should. In these situations the only solution is to stop and start the instance. A little known fact is that the stop and start process will ensure that your instance is on hardware with the latest software updates. However, I have been told by AWS support that new instances may end up on hardware that is not optimal, so as a result you should always stop and start new instances.

In severe cases Amazon will mark a node in a degraded state, but I believe they will only do this after a certain percentage of instances have migrated over or it has been degraded for a while.

Scaling up instance size

This is an odd one, predominantly because of a contradiction. You can easily scale up any instance by stopping it in the web GUI and changing its size on the right click menu. This is good: you can have a short period of downtime and get a much larger instance, the downside being that your IP and DNS will change as it is a stop and start. However, if you had deployed your instance via CloudFormation it would be able to scale up and down on the fly with a CloudFormation script change.

Security with ELBs
With security groups you can add TCP, ICMP or UDP access rules to a group from another security group or from a network range, thus securing instances in the same way a perimeter firewall would. However, this doesn’t guarantee security, specifically if you then add an ELB to the front end. With ELBs you do not know what their network would or could be, so you ultimately would need to open up full access just to get the ELB to talk to your host. Now, Amazon will allow you to add a special security group that will basically grant the ELBs full access to your security group, and as a result you have guaranteed that access is now secured, for the most part.

However, ELBs are by their nature publicly accessible, so what do you do if you’re in EC2 and want to secure an ELB which you may need to load balance some internal traffic? Well, nothing. The only option available to you in this situation is to use an ELB within a VPC, which gives you the ability to apply security groups to the ELB.

There are ways to architect around this using Apache, but this does depend on your architecture and how you intend to use the balancer.

Everything will fail
Don’t rely on anything to be available; if you make use of the API to return results, expect it to fail or to not be available. One thing we do is cache some of the details locally and add some logic around the data so that if it’s not available the system continues to work. The same principle applies to each and every server / service you are using: where possible just expect it to not be there, and if it has to be there, at least make sure it fails gracefully.
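As a rough sketch of that caching pattern (the file path and the lookup are illustrative rather than our actual code): try the API, refresh the local copy on success, and fall back to the last known good copy on failure.

require 'rubygems'
require 'json'
require 'aws-sdk'

CACHE_FILE = '/var/cache/aws/instances.json'

def instance_data
  ec2 = AWS::EC2.new(:access_key_id => ENV['AWS_ACCESS_KEY_ID'],
                     :secret_access_key => ENV['AWS_SECRET_ACCESS_KEY'])
  data = AWS.memoize do
    ec2.instances.map { |i| { 'id' => i.id, 'type' => i.instance_type } }
  end
  File.open(CACHE_FILE, 'w') { |f| f.write(JSON.dump(data)) }   # refresh the cache
  data
rescue StandardError
  # The API (or the network) is having a bad day; carry on with the last good copy
  JSON.parse(File.read(CACHE_FILE))
end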

AWS best practice – Introducing Amazon

Introducing Amazon

Last week I introduced the Cloud; if you missed it and feel the need to have a read you can find it Here. Now on with Introducing Amazon…

I’m not really going to introduce all of Amazon, as Amazon release a lot of new features each month, but I will take you through some of the basics that Amazon offer so that when you’re next confronted with them it is not a confusing list of terms. I won’t go into any of the issues you may face as that is a later topic.

EC2 Elastic Compute Cloud, this is more than likely your entry point. It is, in short, a virtual platform to provide you an OS on; they come in various shapes, sizes and flavours. For more information on EC2 click here

ELB Elastic Load Balancer, this is used to balance web traffic or TCP traffic depending on which type you get (layer 7 or layer 4). An ELB is typically used to front your web servers that are in different Availability Zones (AZs), and they can do SSL termination.

Security Groups These are quite simply containers that your EC2 instances live in and you can apply security rules to them. However, two instances in the same security group will not be able to talk to each other unless you have specifically allowed them to do so in the security group. It is this functionality that separates a security group from being considered a network, that and the fact each instance is in a different subnet.

EIP Elastic IP, these are public IP addresses that are static and can be assigned to an individual EC2 instance; they are ideal for public DNS to point to.

EBS Elastic Block Storage, in short, a disk array attached to your EC2 instance. EBS volumes are persistent disk stores; most EC2 instances are EBS backed and are therefore persistent. However, you can mount ephemeral disk drives that are local storage on the virtual host; these disk stores are non-persistent, so if you stop / start an instance the data will be lost (they will survive a reboot).

S3 Simple Storage Service, S3 is a simple key value store, but one that can contain keys that are folders, and the value can be anything: text files, word docs, ISOs, html pages etc. You can use S3 as a simple web hosting service if you just upload all of your html to it and make it public. You can also push S3 data into a CDN (CloudFront). There are some nice security options around accessibility permissions and at-rest encryption for your S3 buckets. An S3 bucket is just the term to describe where your data ends up and is the name of the S3 area you create.

IAM Identity and Access Management, this is a very useful service that will allow you to take the original account you used to sign up to Amazon with and lock it away for eternity. You can use IAM to create individual accounts for users or services and create groups to contain the users in; with users and groups you can use JSON to create security policies that grant the user or group specific access to specific services in specific ways.
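As an example of the sort of policy JSON involved (the bucket name is made up), a policy that only lets its users read from a single S3 bucket looks something like this:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetObject"],
      "Resource": ["arn:aws:s3:::my-example-bucket", "arn:aws:s3:::my-example-bucket/*"]
    }
  ]
}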

VPC Virtual Private Cloud, this is more or less the same service you get via EC2 but private. There are some interesting elements of it that are quirky to say the least, but you can create your own networks, making your services private from the greater Amazon network, and you can still assign EIPs if you so wish. Most services, but not all, are available within a VPC, and some features are only available in VPCs, such as security groups on ELBs.

AZ Availability Zones are essentially data halls, or areas of racks that have independent cooling and power but are not geographically dispersed, i.e. an AZ can be in the same building as another. Amazon’s description is as follows: “Availability Zones are distinct locations that are engineered to be insulated from failures in other Availability Zones and provide inexpensive, low latency network connectivity to other Availability Zones in the same Region”. This will be touched on later.

Region A region is a geographically dispersed Amazon location; it could be in another country, it could be in the same country. I’d imagine that all are at least 30 miles apart, but Amazon are so secretive about everything it could be that building behind you.

If you want to know more about the products I would read the product page here. In next week’s post I’m going to start going into a bit of detail about architecting for the cloud and some design considerations that you should be aware of.

AWS best practices – Introducing cloud

Overview

With this series of posts over the next few weeks I am attempting to help those new to Amazon Web Services (AWS) get a step up and to help you avoid some of the pitfalls that I have encountered; the sort of guide I would have been biting people’s hands off for when I was in the same boat. But before we go any further, a picture. People like pictures, you are people, so here’s a picture.

Money Tree 2

It’s a picture of a tree trunk which over the years has been embedded with coins; people walk by, they see that someone has pushed a coin in, so they do the same. Rinse and repeat for 10 years or so and you end up with several trees like this one. This is essentially what the Cloud is to people. 5 years ago, no one knew what the cloud was, no one cared; 3 years ago, people said “Hey, look at this”; 2 years ago, people said the cloud was going to change the world; a year ago, people said big business was adopting the cloud; and today I tell you not to without reading through this.

Although I am going to focus on AWS, the topics covered are more than likely relevant to other cloud providers, and I would encourage you to read through this to cement the foibles of the largest cloud provider, Amazon, so you can better understand the constraints they place on you and those other cloud providers may place on you. Now on with the post.

What is the Cloud

Apologies for the history lesson, feel free to breeze over…

In the beginning there was only one way: build a data centre, source your own power, cooling and network, and start building out a data centre full of disk arrays, high performance servers and networking equipment. I would label these the “Golden age”, but truth be told running your own data centre from the ground up can not be easy.

As with everything there was progression; some smart people noticed an opportunity and started to take over the management, so all you had to do was turn up with your servers, disk arrays and networking. This is co-location, and it is a good way of doing things; it is not as cheap as doing your own but it takes a lot of the hassle out of it.

Leading on from this, companies began to form that went one step further: they would provide the equipment for you, so all you had to do was log in. All the disk and network worries were taken care of and they would help you on your way, of course charging a premium for the service. Moving on from this, but in the same area of hosted services, are the almost fully managed solutions where they do everything; you give them an application and they make it happen. Great if you don’t have an IT team.

Getting onto more recent times, virtualisation has really taken off in the last 10 years despite being around for longer. I believe the big drive for this was after the “.com bubble” burst back in the early 2000s, when companies were looking for ways to save costs on their hosted or co-location services. One of the ways this was achieved was through virtualisation such as Xen and VMware. In most cases the equipment was run and managed by the company, and people over-allocated or under-allocated memory and CPU depending on their use profiles, with all sorts of redundancy.

As you can see from all of this, the constant push is to reduce costs; granted, running your own data centre is the cheapest way, but you will need a few thousand servers to make it so. A company called Amazon, who you have probably heard of (they run a web shop, by the way), noticed that even with all of their virtualisation technology they still needed a large percentage of servers just for 2-3 weeks of their business each year; the rest of the time the boxes sat idle. But what are you to do? You have to have the capacity for your peaks. Well, they worked it out and we have the cloud. I’m not 100% sure if they had the big idea, but they certainly took the idea and ran with it.

The idea behind Cloud computing is a utility based cost, $X per hour. This comes across as a very cheap model, but as we’ll find out in the following posts it’s not that cheap and it depends heavily on your usage model. With the Cloud you now have the ability to choose how much disk you want and for how long, and how much CPU time you need. These are the joys of Cloud computing.

Summary

That was a rather long introduction to the cloud, but with this understanding of the history behind it and how it was born you will now hopefully appreciate where it is going. I wouldn’t be surprised if most of the features Amazon release are just new ways of them making better use of their own applications and architecture, then working out how to do that more times to cover the costs and offer it as a service.

AWS Outage

Not again

Yes, again, another Amazon outage; in fact their reports are a little misleading and much more forgiving than the truth. For some background on the rant look Here, and for the official words: Here

So on the 29th we saw a small, minor issue with a couple of servers’ EBS volumes suffering; luckily we identified and fixed the issue quickly by removing the nodes from the clusters. Well, with that problem dealt with, on with a restful weekend… not so much.

During Saturday we had a single minor incident, but on the whole we seemed to survive. At some point in the early hours of Sunday 1st July in the UK (but I guess 30th June for the Americans), another issue, here *sigh* To be honest I typically wouldn’t mind: it’s a data centre, they provide multiple regions for a reason so you can mitigate this, and rest assured the availability zones are all separate. Well, separate-ish. Either way, they provide availability zones which are meant to be fully isolated from each other.

Wakey wakey, rise and shine

So luckily I was not on call and some other team members dealt with the initial fallout. Five hours later my phone starts ringing, which is fine, it will do that as an escalation. On a side note, about 11pm the night before my PC just stopped working and suffered a kernel panic, so I lost DNS / DHCP and had no easy internet access. I rolled downstairs more or less right away and started playing with my mobile phone to set up a wireless hotspot: wonderful, 2 bars of GPRS, but thankfully it was enough for a terminal (or five).

It turns out that almost all of our servers were completely inaccessible, now we very much divide and conquer with our server distribution, so each node is in a different AZ (Availability Zone) on the assumption that they should be fine. On a side note I will write down some information I’ve learnt about Amazon for those starting a hosting experience with them so they can avoid some of the pitfalls I’ve seen.

Anyway, back to the point I keep procrastinating from. We managed to bring the service back up and working which wasn’t difficult but it did involve quite a bit of effort on our part to get it back working. What I was able to spot was a high amount of IO wait on most instances or at least on the ones I could get onto. In some cases a reboot was enough to kick it on its way but on others a more drastic stop / start was needed.

The annoying thing for me

Is that they have had outages of this kind in the US-East region in the past, Time and Time Again. They obviously have issues with that region, and there are some underlying dependencies between AZs that aren’t clear, like power, EBS, RDS and S3. Obviously these services need to be shared at some point to make them useful, but if they are integral to what you are doing then simply putting your servers in another availability zone won’t be good enough. For example, if you are using any instance… you are probably EBS backed, and as we know they are not infallible.

I do hope that they are looking to make some serious improvements to that region. We are certainly considering other options now and trying to work out the best way to mitigate these types of issues. If you are not heavily tied into US-East I would suggest abandoning ship, apart from your most throw-away-able servers. I’m sure the other regions have their own issues as well, but at the same frequency?

The other thing that is puzzling is that when we saw the issues Amazon claimed all was well. There is certainly a misunderstanding somewhere; I guess it could be that we were hit by the leap second bug, but all of our devices are on new enough kernels that they shouldn’t have been affected. Alas they were; either way something happened and it may remain a mystery forever.

Alfresco Cloud – Out of beta

Finally!

For those of you that don’t know, I work at Alfresco in the Operations department, specifically looking after and evolving our cloud based product. It feels like an absolute age that I’ve been working on the cloud product and the release, but finally today (well, 31st May 2012) we took the “Beta” tag off the product.

Being on the support side of the service I know the system very well and overall I’m really pleased with how it is now and how it will be. The most fantastic thing about the product is knowing what is coming up, just because we have taken the beta label off we are still going to be innovating new ways of doing things, utilising the best technology and writing some bespoke management tools to help support the environment. Granted, now the Beta tag is off we have reduced the amount of disruptive impact we will have on the system, but, unlike all those months ago we now have the right framework around managing and testing the changes we are making.

I’m looking forward to the next few months as I know we’ve got more good stuff coming and I can’t wait to see how the general public take to the product, it’s been an interesting journey and it looks to be getting better!

What can you expect from Alfresco in the cloud?

I’m going to start this with a warning: I’m not in marketing or product design, so this is just the way I see the product and what I like about the cloud product. For those of you that have used Alfresco before, you’ll be familiar with the Share interface; it is somewhat cut down for the cloud but none the less just as powerful. You can still upload your documents, like & comment on them just as always, and you can use the Quick Share feature to share a document via email, Facebook or Twitter, so there’s no need to invite everyone to see a single document or picture. For those privileged enough to sign up you can use WebDAV to mount the cloud as a drive on your local PC, very handy…

And the best bit… well you have to sign up to find that….

Only a short update today, it has been a very busy week to get this all sorted and now it’s time to rejoice… and rest.

DNS results in AWS aren’t always right

A bit of background

For various reasons that are not too interesting we have a requirement to run our own local DNS servers that simply hold the forward and reverse DNS zones for a number of instances. I should point out that the nature of AWS means that this approach is not really ideal, specifically if you are not using EIPs, and there are better ways; however, thanks to various technologies it is possible to get this solution to work, but don’t overlook the elephant in the room.

What elephant?

A few months ago, while doing some proof of concept work, I hit a specific issue relating to RDS security groups, specifically where I had added the security group that my instance was in to grant it access to the DB. One day, after the proof of concept had been running for a number of weeks, access to the DB suddenly disappeared for no good reason, and it was noticed that by adding the public IP of the instance to the RDS security group access was restored. Odd. The issue happened once and was not seen again for several months; it then came back, odd again. Luckily the original ticket was still there and another ticket with AWS was raised, to no avail.

So, a bit of a diversion here: if you are using Multi-AZ RDS instances you can’t afford to cache the DNS record, as at some random moment it may flip over to a new instance (I have no evidence to support this, but also can’t find any to disprove it), so the safest way to get the correct IP address for the DB instance is to ask Amazon for it every time. You can’t simply take whatever the last IP returned was and set up a local hosts file or a private DNS record for it; that’s kinda asking for trouble.

So we had a DNS configuration that worked flawlessly 99.995% of the time, and at some random unpredictable time it would flake out; it was just a matter of time. As everyone should, we run multiple DNS servers, which made tracking down the issue a little harder… however, eventually I did. The results we got back depended on which one of our name servers the instance went to, and how busy AWS’s name server was when whichever of our name servers queried it. Occasionally one of the name servers would return the public IP address for the RDS instance, causing the instance to hit the DB on the wrong interface, so the mechanism that does the security group lookup within the RDS security group was failing; it was expecting the private IP address.

The fix

It took a few mins of looking at the DNS server configuration, and all looked fine; if it had been running in a corporate network that would have been fine, but it is not, it’s effectively running in a private network which already has a DNS server running split views. The very simple mistake that was made was the way the forwarders had been set up in the config.

See the following excerpt from here

forward
This option is only meaningful if the forwarders list is not empty. A value of first, the default, causes the server to query the forwarders first, and if that doesn’t answer the question the server will then look for the answer itself. If only is specified, the server will only query the forwarders.

The forward option had been set to first, which for a DNS server in an enterprise is fine: it will use its forwarders first, and if they don’t respond quickly enough it will look up the record via the root name servers. This is typically fine, as when you’re looking up a public IP address it doesn’t matter; however, when you’re looking up a private IP address against a name server that uses split views it makes a big difference in terms of routing.

What we were seeing was that when the AWS name servers were under load / not able to respond quickly enough, our name server got a reply from the root name servers, which were only able to give the public IP address. Therefore our instance routes out to the internet, hits Amazon’s internet router, turns around and hits the public interface for the RDS instance on its NATed public IP, and is thus not seen as being within the security group. Doh!

Luckily the fix is easy: set it to “forward only” and ta-daa. It may mean that you have to wait a few milliseconds longer now and then, but you will get the right result 100% of the time. I think this is a relatively easy mistake for people to make, but it can be annoying to track down if you don’t have an understanding of the wider environment.
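For reference, the relevant part of named.conf ends up looking something like this (the forwarder address is illustrative; use the AWS-provided resolver for your environment):

options {
    // Always ask the AWS resolver and never fall back to the root servers,
    // otherwise split-view records can come back with their public addresses
    forwarders { 172.16.0.23; };
    forward only;
};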

Summary

Be careful, if you’re running a DNS server in AWS right now I suggest you double check your config.

It is probably also worth learning to use “nslookup <domain> <name server ip>” to help debug any potential issues with your name servers, but be aware that because of the nature of the problem you are not likely to see it for a long, long time. Seriously, we went months without noticing any issue and then it just happened, and if you’re not monitoring the solution it could go unnoticed for a very long time.

Google Apps – How easy is it?

A bit of context

Last week, very much inspired by the internal IT team’s flawless switch over to Google Apps, I decided it was about time I resurrected my old email account which was on my personal domain. Now, I used to have a friend run a mail server for me and that worked okay; I used to run a mail server myself and that was okay also. Well, apart from the copious amounts of spam. I took drastic measures when I was hosting it, with SpamAssassin and blacklisted domains etc etc, but still spam made it through.

Spam was the main reason why I stopped hosting my own email: there was just a lot of it and it was becoming much more of a chore than I would have liked. Like most people I’m busy; I don’t really want to get home after work and find out that I’ve received 3000 emails, of which some may be legitimate. I also didn’t want to spend huge amounts of time trying various different tools to cut down on the emails, so I stopped hosting it and let it die.

It has been dead now for at least 3 years, and with this blog, and seeing how easy Google Apps was, I thought “Why not get my email back up and running, but this time pay someone to host it and take care of the crap”. This was a good idea, so I decided to do a bit of digging into Google Apps and the costs.

What I found out was…

I was surprised that Google offered Google Apps for $5 per user per month. Not bad, I thought, but I kept digging; I wanted to make sure I was getting good value for money and wanted to check I had the right plan for me. So I went to the comparison page as a place to compare the different options.

Now, up to this point the most annoying thing with Google Apps was the focus on the business side. I may / may not use my domain for business, I have no income, I have no outgoings, I just wanted my mail hosted for me, and maybe my wife’s too, so I was very pleased to see the individual option on the comparison page.

What I liked about this was that I was an Individual / Group / Entrepreneur, which automatically entitled me to a free account. It was purely by luck that I found this and I was surprised to find it; it was exactly what I wanted: free email hosting by a company that knows how to handle spam and doesn’t get me involved in the process. I have always been fond of Gmail’s ability to filter spam, and now I had it for my personal domains!

Was it easy?

It definitely wasn’t easy to find the free email hosting, but it was really straightforward to set up. There’s a nice walk through that is really simple to follow; the hardest bit was verifying the domain, mainly as I don’t host my website any more, so I spent 15 mins setting up a site on www for my domain only to realise it wanted it on the root of the domain, at which point I went for alternative ways to register the domain. A true winner came up: add a TXT record to DNS! Luckily I host my own domain; by that I mean I have the authoritative zone and I send it out to some public slaves to do the leg work, thanks to Gridstar. So it was that simple: a few step by step instructions on the setup, a bit of time to authorise the domain and bingo, working email hosted by Google.
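For anyone doing the same, the verification is just a single TXT record at the root of the zone, along these lines (the token here is made up; Google gives you the real one during setup):

example.com.    3600    IN    TXT    "google-site-verification=abc123exampletoken"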

Of course I had a Gmail account already, and I was able to use the multi account feature to login to both accounts and flip between my email accounts. So far it’s only been a couple of days but it works fine.

As I mentioned, I was somewhat spurred on by my company’s move to Google Apps; if they hadn’t moved I wouldn’t have looked at Google Apps at all. Either way I’m now pleased I have my domain hosting emails again; it won’t be long before I’ll have it hosting a www website as well!

Cloud deployment 101 – Part3

The final instalment

Over the last couple of weeks I have posted a Foundation to what the cloud really is and How to make the best use of your cloud. This week is about tying off loose ends, better ways of working, dispelling a few myths and setting some things straight.

Infrastructure as code

DevOps is not the silver bullet, but it is a framework that encourages teamwork across departments so that you have rapid agility in deploying code to a production environment.

  • Agile development
    • Frequent releases of minor changes
    • Often the changes are simpler as they are broken down into smaller pieces
  • Configuration management
    • This allows a server (or hundreds) to be managed by a single sysadmin and produce reliable results
    • No need to debug 1 faulty server out of 100; rebuild it and move on
  • Close, co-ordinated partnership with engineering
    • Mitigates “over the wall” mentality
    • Encourages a team mentality to solving issues
    • Better utilises the skills of everyone to solve complex issues

Infrastructure as code is fundamental to rapid deployment. Why hand build 20 systems when you can create an automated way of doing it? Utilising the API tools provided by cloud providers it is possible to build entire infrastructures automatically and rapidly, as in the sketch below.
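As a flavour of what that looks like with the same aws-sdk gem used earlier (the AMI ID, key pair and count are purely illustrative), standing up a handful of identical, tagged nodes is a few lines rather than an afternoon in a console:

require 'rubygems'
require 'aws-sdk'

ec2 = AWS::EC2.new(:access_key_id => ENV['AWS_ACCESS_KEY_ID'],
                   :secret_access_key => ENV['AWS_SECRET_ACCESS_KEY'])

# Launch three identical nodes and tag them so configuration management
# and the reporting scripts know what they are
instances = ec2.instances.create(
  :image_id      => 'ami-12345678',   # illustrative AMI ID
  :instance_type => 'm1.small',
  :key_name      => 'ops-key',        # illustrative key pair
  :count         => 3
)

Array(instances).each_with_index do |instance, i|
  instance.add_tag('Name', :value => "web-#{i + 1}")
  instance.add_tag('Role', :value => 'webserver')
end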

Automation through code is not a new concept: sysadmins have been doing this for a long time through the use of Bash, Perl, Ruby and other such languages. As a result the ability to program and understand complicated object-oriented code is becoming more and more important within a sysadmin role; typically this was the domain of the developer, and a sysadmin just needed to “hack” together a few commands. Likewise, in this new world development teams are being utilised by the sysadmins to fine tune the configuration of application platforms such as Tomcat, or to make specific code changes that benefit the operation of the service.

Through using an agile delivery method, frequent changes are possible. At first this can seem crazy: why would you make frequent changes to a stable system? Well, for one, when the changes are made they are smaller, so between each iteration a total outage is less likely. It also means that if an update does have a negative impact it can be very quickly identified and fixed, again minimising the total outage of a system.

In an ideal world you’d be rolling out every individual feature rather than a bunch of features together. This is a difficult concept for development teams and sysadmins to get used to, especially as they are more used to the on-premise way of doing things.

Automation is not everything

I know I said automation is key, and the more we automate the more stable things become. However, automating everything is not practical, can be very time consuming and can also lead to large scale disaster.

  • Automation, although handy can make life difficult
    • Solutions become more complex
    • When something fails, it fails in style
  • Understand what should be automated
    • Yes you can automate everything, but ask yourself, should you?
    • Automate boring, repetitive tasks
    • Don’t automate largely complex tasks, simplify the tasks and then automate

We need to make sure we automate the things that need to be automated: deployments, updates, DR.
We do not want to spend time automating a solution that is complex; it needs to be simplified first and then automated. The whole point of automation is to free up more time, and if you are spending all of your time automating you are no longer saving time.

Failure is not an option!

Anyone that thinks things won’t fail is being rather naïve. The most important thing to understand about failures is what you will do when there is one.

  • Things will fail
    • Data will be lost
    • A server will crash
    • An update that reduces functionality will make it through QA and then into production
    • A sysadmin will remove data by accident
    • The users will crash the system
  • Plan for failures
    • If we know things will fail we can think about how we should deal with them when they happen.
    • Create alerts for the failure situations you know could happen
    • Ensure that the common ones are well understood on how to fix them
  • You can not plan for everything
    • Accept this, have good processes in place for DR, Backup and partial failures

Following a process makes it quicker to resolve an issue, so creating run books and DR plans is a good thing. Having a wash up after a failure, to ensure you understand what happened, why, and how you can prevent it in the future, will ensure the mitigations are put in place to stop it happening again.
Regularly review operational issues to ensure that the important ones are being dealt with; there’s little point in logging all of the issues if they are not being prioritised appropriately.

DR, backup and restoration of service are the most important elements of an operational service, although no one cares about them until there is a failure; get these sorted first.
Deploying new code and making updates are a nice to have. People want new features, but they pay for uptime and availability of the service. This is kinda counter-intuitive for DevOps, as you want to allow the most rapid changes to happen, but it still needs control, testing and gatekeeping.

Summary

Concentrate on the things that no one cares about unless there’s a failure. Make sure that your DR and backup plan is good and test that it works regularly, and ensure your monitoring is relevant and timely. If you have any issues with any of these, fix them quickly and put the controls in place to ensure they stay up to date.

In regards to automation, just be sensible about what you are trying to do, if it needs automating and is complicated, find a better way.