Why is the UK Government struggling with IT?

Recently I’ve had an insight into how the government is conducting its IT. It’s eye-opening. If I had to sum it up, it’s like applying 1990s approaches to current practices. But what does that mean? Let me explain.

Many moons ago, the UK government decided it was not an IT specialist and that it should not be in the business of running IT. This led to it outsourcing a lot of its core IT, often to large organisations who on the whole, I imagine, did make it much more cost effective, most likely at the cost of efficiency. Fast forward, and over the years there have been many changes to encourage open source technologies and agile methodologies, no doubt to address the inefficiencies that have crept in. It was probably not foreseen that if you outsource IT there’s no incentive for the supplier to do it efficiently. Even now, with the new technologies, there are mindset issues holding back the benefits of a DevOps-style approach.

What do I mean by a DevOps-style approach? I mean that people should be encouraged to use evidence to drive decisions, automate the process, be encouraged to fail, and measure the results against expectations. Based on what I’ve been able to observe over 2 months, there are a number of issues stopping the delivery of true efficiency and agility.

  1. Lack of technical management
  2. Too many consultant / IT outsourcing companies
  3. Mentality and approach barriers

Let’s tackle the causes of these and how to fix them one at a time.

1, Due to the outsourcing approach and the idea that government does not do IT, the technical decisions are being made by people in the government who are not having the problem adequately explained to them. This is happening because the consultancies do not wish to harm their business by being seen as ‘difficult’ or by saying “No” to release schedules. It also means no clear strategy can be defined for the entire department to work towards, as there is no central guidance for the consultancy groups to adhere to. In 2 months I have not met anyone in technology within the department I’m working in who is in a position above me.

2, Due to the preference these days for using small consultancies and bringing in subject matter experts in various fields, there is a lack of cohesion between styles and approaches to delivery. Everyone is pulling in slightly different directions. This is made worse by the lack of central technical leadership, which is why that is the biggest problem that needs fixing.

Stop trying to change things

3, I am not lying when I say I have been told every week for 8 weeks “Stop trying to change things”. A culture has developed where if you speak out you are seen as difficult. If you suggest improvements that require sign-off or budget, they do not happen, because people are scared to approach the subject: they will have to justify how it speeds up delivery, and no one cares about cost savings or efficiencies. People can and will change their outlook when it can be evidenced that change is benefiting the process, but that requires point 1 again: good leadership and the ability to explain why fixing things up is worthwhile.

I’m not saying this is easy; there are payment deadlines that need to be met. It was pointed out to me that when government says it is going to do something it really must, no ifs or buts. Any failure is political cannon fodder and the general public are entirely scathing when things don’t happen on time. This makes it harder than it would typically be; however, not impossible.

As a UK taxpayer I’m utterly shocked at how bad it really is. I had suspicions based on what I’d heard or read, but I wasn’t expecting people to be so worn down and unwilling to change, or for the consultancies to be so scared of speaking up, though I guess when you have a revenue stream dependent on the government you wouldn’t want to cut that off. Ultimately my concern is that new technology will be implemented without fixing the root cause of the issue: the culture and organisation. No doubt if anyone with any power were to read this they would insist that checks and measures are in place to ensure this cannot happen; the reality is that the people deciding are not qualified to know if it is a good deal or not.

How to Fail and make it a Success

There comes a point where we all fail. It doesn’t matter when; if you don’t think it’s happened yet, give it time, because either way it’s coming. The question you need to ask yourself is “What am I going to do about it?”. I’ve worked in places where failure was a point-the-finger affair and places where it wasn’t. It is clear to me that failure is the only way to move forward and succeed; you just need the right strategy for dealing with failure, one that allows you to move on with life and make the changes you need to make things better. Remember, you are not going to fix all problems on your first attempt, but stick to the process, follow it religiously, and eventually you will be in a better place.
Thomas Edison famously got it right:

I have not failed. I’ve just found 10,000 ways that won’t work…

The whole point of failure is to learn from it, and as long as you remember that you will succeed. With that in mind, the most common mistake I see is the failure to learn. It’s fine to fail; fail all day long if you want. The important thing is to have the right mechanism to cope with the failure so you ensure you learn from it. This doesn’t mean it needs to be process heavy, but it does need to be done religiously after every failure.

There are a few things I ask after every failure, regardless of whether it was customer facing or internal; failure is failure is failure.

  • What can we do to stop this happening again?
  • How can we get more notification next time?
  • Did we have the right people looking at this at the right time?

I feel the need to be abundantly clear here: “What can we do to stop this happening again?” literally means what crazy ideas do people have to stop this? Do we add a new layer in? Do we double up somehow? Throw things behind a load balancer? It’s no good having a room full of bright people if you can’t answer this question. There’s always something that can be done: a change in process, some crazy technical solution, or just adding more capacity.

Getting more notification is important, not just after the event; can you predict the event? The obvious example is disk space; when it comes to other issues your mileage may vary. Either way you should be able to do something to give yourself a little more time to start dealing with it, even if it’s something as simple as upping the rate of the checks and the failure notification so you get the alert 1 min sooner than before.
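To make the disk space example a bit more concrete, here is a minimal Python sketch of predicting the event rather than just reacting to it. It fits a straight line through a handful of usage samples and estimates how long until the disk is full. The function name and the sample format are my own assumptions for illustration, and a straight line is a crude model that will not suit every workload.

```python
from typing import Optional, Sequence, Tuple


def seconds_until_full(
    samples: Sequence[Tuple[float, float]], capacity_bytes: float
) -> Optional[float]:
    """Estimate seconds until the disk is full.

    `samples` is a sequence of (timestamp_seconds, used_bytes) pairs, oldest
    first. A least-squares straight line is fitted through them. Returns None
    when there are too few samples or usage is flat or shrinking.
    """
    if len(samples) < 2:
        return None

    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n

    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples share one timestamp, nothing to fit

    # Slope of the fitted line: growth in bytes per second.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None  # not growing, so no predicted "full" time

    # Usage according to the fitted line at the most recent timestamp.
    last_t = samples[-1][0]
    fitted_now = mean_u + slope * (last_t - mean_t)
    return max(0.0, (capacity_bytes - fitted_now) / slope)
```

In practice the samples could come from calling `shutil.disk_usage(path)` on a schedule and recording the `used` value, with an alert raised when the estimate drops below however many hours you need to react.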

Having the right people is also important, and I’m not talking about having Bob on call rather than Chris; I’m talking about getting developers awake at the right time. Let’s say there’s a memory leak: the alert should wake up both a sysadmin/DevOps guy and a developer. The only thing the sysadmin can do is make sure that the memory buffers are cleaned so it can start again (ready to fail at an undetermined later point) or automate restarts. These are all workarounds rather than fixes, and are things to be considered when it comes to “What can we do to stop this happening again?”. But you wrote the app and you have the developers, so why would you not do both: have the DevOps/sysadmin stabilise the system and minimise the impact while the developers investigate the cause and write a fix for the problem?
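As a rough illustration of waking the right people, the idea can be reduced to a routing table from the kind of alert to the roles that get paged. This is only a sketch: the alert names, the role names and the `roles_to_page` function are all hypothetical, and a real paging tool will have its own way of expressing the same thing.

```python
from typing import Dict, List, Set

# Hypothetical routing table: which roles get paged for which kind of alert.
ROUTES: Dict[str, Set[str]] = {
    "disk_full": {"sysadmin"},
    # A memory leak needs someone to stabilise the system AND someone to fix the code.
    "memory_leak": {"sysadmin", "developer"},
}

# Anything we have not classified yet still wakes up operations.
DEFAULT_ROLES: Set[str] = {"sysadmin"}


def roles_to_page(alert_kind: str) -> List[str]:
    """Return the roles to page for an alert, sorted for stable output."""
    return sorted(ROUTES.get(alert_kind, DEFAULT_ROLES))
```

The useful part is less the code and more the review: after each failure, ask whether the table sent the alert to the people who could actually fix the cause.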

With these simple tasks in place the only sensible thing to do for your service is to fail, lots, regularly, and then to put in place the solutions to stop it happening again. Failure is an option, and it’s one I’d recommend, with the appropriate framework in place!

EBay – old fashioned security in a modern day

Hello EBay

Firstly, I like EBay and have been using it for over 10 years. When I found out via news forums about the big security issue I realised I had to do two things.

  1. Update my email address to one I actually use
  2. Set a secure password

For some reason both of these rather simple things caused me problems due to “Security”, so let’s look at each one and work out why it is a problem.

  1. Email address cannot contain your EBay userID
  2. The “Secure” password could only be 20 characters long and could only contain ‘-_@’

With point 1, there was nothing I could do other than contact support, which I did (only tonight, after being bored), and that spawned this post. With point 2 I just coped, and when I checked today I found they have now got a new password policy, which seems to have been set by someone sensible and is now as follows:

[Image: EBay password policy]

So this does mean my rant about two things now becomes a rant about one thing, but it is the one that is annoying me the most so here goes.

Security Theatre

There are two types of security: theatre and actual. Actual security results in the system being secure, e.g. implementing two-factor authentication. Theatre, on the other hand, is things people do to make you think it’s secure, e.g. insisting that your username and email address are different. Why is that theatre? Well, simply put: firstly, my userId can be easily found out, so it is effectively public knowledge; secondly, knowing my userId should not make logging into my account any easier; thirdly, it not being the same as my email address can only stop people guessing my email addresses or other details.

So Ebay have enforced (since back in 2004, apparently) that your userId cannot be present in your email address. Let’s say my Ebay userId is soimafreak (it is): I cannot use any of my normal personal email addresses because, as with most people, I have an internet handle and I stick to it. Sure, I could use a different username on every site, and that does stop people guessing my username. But, again, knowing my username should not make it easier to hack my account… unless you have poor security to start with… Ebay…

Let us go on a storytelling journey now and hypothesise how bad Ebay’s security really is at its core. To do this you have to understand that Ebay was an original .com bubble company, back in the good ol’ days when good security consisted of two things: one, md5sum a user’s password; two, make sure your DB is not accessible on the internet and restrict access to it by username and password.

So, as discussed before, md5 has some flaws, but I imagine up until recently Ebay used an approach like this, or maybe worse, for storing passwords. Why is this bad? Well, you are simply subject to rainbow table attacks, which are very commonplace. Now let’s say it gets to 2004 and you hear about people doing that: what simple security precaution could you take without re-hashing everyone’s password, which would require everyone to change their passwords? Well, if you insist that the userId is not the same as or contained in the email address, then for those specific users it would be slightly harder to work out what their email address was. Was it a gmail.com, hotmail.com or aol.com address with their userId on it?
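To show why unsalted md5 is such a soft target, here is a minimal Python sketch. It is a simplification: a real rainbow table uses hash chains to save space, whereas this is just a precomputed dictionary of hashes, and the user names, passwords and function names are all made up for illustration. It says nothing about how Ebay actually stores passwords, which I can only guess at.

```python
import hashlib
from typing import Dict, Iterable


def md5_hex(password: str) -> str:
    """Unsalted md5 of a password, the 'good ol' days' approach."""
    return hashlib.md5(password.encode("utf-8")).hexdigest()


def build_lookup(candidates: Iterable[str]) -> Dict[str, str]:
    """Precompute hash -> password once; it can be reused against any leak."""
    return {md5_hex(p): p for p in candidates}


def crack(stored_hashes: Dict[str, str], lookup: Dict[str, str]) -> Dict[str, str]:
    """Return user -> password for every stored hash found in the lookup."""
    return {
        user: lookup[digest]
        for user, digest in stored_hashes.items()
        if digest in lookup
    }
```

Because the same password always produces the same hash, the expensive work is done once and then applied to every leaked database, and two users with the same password are visibly identical in the table.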

Why is this so pointless?

I’m not saying it was a bad thing to do back then, I’m saying it’s a bad thing to still be doing now, because things have moved on. I take my passwords quite seriously and as time goes on I move more and more websites into keepassx, where I have no idea what the password is. It would not be hard to guess or work out most people’s usernames for websites; I’ll give you a clue, it’s normally their email address or some other UID like your Ebay userId. So right away I can get everyone’s userId, but I shouldn’t be able to break their password. The problem comes if I crack your password on an insecure site. As you may recall from earlier, I don’t have to know your password, I just have to know a string that generates the same hash, which is why salts are important. So going back to Ebay, let’s say I pick a random Ebay user, my-pet-frog. I found them by searching for “wibble” on Ebay, and look what’s on their page…
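For contrast with plain md5, this is a minimal sketch of what salting looks like, using PBKDF2 from Python’s standard library. The iteration count, salt length and function names are my own assumptions for illustration; pick real parameters from current guidance rather than from a blog rant.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

ITERATIONS = 100_000  # assumed value for the sketch, not a recommendation


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest). A fresh random salt is generated when none is given."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Re-derive the digest with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)
```

With a per-user salt, two people with the same password end up with different stored values, so a table precomputed for one site or one user is no help against another.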

[Image: mypetfrog]

So I now have their email address, or at least a couple more leads to try. So again, what’s the point of the original security put in place in 2004, when the real solution is to educate users and to implement actual security and not security theatre?

Summary

So I ask you EBay to implement actual security and not theatre and more importantly to let me change my sodding email address.

Now as for my-pet-frog, I feel bad. They will hopefully read this and see that they should not share those details on Ebay because of ‘security concerns’. But why shouldn’t they? Shouldn’t all EBay users insist that Ebay just implements actual security, so the users can use the system in a better way without having to make their email addresses public because of security theatre and a lack of education from Ebay to its users? Anyway, as I was bad and used my-pet-frog as an example, I hope to go some way to compensate them.

Please check out their Ebay shop or their Amazon store front or, better yet, their actual website Hotscamp.com. There really are some awesome T-shirts on there, and one of my favourites is this Back to the Future one, or this Portal one.

I do have a massive transcript of the conversation I had with Ebay customer support about this issue, but it is largely irrelevant other than to say they are tied by the same system and they were helpful. Ebay did graciously allow me to write a letter of complaint to their complaints department, but that was too old-fashioned for me, so they get a blog rant. However, if you would like to print this blog and send it to their complaints department, here are the details:

Complaint Department 
P.O. Box 9473 
Dublin 15 
Ireland 

Enjoy!

AWOL – Sorry!

An Apology to you all

I thought it was time I apologised for not being around much for the last few months. The new job I took on in September has had some challenges, and by that I mean problems, and by problems I mean evolutionary screw ups. For the battle-hardened sysadmin this is nothing out of the ordinary, but this is the first time I had started somewhere that wanted to do “DevOps” from scratch, and it was all about continuous delivery. Unfortunately they made a few mistakes, which I want to cover not only so that you, as fellow readers, can look to avoid them, but also so you can understand the steps we have taken to start on the path of fixing it. And do not be deluded into thinking we are near the fix: we have simply turned the boat to face the right direction, and how we keep it on course is a conversation for next month.

I’m sure this is going to be a hard battle to win, but I’m certain in 3 months’ time we will have some of the basics under control while stretching for nirvana, as all good teams should. Now on with the fundamentals that no one should fail at; really, well, at least try not to.

Now do pay attention 007

  1. Magic cannot happen without hard work – Buying a book like The Lean Startup and preaching it to the masses as the right thing to do is fine; doing that and then failing to follow what you preached is bad. Large organisations looking to do the lean startup should not simply spin up a department, send it a huge budget and then expect magic. That is only part of the journey; trying to do the lean startup without measuring and learning is asking for trouble, and for you to lose focus on why it was a good idea.
  2. Don’t implement the end goal first – I mean, seriously! I’m sure there was a famous saying, “Run before you can walk”? Okay, I’m obviously being facetious, and you should know it’s “We must learn to walk before we can run” (and no, I’m not trying to quote Tony Stark). With IT operations / DevOps there is a cost paid for every sticky plaster and every ‘good enough’ solution, and that toll is really easy to fix early on. People understand this with software development but not operations: a silly 2 min decision about going live before the system is ready can take a lot of people years to fix, versus delaying by a week or two; the cost-benefit analysis would look hysterical. Continuous integration and delivery are built on good foundations; if you can’t build a system or manage a release manually and successfully, you are not ready, so try harder. I’m not saying don’t push yourself, but you all need to understand where the line is and not compromise on it unless there’s a good cause to do so.
  3. Accountability and structure are key – Someone within the business needs to own the whole operational lifecycle before the system is released. In fact, while you are still having the idea for running a service, employ someone to work out what they need to make it a success 3 – 6 months from now. The operational involvement in releasing a service should be iterative and inclusive from the outset, else you’ll end up with some code that is not deployable.
  4. Third parties don’t care about your problems – I’m not saying it’s true in all cases, as they should care, but they are driven by different things, different requirements that make it easy for them to move on. Just because the third party can run the software they made doesn’t mean you can. And to said third parties: release well tested / versioned code or run a service; don’t do both badly, which you are, sorry. At least by running a service you can hide the fact your code is bad, but giving it to people when it’s clearly a version 1 and not suitable to be run anywhere is bad.

We can fix it

One of the challenges I faced first was not having the right structure for any real push back on ‘craziness’, i.e. no management-level buy-in and no operational seat at the table. This is quite important as it allows the team to push back on work without being distracted by other tasks; someone needs to make the hard calls about a live site being down and about releasing code.

Stabilisation of the core fundamentals is critical; get to a stage where reproducibly building the system from scratch is possible. Ensure that updates and releases can be performed reliably by hand. Make sure that the system is supportable end to end.

Have an end goal: work out what utopia is and narrow in on it as time goes on. Start executing to a plan. Think of this as playing Civilization V or something: if you have a strategy you will be fine, but shooting in the dark will end badly for everyone.

Summary

There is always a way out of the most horrid situations, though it does require some compromising on the solutions. The goal is to do whatever it takes to build up slack, enough slack that some of the bits that were done badly can at least be done properly to build up yet more slack. Hopefully by the time there are a few months of slack built up, some sort of system can be implemented to ensure ongoing operations remain focused and within plan. Just remember the end goal and aim for it (roughly).

Releasing your first DevOps Application

First the worry

When it comes to releasing the first version of an application it’s always worth weighing up the constraints of your environment and the time frame in which the task was delivered versus the skill set available. Inevitably as a skilled DevOps professional you want to do a good job, well done you; however you have to be strong and realise it is not about delivering perfection from day one but about the journey you must take to get there.

I recall the first deployment I did for a version 1, and with every one I have done since then I get better, be it a bit more focused or a better starting point. The very first one I did was all over the place: no real configuration management and quite a few manual steps, but a well written process. Unfortunately that project remained in the depths of secrecy and I ended up moving on.

I constantly see over-engineering and complication added to projects, and the root cause of this is worry. I know; I used to be there doing it. It is difficult to step back and be objective about what the business needs, but as a DevOps professional that is your job. When delivering a solution, try to remember these things to help you worry less and focus more:

  1. Before being perfect you must first just “be”
  2. When in doubt, do less
  3. If you do not know when the site is down you will not have a job
  4. Always have a backup

Then the delivery

The above list is quite useful; use it as a bit of guidance. Starting with point 1, some elaboration: when delivering a solution the most important thing is to deliver the solution. So many people forget this part and focus on the technicalities or whether or not it is the “best” way to deliver the solution. In reality, who cares? No one will care when you are in that meeting explaining why you’re late and have not got a working solution.

Getting stuck in the detail is a horrible place to be. Sometimes it gets too involved or too complicated, leading to much discussion, and inevitably the solution comes out complicated and will take a while to deliver. In these situations point 2 comes in: just do less. It sounds silly, but if you’re rushing around struggling to meet a deadline then you need to take things out of scope and focus on what the actual solution needs to be. Maybe you have to have a manual step; at a later point you can automate it.

The last two points are along the same lines, and those lines are things that get you fired. If your site is totally down and you don’t know about it, that’s a bad thing; likewise losing data is considered pretty poor. However, do not get stuck in the trap of assuming you must have full monitoring of every server or that the backup needs to be anything more than a cron job for now.
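To show how little “a cron job for now” can be, here is a minimal Python sketch of a backup that just tars up a directory with a timestamp in the name. The function name, the paths and the cron line in the comment are all hypothetical, and it makes no attempt at rotation, off-site copies or database consistency; those are exactly the things to improve later.

```python
import shutil
import time
from pathlib import Path

# Hypothetical crontab entry that would run a script calling this nightly at 02:00:
#   0 2 * * * /usr/bin/python3 /opt/scripts/backup.py


def backup_directory(source: str, dest_dir: str) -> Path:
    """Write a timestamped .tar.gz of `source` into `dest_dir` and return its path.

    `dest_dir` should live outside `source`, otherwise the archive would try
    to include itself.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    stamp = time.strftime("%Y%m%d-%H%M%S")
    base_name = dest / f"backup-{stamp}"

    # make_archive appends the extension itself and returns the final file name.
    archive = shutil.make_archive(str(base_name), "gztar", root_dir=source)
    return Path(archive)
```

It is crude, but it turns “we have no backup” into “we have a backup we know how to restore”, which is the risk that actually gets people fired.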

The “trick” is always around identifying what needs to be done and what could be done; by focusing on what needs to be done first, you can then come back to improve the rest.

build, improve, rinse, repeat

As touched on earlier, you are allowed to cut corners and focus on what is necessary; failure to do this will just lead to delays and a business that is rapidly getting turned off DevOps. The first release you do can be complete and utter crap. It can be all manual, with nothing more than a simple web check on port 80, and that is okay. The important thing is that you deliver to the deadline and have mitigated the main risks of not knowing when the site is down and of potentially losing data. Heck, even single points of failure are allowed, as long as you can clearly identify what the risk is and have a solution if it were to happen. In fact, I’d almost go as far as to say this is expected.
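The “simple web check on port 80” really can be this small. Here is a minimal Python sketch that only tests whether something accepts a TCP connection; the function name and the default timeout are my own choices for illustration. It does not prove the site is serving correct pages, only that it has not fallen over completely, which is the risk being covered for a first release.

```python
import socket


def site_is_up(host: str, port: int = 80, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

Run it on a schedule from a machine that is not the web server and send an alert when it returns False; a proper HTTP check with status codes can come in a later iteration.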

The key is, as always, to improve, little and often. Step 1, manual; step 2, automate what is easy; step 3, automate the rest. It has never been and will never be about perfection from version 0.1 onwards; you just need to improve a little each time, in line with that golden view of what perfection is. As long as you know what the end goal is you can work towards it; just don’t get carried away trying to deliver it all in the first version.

Helping others

A while back

I was volunteered to help our finance team customise their accounting system, and I’ve finally got to the end, or at least the end is now in sight. It was an interesting situation: a few years back the accounting used to be done in Access, and a couple of years back they migrated to a solution provided by Netsuite. I have to say, when I started I spent ages complaining about how awkward and painful it was to do anything in Netsuite, but as I’ve made the journey I’ve come to understand a bit more and it’s not bad; it’s still massively complicated, but it is at least not that bad!

One of the most frustrating things I found was using web services. I’d never done it before and on paper it sounded like the best solution, but it was a nightmare. The most annoying thing is that the provided WSDL had errors in it, so I ended up having to spend some time finding the bad lines and fixing them from within vi, thanks to this link. Needless to say, after a few weeks of struggle I started down the right path, which was to use more of Netsuite to do more and to write a simpler tool to do the rest.

Time for a rest

I found the RESTlets, which were a big help: use their documentation to work out roughly what to do, then hook it in and hope for the best. It’s worth noting the documentation is excessive and hard to follow, which makes you wonder why they bother. Either way, after writing that and then spending some time trying to get it up and working, it was there; it just needed the other 7 pieces of the puzzle to make it work. Luckily you can call searches remotely to get a list of results, then manipulate them and send them back. I came up with (with a lot of help from the finance team) a search that only returned entries I hadn’t yet changed, then took the details from that and manipulated them into the format I needed to post back to my RESTlet to do the updating.
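The shape of that workflow is simple enough to sketch in Python, with a big caveat: nothing below is Netsuite’s actual API. The two callables stand in for “run the saved search” and “post to my RESTlet”, and the field names are invented, so treat it purely as an outline of fetch, reshape, post back.

```python
from typing import Callable, Dict, Iterable

Record = Dict[str, str]


def build_update(record: Record) -> Record:
    """Reshape one search result into the payload the update endpoint expects.

    The field names ("entry_id", "memo", "status") are made up for the sketch.
    """
    return {
        "entry_id": record["entry_id"],
        "memo": record.get("memo", "").strip(),
        "status": "processed",
    }


def sync(
    fetch_unprocessed: Callable[[], Iterable[Record]],
    post_update: Callable[[Record], None],
) -> int:
    """Fetch entries not yet changed, post an update for each, return the count."""
    count = 0
    for record in fetch_unprocessed():
        post_update(build_update(record))
        count += 1
    return count
```

Because the search only returns entries that have not been changed yet, running the sync twice should be harmless: the second run simply finds nothing to do.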

After sorting out the entry of data I then just had to work out how to invalidate the entry if anything was changed. This was harder: it took a bit of reading and then it was clear what to do, just impossible to find out how to do it! I must have spent more time on this project working out how the system worked and just using the tool than coding.

Bigger and better things

This initial step was an important one: it sets the foundation for a raft of other changes that will hopefully be simpler and easier to put in place, but like anything unknown, we won’t know until we know. I’m looking forward to taking a system that already provides value and making it provide more, with reporting and customisations; hopefully over the coming weeks it progresses to produce more useful information.

Summary

In short, help people. Sometimes it starts off being a pain, but at least you help someone and you also learn something. Yes, it may lead to a flurry of other bits, but ask yourself this: if you needed help, would you want someone to help you?

What challenges you?

Over the last few weeks

I have been wondering what most people find challenging in the “modern” IT world. There’s been a recent upsurge in tools and technology that address most problems, which only leaves me to wonder what is filling that gap. What is the current big annoying problem? Maybe it’s not being able to push your architecture into multiple clouds, or having to live with the constraints of small root disk volumes; who knows? Hence the poll :)

A week in the Valley

While out and about…

Over the last week I’ve been out in the Bay Area meeting with an important client, talking about their needs and how we’re going to make things better for them and for us; all in all a good trip (apart from the plane crash). This was my first time in the Bay Area and it seems like a nice enough place. It lives up to expectations in some areas and not others, and I’m sure with more local knowledge it’s possible to overcome some of the issues I had with the area. The main issue I could see (granted it was only a week) is that it’s not as nice to live there as it is in the UK, or even as nice to be in and around as somewhere like London.

In London, everything is a walk away or a short tube journey, and better yet, if you’re willing to travel more than an hour each way each day you can live in the countryside and just commute in. In the valley everything is a short car journey away, the public transport seems a bit hit and miss, and taxis aren’t cheap!
I think it’s things like this which will be the end of the Bay Area over the next 10 years unless it changes. I’m not the only one to think this, and as time moves on I think we’ll see a shift in tech start-ups away from Silicon Valley into areas that are nicer to live in.

Which is what brings me back to London: there’s a good start-up culture, there’s more investment going on and there are some good companies starting to appear. Unfortunately for tech start-ups the UK still isn’t brilliant, but it will get there in time; it probably just needs a few more years and some brave people to blaze a trail.

I think London has the makings of a nice tech hub for Europe and will over the next few years start exceeding the Bay Area. The only thing it’s really missing at the moment is the massive success stories that appear in the Bay Area every few years; sure, there are some good companies, but none are an Apple, Google or Facebook.

I think I could survive living out there for a while, but not forever. It’s nice being able to go to the beach, forest, mountains, whatever you want, all within a reasonable drive, but there’s too much convenience stuff, like fast food, corner shops and drive-throughs. Take Walmart: it’s got a purpose, but it’s not for me. Trader Joe’s seemed better, but no soft drinks, just fresh goods and booze… Maybe in time I would have found stuff that felt a little more “me” and a little less American, but I’d have to go and give it a go to find out!

For me personally, I don’t really want to live in London, it’s just too busy, but living out in Hampshire makes London an awkward commute: doable, but not every day. As time goes on I’m still hopeful that more start-ups will start offering flexible working like we have at Alfresco, where going in for 2-3 days a week is the norm and everyone is trusted to do the work. And who knows, over the next 10 years maybe more will start filling the M3/M4 corridor, which will make living in a nice place and commuting to a nice tech company all possible.

It will certainly be interesting to see how the tech industry in the UK changes over the next few years, but I’m certain it’s picking up speed.

Time for an idea

Why not

It’s been a while since I’ve thrown myself into an idea and tried to come out the other side, so I’ve spent the last couple of days just thinking about what’s missing. It doesn’t take much to have an idea; but making sure it’s a good idea, making sure it is unique in its offering and making sure it’s better than anything else is not easy.

At work we are working on an idea, a concept of some sort of DevOps tool that takes a lot of what we do already, simplifies it and merges multiple tools into one place. The driving goal is ease of use: take an entire system, data centre, whatever you want, and within minutes you’ll have the whole thing monitored, feeding metrics back for reporting, performing real-time analysis and trending. It’s still very much in the prototype phase, but it’s a very exciting project that wraps up several elements that we as a team are passionate about: ease of use, efficiency, performance, monitoring, measurements and, of course, cool technology. With that said, I still have this urge to do something else. I’m not really sure why, as I’m busy enough as it is, but I feel like the world is missing something that is more than just an amalgamation of parts or a re-skin of an existing thing. The question is, as always, what?

There’s a saying, “All the good ideas are gone”. Probably true, but that doesn’t stop people striving for new things. Look at Glass: I’m not convinced it has a long-term future in that styling, but wearable tech certainly does. Look at this wrist computer from the TV show Chuck, just what I’ve always wanted.

[Image: Wristcomputer]

A lot of the best ideas today are based on things that have come before, re-envisioned: Walkman -> iPod; iPaq -> iPad; AltaVista -> Google.

Just because it’s been done in a similar way before doesn’t mean you can’t do it better, or take from their ideas and make it work; half of the battle is the conviction to want to do it better or differently. Which is why everyone should try a new idea, and everyone should try something new, to make something better.

What to do

This is the bit I’m struggling with, and it’s the hardest bit of the whole thing. For me it’s not good enough to take an idea and make it better; if someone gives you a product and says make it better, it wouldn’t take long for a few ideas to bubble up. I’m thinking more along the lines of taking some wacky, out-there thinking and making it a reality in a way that works and works well.

I think over the next few weeks I’m just going to write some things down and see which ones stand out and which ones seem stupid or crazy to do, and then probably come up with one that works.

I’m not really sure what that is at all. I could take something like sentinel and munge that into something else, but it just doesn’t feel like the right idea. I had an idea a long time ago, probably 3 years ago, which I talked myself out of because it would have taken me forever to make and I didn’t have the skills. Things have changed since then. It isn’t even ground-breaking; it’s just another internet site.

Either way, I’ll keep plodding along for now and see if I can come up with something. Until then, there will be much more scribbling on paper and throwing things at a wall.

DevOps team DNA

Hi, this is my first post on Matt’s blog. I’ve been an avid supporter of his blogging for a while and today got an invite to contribute. So here’s my post (created very quickly before he changes his mind).

My job has always been within the operations department of software product companies. I started at a small company as ‘everything’ support and slowly drifted towards a specialisation in the more recently branded DevOps-y areas as I made my way through various acquisitions and mergers. Over the past couple of years I’ve found myself building DevOps teams. During that time I’ve discovered some of the things that work and almost everything that doesn’t work (or it feels like that :) ).

Some of the things that have worked (for me anyway)

Obviously these are going to be quite subjective and I doubt they will work for everyone. I’ll focus mostly on what I think are the key ingredients of a successful team. Maybe some people will find it interesting. Bear in mind that this only really applies to an operations team that supports a Cloud service.

I’m not a big football fan but I can draw some parallels between football managers and DevOps teams. You don’t see Arsenal winning and losing games based on their process redesigns. I may be simplifying, and I’m sure tactics play a large part, but I believe you get quite a bit more out of a team when you have excellent players: players who excel in different areas. My teams tend to be 5 to 7 players nowadays, and between all of us we need to cover a few areas.

The first is product knowledge. If you have a product guru in your team then you’ve got a productivity catalyst. So many aspects of our work involve investigating whether issues are product vs config, and whether we can improve things from an operational perspective so that the product runs better. The most recent team has a Product Architect and he’s awesome. He’s on the cutting edge of ideas for the product, for AWS and for all of the supporting technologies. Having a dedicated resource to do all of this in the background is great: it means that when we automate his prototypes and release them we get the maximum benefit. Recent examples include our Public API work and the work being done on our Amazon architecture to improve speed (CDNs etc).

The second role I’ve always tried to fill is an engineer (at least one person, preferably two). Get the most senior developer(s) that you can, who know the language and the build system of the product that you are supporting. You can now write the high-level instrumentation that every DevOps team needs, as is true of any automation project. There is only so far you can go with Bash (I tend to take things beyond where they are supposed to go with Bash as it is). Ultimately, having a senior developer or two buys you a massive amount of flexibility. Need a web service for something like externalised Puppet variables? You can write your own. Backup scripts not fast enough? A senior developer will make those scripts look very feeble in comparison once they are rewritten in their preferred language and multithreaded. I’m careful about not reinventing the wheel and will usually go off and clone something from GitHub before starting from scratch myself, but having some people who can write stuff from scratch is a major advantage. One caveat I would add for this role: hire from outside. Developers hired internally usually end up getting pulled back at some stage to work on the stuff they did before. If you can, hire a new person and liven things up. Obviously, tell the engineering teams that the hire(s) are for instrumentation in case they get worried that you want to start adding buttons to the product :)
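To make the “multithreaded backup” point a bit more concrete, here is a minimal sketch in Python. It is an illustration only, not the script any team mentioned here actually runs: the function names `backup_file` and `parallel_backup`, the default of 8 workers and the plain file-copy approach are all assumptions made for the example.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List


def backup_file(src: Path, src_root: Path, dest_root: Path) -> Path:
    """Copy one file into dest_root, keeping its path relative to src_root.

    Hypothetical helper for this sketch.
    """
    target = dest_root / src.relative_to(src_root)
    # exist_ok=True keeps this safe when several threads create the same folder
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, target)  # copy2 also preserves timestamps where it can
    return target


def parallel_backup(src_root: Path, dest_root: Path, workers: int = 8) -> List[Path]:
    """Copy every file under src_root into dest_root using a pool of threads.

    Hypothetical entry point; the worker count is an arbitrary assumption.
    """
    files = [p for p in src_root.rglob("*") if p.is_file()]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order and re-raises any copy error
        return list(pool.map(lambda f: backup_file(f, src_root, dest_root), files))
```

Threads suit this kind of job because copying files is mostly waiting on disk or network rather than on the CPU. Whether it actually beats a serial Bash loop depends on the storage underneath, so it is worth measuring before and after.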

Lastly, the sysadmins. I’d actually consider myself one of these at heart. Getting a good sysadmin can be tricky. It’s not uncommon to read 100 CVs before finding someone even remotely eligible. For a DevOps team you need a reasonably rare mix of skills: people who know Linux inside out, who can script and get excited by the latest batch of tools, and nowadays you need to throw Puppet / Chef into the mix. I have a couple of these currently and consider myself extremely blessed. Everything that we do is checked into source control (we use AWS as our data center) and this buys us a lot of things, like the ability to automate everything, reduce costs by deleting and recreating at whim, and disaster recovery. However, you pay for those things by hiring really good people, which is a cost saving in the long run once the cost-saving benefits of the team start to show.

Now, if you add in all of those types of role, what I’ve found works quite nicely is running the team without being too focussed on the separation of responsibility. Everyone is on call 24/7. Everyone is expected to know the product inside out (although nobody will get near the level of expertise of the Product Architect), everyone scripts (even me) and ultimately everyone will end up doing some programming tasks. You can probably see from Matt’s previous blog posts about the Metrics project that he got the chance to learn some Ruby. I think it’s important that everyone knows a bit of everyone else’s job, although when under pressure everyone naturally drops back into doing what they are good at to speed things along.

This probably looks a little odd from the outside. But it makes things fun, everyone stays engaged and ultimately we all share the same goal: scale to 1 million users :D