AWOL – Sorry!

December 26, 2013 / Matthew Smith / No Comments

An Apology to you all

I thought it was time I apologised for not being around much for the last few months. The new job I took on in September has had some challenges and by that I mean problems, and by problems I mean evolutionary screw ups. For the battle hardened sysadmin this is nothing but ordinary, this is the first time I had started somewhere that from the scratch wanted to do “DevOps” and it was all about continuous delivery but unfortunately they made a few mistakes which I want to cover to ensure that not only you as fellow readers can look to avoid these but also so you can understand the steps we have taken to start on the path of fixing it; and do not be delussioned to think we are near the fix we have simply turned the boat to face the right direction while we work out how to keep it on course is a conversation for next month.

I’m sure this is going to be a hard battle to win but I’m certain in 3 months time we will have some of the basics under control while stretching for nirvana as all good teams should, now on with the fundermentals that no one should fail, really, well at least try not to.

Now do pay attention 007

Magic can not happen with out hard work – Buying a book, like the Lean startup and preaching it to the masses as the right thing to do is fine, doing that and then failing to follow what you preached is bad. Large organisations looking to do the lean startup should not simply spin up a department sent a huge budget and then expect magic. That is only part of the journey, trying to do the lean startup with out measuring and learning is asking for trouble and for you to loose focus on why it was a good idea.
Don’t implement the end goal first – I mean, Seriously! I’m sure there was a famous saying, “Run before you can walk”? okay I’m obviously being facetious and you should know “We must learn to walk before we can run” an not trying to quote Tony Stark With IT Operations DevOps there is a cost paid for every sticky plaster and ever ‘good enough’ solution, that toll is really easy to fix early on. People understand this with software development but not operations, a silly 2 min decision about going live before the system is ready can take years of a lot of people trying to fix it vs delaying by a week or two, the cost benefit analysis would look hysterical. Continuous integration and delivery are built on good foundations, if you can’t build a system, or manage a release manually successfully you are not ready, try harder. I’m not saying don’t push your self but you all need to understand where the line is and not compromise on that unless there’s a good cause to do so.
Accountability and structure are key – Someone within the business needs to own the whole operational lifecycle before the system is released, In fact while having the idea for running a service, Employ someone then to work out what they need to make it a success 3 – 6 months from now. The operational involvement in releasing a service should be iterative and inclusive from the off set else you’ll end up with some code that is not deployable.
Third parties don’t care about your problems – I’m not saying it’s true for all cases as they should care, but they are driven by different things, different requirements that makes it easy for them to move on. Just because the third party can run the software they made doesn’t mean you can, and to said third parties, Release well tested / versioned code or run a service, don’t do both badly, which you are, sorry. At least by running a service you can hide the fact your code is bad, but start giving it to people when it’s clearly a version 1 and not suitable to be run anywhere is bad.

We can fix it

One of the challenges I faced first was not having the right structure for any real push back on ‘crazyness’ i.e. no management level buy in, no operational seat on the table. This is quite important as it allows the team to push back on work without being distracted on other tasks, someone needs to make the hard call about live site down and releasing code.

Stabilisation of the core fundamentals is critical, get to a stage where reproducibly building the system from scratch is possible. Ensure that the update and releases can be performed reliably manually. Make sure that the system is supportable end to end.

Have an end goal, work out what utopia is and narrow down in on it as time goes on, start executing to a plan, think of this more as playing civilization V or something, if you have a strategy you will be fine, shooting in the dark will end badly for everyone.

Summary

There is always a way out of the most horrid situations, it does require some compromising on the solutions. the Goal is to do what ever it takes to build up slack, enough slack that some of the bits that were done badly can at least be done properly to build up yet more slack, Hopefully by the time there is a few months of slack built up some sort of system can be implemented to ensure ongoing operations remain focused and within plan. Just remember the end goal and aim for it (roughly).