DevOps: What I Learnt From A Nightmare Journey

I've just finished what was a very rough journey into a fully-automated deployment infrastructure at a large scale. At times it felt more like I was in a bizarre far-future version of a Victorian penny dreadful, complete with roguish charlatans and cliff-hanger disasters. But we battled through it all and emerged, like all good protagonists, somewhat wiser and more understanding than we were at the beginning of the journey.

Getting it right by accident

One of the surprises for me was realising that on the handful of such projects I'd worked on before, we'd got things right through a combination of good intentions, happy accidents and a lack of interference. This is something I think a lot of people in the DevOps arena are going to start to encounter - when you're a close-knit team of three people building a few servers and sticking some containers on them, you do things because they feel like the right thing to do; nobody is looking over your shoulder asking you to separate out how much budget you're spending on "containerisation" or saying, "wait, hang on, I've not heard of this Terraform thing before, do we have to use it?"

The problems started when people noticed that they had a team who deployed their code in under 10 minutes, deployed it regularly, and didn't really have any outages or problems along the way. (At one point, my service delivery manager even started pre-emptively asking in his e-mails, "I guess you're going to send me an 'it auto-recovered' e-mail when you get in?") They wanted us to take on one of their huge, multi-department projects and work the same auto-scaling, auto-deploying, auto-recovering magic on it.

So suddenly I got catapulted into this crazy Enterprise Land project with about twenty stakeholders, none of whom really "got" what we were doing with DevOps. Hell, I didn't "get" what I was doing half the time, and yet here I was spending an hour in a massive boardroom where "DevOps" was a bullet point somewhere after a hundred different commercial considerations, and most of the people in the room considered it some kind of magic dust you sprinkled on a project right at the very end to make it immediately fifty percent more reliable and a quarter of the cost.

As a result, we made a lot of mistakes.

Mistake 1: The Magic DevOps Fairy

One of the founding tenets of the little team we started with was that anybody could figure this out, given time and a bit of space. As long as you knew conceptually what you were trying to do, and had a solid, testable definition of "done", working out the details of the tech would come naturally. We'd started out as a couple of .NET developers and a JavaScript guy, and while in hindsight the fact we were all Linux nerds and had misspent our youth building networks turned out to be very important, we mostly theorised that if you had a goal of "automate the things" you could probably get there through a combination of figuring it out and going to conferences and talking to people.

Once I hit Enterprise Land this approach simply didn't fly. Everyone in Enterprise Land was obsessed with the idea of functional specialisation; if you weren't already an experienced practitioner, there was no way you could be entrusted with such an important project. We needed to hire a DevOps Expert, who would sit in his Silo Of Functional Expertise and dispense the necessary quantities of DevOps magic to make things work. Huge mistake! In Little Team, none of us were experts, and we knew it. That meant we all sat around discussing how we were going to do things, then paired up and did them. As a result, we never had any lasting blockers or bottlenecks, because if one of us hit a deployment issue we'd just crack on and solve it ourselves, then let the others know what we'd done.

As soon as we created the Magic DevOps Fairy, we also created a significant bottleneck: stuff had to be handed off to this guy, and he was supposed to Make It DevOps and then give it back. As if this wasn't quite enough of a foolish idea, we had also decided that the Magic DevOps Fairy would be shared between multiple different projects. It's one of those things that sounds almost beguilingly sensible when you're half-asleep in a twenty-person meeting and the phrases "standardised approach", "reuse" and "commonality" are flying around. This is a bit like convincing several people to jump off a cliff rather than use the stairs because you'll all accelerate at a standardised rate on the way down.

The Magic DevOps Fairy turned out to be a foundational mistake. We might have recovered from it, had we been sensible everywhere else. But we weren't; we made a number of other critical errors that all compounded on each other.

Mistake 2: Not knowing where you're going

One of the reasons I like doing deployment automation is that the requirements are very clear-cut. You're almost always working with some variation on, "I want Thing A to end up on Service B, with no noticeable outage during deployment." The technical wizardry to make that happen may be excruciatingly complicated, but at least you know what you're supposed to be doing.

It turned out we didn't. First we were going for classical hosting in a data centre, and it definitely wouldn't be a public cloud of any type. Then we were going to have a temporary installation in AWS until the hosting was ready. Then it was AWS, but with no PaaS, so we could back out quickly. Finally, a few weeks before the end of the project, we admitted to ourselves that actually everything was going to be in AWS and we may as well start using the PaaS stuff because we weren't backing out any time soon.

Of course, this meant that we had to make compromises on the architecture, and some of them were expensive compromises. Little Team had done things the other way round - we had prototyped a solution, and then let the prototype inform us as to what the best hosting option would be. In Enterprise Land this hosting decision had ended up being someone else's responsibility, far divorced from the team developing the software, and compounded by a strategic directive operating on a completely different timescale to our project.

The lesson here is… well. If you're an eternal idealist, the lesson is that you should make technology decisions as closely as possible to development, i.e. the team doing the development should make the decision. If you're a realist, then the lesson is to be prepared to spend time refactoring your design when you change your hosting; especially in AWS, an inappropriate design can burn through more in its first few months of hosting fees than it would cost in development time to fix.

Mistake 3: Doing it all at the end

Once you've made a couple of mistakes, they combine in interesting ways. That's exactly what happened for us. Because we didn't know where we'd be hosting, and because our Magic DevOps Fairy was away sprinkling magic on other projects, we didn't build up our infrastructure as we built our application. In this case we knew what we were doing was stupid (we had plenty of project retrospectives that screamed, "build infrastructure as you go along!") but by distributing our work across so many teams we ended up doing the stupid thing anyway - not by conscious decision, but by a, "the hosting environment isn't ready" here and a, "we need your DevOps guy on a different project" there.

The result? A catastrophic infrastructure build - rather than keeping pace with simple applications as they grew, we were trying to force complex applications on to completely new infrastructure and finding that those applications made assumptions which were bad for the new environment. We were also finding these problems at the worst possible time - near the end of the project, with little time to correct them, and too much existing project complexity to effectively bring new team members in to help (a classic Fred Brooks problem). As a result, we had to make compromises - applications that were supposed to be load balanced ended up on a single server, connections got hard-coded to individual boxes, and installation tasks were done manually rather than automated.

Mistake 4: What Measure an Expert?

On Little Team, none of us considered ourselves experts. That meant we had no problem with asking each other for help, peer-reviewing our work even when we didn't think we needed help, and pushing back on the business when things got thorny and we needed some more time to figure out a solution.

In Enterprise Land, we had an expert. And then we broke our expert, by expecting the guy to deliver too much, too fast, in tools he didn't know how to use. And because we'd pushed DevOps into its own little silo, we weren't collaborating as a team and solving problems together. This started the spiral - desperate to get things done, our DevOps expert would hack them in manually late at night, in the hope of correcting them after the fact. Only there was never time to correct those things. So he started not checking in his code, in the hope he could rescue it all before someone spotted that servers which were supposed to be provisioned by masterless Puppet didn't even have the agent installed.

Worse, as things started going wrong from manual hack compounded on top of manual hack, we started getting excuses. "Sometimes ELBs just come up wrong, and there's nothing you can do." Or, "Terraform is full of bugs, that's what's causing the problem." By this point, our expert was regularly spending more of the week off work than at work, and I was starting to have awkward discussions about our duty of care to someone who appeared to be suffering from serious workplace stress or depression. At that point you can't start asking awkward questions about peer review or what's actually been done, because you don't know what kind of mental hell you're putting the poor guy on the other end through. In the end, all we could do was ask him to put what he had in source control, take a long break, and let us figure out what needed to be fixed and how.

There are a lot of lessons around this. But the biggest one: stop the line. Don't let yourself start thinking, "it's just the way he works", "he's the expert here" or, "maybe we just didn't know what we were doing". If something doesn't feel right, it probably isn't right. Be aware that not everyone asks for help, even if they need it; whether that help is on a personal or a technical level.

Second biggest is that silos just create opportunities for people to hide the amount of trouble they're in. You can't expect peer review if you don't give someone any peers.

Mistake 5: No Scratch Environment

There's a problem I commonly see in companies, which I call "environment escalation". Environment Escalation is where you create a development environment, happily develop against it, only to one day suddenly receive an extremely stroppy e-mail with half your senior management CCed saying, "you broke the environment during an important client demo, this is simply unacceptable."

Fundamentally, unless you explicitly ask people to agree, "this is a scratch environment, I will not use it for anything important and I will not give anyone else access details" before giving them a URL, you will find that over time all environments become effectively production due to their use for integration work, client demos, or just the CEO happening to have the pre-production site set as his bookmark. (Sometimes, this happens even if they do agree to an environment being scratch.)

This becomes a big problem for an auto-scaling, auto-recovering environment. You know how you test one of those? You take things down. You see if they come back up. If they don't come back up reliably, you fix your scripts until they do. You can't do that if the moment you kill a server, an e-mail goes out to 20 people demanding to know why the dev team dared to take down such a critical service.

Lesson one: when you create an environment, you need to document exactly what guarantees you're offering on it. Ideally, have a highly visible header or footer on every web UI stating, "SCRATCH ENVIRONMENT - MAY GO DOWN AT ANY TIME" on the environment you're using to test your automation scripts. Make sure everyone knows when something is unstable - and if they need it to be stable, be clear that you're going to need a different environment to test your infrastructure code against. A fantastic tip from Google here: if something is supposed to be unreliable, make it unreliable. Create your own deliberate outages so people don't get surprised by the first accidental one.
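
The banner doesn't need to be clever, either. Here's a minimal sketch of the idea, assuming a Flask app (any framework with a response hook or shared layout would do equally well); the ENVIRONMENT variable and the styling are illustrative, not what we actually ran:

```python
import os
from flask import Flask

app = Flask(__name__)

# Assumed convention: provisioning sets ENVIRONMENT on every box.
ENVIRONMENT = os.environ.get("ENVIRONMENT", "scratch")

@app.after_request
def add_environment_banner(response):
    # Stamp every HTML page outside production with an unmissable warning.
    if ENVIRONMENT != "production" and (response.content_type or "").startswith("text/html"):
        banner = (b'<div style="background:#c00;color:#fff;text-align:center;'
                  b'padding:4px;font-weight:bold">'
                  b'SCRATCH ENVIRONMENT - MAY GO DOWN AT ANY TIME</div>')
        response.set_data(banner + response.get_data())
    return response
```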

Lesson two: test to destruction. We'd have soon spotted our issues with manual hacking if we'd had a chaos monkey script destroying a random bunch of services every day so we could see what recovered by itself. We'd also have learnt a lot more about what logging and monitoring we needed in order to catch things going down.
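
The script doesn't need to be anything as grand as Netflix's Chaos Monkey. A minimal sketch, assuming AWS, boto3 and an Environment=scratch tag on anything that's fair game (the tag name and region here are assumptions, not a standard):

```python
import random

import boto3


def terminate_random_scratch_instance(region="eu-west-1"):
    """Kill one running instance tagged as scratch, then leave the
    auto-scaling group to prove it can replace it unattended."""
    ec2 = boto3.client("ec2", region_name=region)
    # Only ever touch instances explicitly tagged as scratch.
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:Environment", "Values": ["scratch"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]
    instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if not instances:
        print("Nothing tagged as scratch is running; nothing to break today.")
        return
    victim = random.choice(instances)
    print(f"Terminating {victim} - it should come back on its own.")
    ec2.terminate_instances(InstanceIds=[victim])


if __name__ == "__main__":
    terminate_random_scratch_instance()
```

Run something like that from a daily scheduled job, and alert if the environment hasn't recovered an hour later - that second half is exactly where our logging and monitoring gaps would have shown up.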

Mistake 6: The Endless Kanban Of Doom

All of the above still wouldn't have been a big deal if we'd followed the Scrum process. We'd have got to the end of our first sprint with infrastructure tasks, failed to achieve a bunch of objectives, and stuck, "DevOps isn't going well" on the wall in our retrospective. Worst case, we'd have gone one more sprint and then had a bunch of Post-It notes saying, "DevOps is getting even worse" and we'd have sorted it.

But we didn't follow the Scrum process.

Working within Little Team, we'd got everyone around us pretty used to Scrum. They knew that we raised problems early, we fixed them, we got back on track. It took us a while to convince the project managers who interacted with us that it was not only normal for half of our status reports to be walls of amber and red, but that this was actually the mechanism by which things started going green as we approached our deadlines. Once we'd convinced them, though, they were happy, not least because our projects went green at the end rather than red.

Within Enterprise Land, we just couldn't get that attitude far enough up the chain - at some point, we'd hit someone who was so used to delivering watermelon metrics (green outside, red inside) that they wanted to change our process to make it look more green. Which ended up as a tremendous amount of pressure to, "just do a Kanban".

Getting Kanban right requires a lot of discipline. You have to constantly be aware of how easy it is for tasks to drag on, especially when they require multiple people to work on them. Lane limits need to be set strictly and enforced rigidly, to the point of stopping the line and sounding the alarm if, for example, too much is stacking up without being tested. It also helps to do a little bit of estimation, or failing that ask what's going on with your outliers when something sits in progress for ages without moving.

No prizes for guessing that we didn't do any of this, and our attempts to fix it from the bottom were frustrated by the continual pressure to "report up green". Plus we had a broken guy as our bottleneck, who was either moving completely untested (sometimes even completely non-functional) code to "Done" or straight out wasn't there at all. Combine that with Mistake 5, and we were worsening our already slow progress by spending half our time in escalation meetings to explain why the development environment was down again. Worse still, with so many people involved in the project, the Kanban was filling up with badly-described tickets, duplicates, and tasks with a definition of done so poor it turned out to be impossible to complete them.

The solution here would have been simple: don't have a monster Kanban of doom that anyone and everyone can dump tickets in. If you must have one, then make sure you have a dedicated product owner who does some measure of quality control on tickets.

Mistake 7: Communication and Enterprise-itis

First, let me define Enterprise-itis. Enterprise-itis is the condition where you see a problem that could be solved simply by half a dozen people working in a tightly-knit team, and immediately decide that it would be so much better if it was split between four teams, each with their own project manager, development methodology, and time zone.

Interestingly, it's not the multiple teams themselves that are a problem here; it can work perfectly well… providing all your communication is peer to peer. We didn't do this; we appointed what amounted to gatekeepers through whom all conversation between teams had to flow.

Never do that. You might need people to help facilitate conversations, but the key is that they facilitate. Once you have people who are gatekeeping, you immediately run into two problems. The first is that your intermediaries will misunderstand things or otherwise fail to pass them on. The second is a problematic consequence of human nature: shorn of direct communication with their peers, teams will start to develop us-and-them mentalities. This is not conducive to problem solving in a complex solution where a problem might have causes in several different areas of responsibility.

Again, this was a problem we just didn't have on Little Team. There, our procedure was generally to ask who broke something (or pre-emptively admit we'd broken it) and fix it within five minutes. With all those intermediaries, the procedure became: escalate to a manager, who escalated to another manager, who escalated… and usually the first anyone heard about a misconfiguration was being dragged into a meeting anything up to a week later, to be stared at by lots of angry senior managers saying things like "P1 issue", "at-risk delivery date" and "blocked for days".

Don't be like this. If you're a developer, ask to talk to your peers on the other team. If you're a manager, learn to ask the question, "why is this being escalated rather than being fixed by people talking to each other?" Even something as simple as getting everyone on a single Slack channel to discuss outages and errors can be a huge boost to productivity and co-ownership.

What to do when it's all gone wrong

Add together all of those mistakes and you too can end up in the place I was in earlier in the year: watching a highly visible project hurtling toward the rocks, driven there by a huge committee of people all convinced they were doing the right thing. Our release date was two weeks away, our Kanban was a towering nightmare of unmoving tickets, our DevOps guy was AWOL, our teams were at each other's throats, and to cap it all nobody outside of the immediate team had a clue the situation was anything but green lights all round and everything ready to go.

Here's the twist: we went live on time. In the launch week I had only a single support call, which turned out to be someone trying to find out who could reset their password.

How? After a year of living in Enterprise Land, we took about half a dozen people and went Little Team on it. We admitted to ourselves: "This is broken. We don't know why it doesn't work. We don't know how to fix it. But what we've got on our side is we're smart people who can talk to each other."

We started by taking our enormous backlog of items and working out which were outright business blockers to going live, which were outright technical blockers to going live, and which were horrors where we'd just have to cross our fingers and hope nothing went wrong in production until they were fixed.

Next we took the smaller set of things which would outright prevent us from going live, and split them up so every ticket in there explained what needed to be done, how to tell when it was complete, and how long we thought it'd take. We then deleted the duplicates and shoved everything else way down on the backlog, spent a few hours patiently explaining to people why their "critical" ticket had been moved off the priority list, and got to work.

This was the classic Little Team pattern. Take something that needs doing off the wall. (Oh yes, we also built a physical wall board out of index cards and thumb tacks, so everyone could see where we were at rather than relying on an intermediary to filter the information.) Work out how you're going to solve it. Keep testing against the definition of done until it's satisfied. Don't move on to a different ticket until you're finished. Repeat this until all of the blocking issues are fixed.

Doing this was hellish, because none of us had any clue what we were doing at first, the deadline was looming large, and every problem we fixed uncovered several more things we just could not go live with.

However, as with Little Team, knowing exactly what you need to do and believing you'll figure it out turned out to be sufficient. About three days before go-live, I lifted my head from Terraform manifests and user data scripts for the first time and realised that actually, we had a system we could go live with. Not a good system, but at least something that I could trust to run unattended without taking itself down in an auto-scaling incident.

Good Idea 1: Fixing it

The one thing we did do right: we owned our technical debt. Yeah, we screwed up. But as a result of that, we sat down and worked out what had gone wrong and how we were going to fix it. Better than that, we stopped trying to report up green, went to our biggest technical stakeholder, and straight-up admitted that the system they'd piled money into for months on end was held together by duct tape and chewing gum, and needed over a month of remedial work.

They agreed to this. Maybe a couple of outages on our now-critical dev environment helped, but it's also a worthy reminder that the big and scary senior people are often accommodating, providing you actually take the decision to them rather than trying to bury it in charts and spreadsheets saying everything's fine when it isn't.

This isn't to say all the problems disappeared overnight; we still suffered from environment escalation, poor Kanban hygiene and communication issues, but we also brought a bit of Little Team to proceedings. When we got error reports direct from the people who noticed them we immediately dropped what we were doing to triage; when they came through as escalations from managers we just yelled at people for escalating things that could have been resolved with a quick chat. (It only took three escalation meetings for everyone to learn that it was best to ask the team before sending an angry e-mail to a manager.)

What we also did was spread a lot of skill around in-house. Fixing something that's broken is an excellent way to learn a lot about your tools very quickly, especially when it comes to how not to use them. Because we didn't want to have to fix the same problems again, we also got good at documenting the ways not to do things.

Good Idea 2: Making the team smaller

One interesting trend over the course of our DevOps death spiral was that the team associated with it got smaller and smaller. We started out with over forty people involved, but by the time I was committing the final fixes I was only dealing with four or five people. This wasn't just a step change between the original project and the post-project clean-up, either; it was the continuation of a trend that had been running throughout most of the project.

The important point about this is that every time someone is removed from a large project, it removes a huge number of communication channels which need to be maintained. This is even more important when they take one of the gatekeeper roles; suddenly you're able to talk to people directly, and things get a lot more productive. At the start of the project I was maybe four or five steps removed from my ultimate stakeholder; by the end I was only one step removed, which meant I could get decisions and information far quicker.
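
To put rough numbers on that: n people have n(n-1)/2 potential lines of communication between them, so the forty-odd people we started with represented the best part of 800 channels to keep in sync, while the final four or five had about ten.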

I don't want to sound overly cynical here, but having a project which spends a lot of time looking like it'll be a colossal failure is an excellent way to find out who is committed and who is merely involved; the latter tend to quickly find something else to get involved with when things look like they're going south!

Good Idea 3: Learning

This is a really important one; we did make an effort to learn from this. A lot of organisations don't. Hell, we've had our fair share of retrospectives packed with great lessons, followed a few months later by yet another retrospective full of exactly the same mistakes and exactly the same lessons.

I'm not saying we're perfect, and there's a long way to go, but when I look around at the projects we've started since, I see:

  • Smaller teams, who are focused on a large product goal rather than a single component.
  • Shorter chains between developers and stakeholders.
  • DevOps-focused engineers associated with a product team, rather than a function.
  • Skills such as build and infrastructure engineering being shared amongst a team, with everyone doing a little of it at some point.

That's the most important thing out of all of this, for me. Little Team did it right, but didn't know why, which meant we weren't able to give the right advice when things did start going wrong. In Enterprise Land, we didn't just get it wrong, we got it spectacularly wrong. But we understand why, and now we can see the warning signs in new projects.

I hope you've enjoyed reading this, long as it was, and that it helps someone avoid some of the same mistakes we made.

Addendum: Technical Mistakes We Made

I won't go into these in detail, because I think the organisational errors we made are more important, but here are some of the technical mistakes we made as a result of all those organisational errors and trying to push DevOps into its own little silo:

  • Bad health checks destroying instances that are actually healthy, causing endless reboot loops
  • No error handling in deployment scripts, causing automatic deployments to be blocked
  • Servers not registering themselves automatically with the deployment system
  • Doing things outside Terraform, then having them overwritten when Terraform is next run
  • Assuming errors in scripts are Terraform/AWS/etc. bugs, rather than investigating properly
  • Doing too many Terraform partial apply operations, missing key resources
  • Failing to test what happens when instances go down across all types of instance, databases included
  • Not specifying explicit versions of software packages, causing incompatible packages to be installed on newly-provisioned servers
  • Relying on manual steps to fully provision servers (!)
  • Insufficient error checking in provisioning scripts
  • Incorrect clustered software configuration causing clusters to be unable to auto-recover from a lost server
  • Leaving backup scripts and log shipping configurations in a non-working state
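
To pick on just two of those - the missing error handling and the unpinned package versions - the fix doesn't have to be clever. A minimal sketch of the shape of a provisioning step; the package names, versions and registration helper here are purely illustrative, not our actual stack:

```python
#!/usr/bin/env python3
"""Provisioning sketch: stop on the first failure, install pinned versions."""
import subprocess
import sys

# Pin exact versions so a freshly provisioned server gets the same packages
# as every other server, not whatever happened to be latest that day.
PINNED_PACKAGES = [
    "nginx=1.18.0-0ubuntu1",        # illustrative version pins
    "puppet-agent=7.21.0-1focal",
]


def run(cmd):
    """Run a command and abort provisioning immediately if it fails."""
    print("+ " + " ".join(cmd), flush=True)
    result = subprocess.run(cmd)
    if result.returncode != 0:
        # Fail loudly rather than carrying on with a half-built box.
        sys.exit("Provisioning step failed (%d): %s" % (result.returncode, cmd))


def main():
    run(["apt-get", "update"])
    run(["apt-get", "install", "-y"] + PINNED_PACKAGES)
    # Hypothetical helper: register with the deployment system automatically
    # rather than leaving it as a manual step.
    run(["/usr/local/bin/register-with-deployer", "--environment", "scratch"])


if __name__ == "__main__":
    main()
```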

It almost goes without saying: don't do any of the above. It breaks stuff.