IaC and the post-environment world

For about a decade I struggled with the common problem of environments.

You usually start off with some well-meaning attempt at a simple layout: development, integration, staging and live. That feels fine, until you realise your company has different release cycles for software and data. So you split staging into one environment which has the latest code but data from live, and another which has the latest data but the stable version of the code.

That's manageable - but then you find that your marketing team are getting frustrated trying to demo things from code-staging and hitting bugs, so you add a demo environment for them. Your training team are struggling with rewriting their training materials every time the data changes, so you add a training environment which combines live code with a fixed data set.

This is probably the point at which your operations people step in and ask, "you want how many servers?" The realities of cost and capacity now bite. You might end up with lots of notional environments collapsed together - boxes which are staging, except when they're demo, except when they're data QA. Or a server carved up into so many VMs that you spend half your time explaining to frustrated stakeholders that the demo is slow because it's running on a single-core box with 512MB of memory, and things will be fine on live.

Finally, after a few years, all of your environments end up full of cruft from abandoned projects, snowflake configurations that were hand-hacked for one reason or another, and the inevitable special hand-deployed version of an application which didn't fit the environment scheme at all.

The root cause of all this is combinatorial: given software version x, API version y and data version z, everybody eventually wants a different combination of x, y and z to suit their needs, be it integration testing against a known data set or demoing new features to a client.
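
To see how quickly this explodes, a quick back-of-the-envelope calculation helps: even two concurrent versions of each moving part gives eight distinct combinations, any of which some stakeholder may legitimately need. (The version numbers below are purely illustrative.)

```python
from itertools import product

# Illustrative version lists - just two concurrent versions of each moving part.
app_versions = ["1.4.6", "1.5.3"]
api_versions = ["2.22.1", "2.22.9"]
data_versions = ["2016.09.12-2", "2016.10.19-1"]

# Every (app, api, data) triple is a potentially-requested environment.
combos = list(product(app_versions, api_versions, data_versions))
print(len(combos))  # 2 x 2 x 2 = 8 distinct environments
```

Add a third version of anything and you're at twelve; no static environment scheme keeps up with that.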

The solution? Stop building static environments.

Take the example of an organisation using a public or private cloud, with scripts to bring up the needed infrastructure and some way of deterministically placing a specific version of the code or data on it. (I've got containers in mind here, but there's nothing to stop you following this model with something like Octopus Deploy and NuGet packages for your application code and database scripts.) Rather than trying to figure out when it's "safe" to deploy to demo, or reconciling two stakeholders who want different versions of things, all you do is ask, "what versions do you need, and how long do you need them for?"

Once you know those requirements, you feed them into your scripts to spin up a couple of VMs and deploy application codebase 1.4.6, API 2.22.9 and data update 2016.10.19-1 to them. Send the URL to your stakeholder and forget about it. If someone else asks you for app 1.5.3, API 2.22.1 and data update 2016.09.12-2, you don't have to figure out whether that's staging or demo or data QA - you just spin one up for them.
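
The request itself can be captured as a small structure. This is only a sketch - the class and field names are hypothetical, not a prescription - but it shows the key idea: an environment is fully described by its version triple plus a lifespan, and the same combination always maps to the same deterministic name.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EnvironmentRequest:
    """A stakeholder's answer to 'what versions, and for how long?'"""
    app_version: str
    api_version: str
    data_version: str
    ttl_hours: int  # how long the environment should live

def environment_name(req: EnvironmentRequest) -> str:
    # Deterministic naming: the same version combination always yields the
    # same name, so two identical requests can share one environment.
    return f"app{req.app_version}-api{req.api_version}-data{req.data_version}"

req = EnvironmentRequest("1.4.6", "2.22.9", "2016.10.19-1", ttl_hours=8)
print(environment_name(req))  # app1.4.6-api2.22.9-data2016.10.19-1
```

Whatever feeds your provisioning scripts - a ticket, a chat bot, a dashboard - ultimately just needs to produce something of this shape.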

The other bonus is running things like performance tests or final UAT, which need an environment as close to live as possible. Most companies shy away from this due to the sheer expense of having a high-end, multi-server, redundant and load-balanced environment sitting idle most of the time. If you create environments on demand, this is less of an issue - you spin up your big production-like environment for an hour or two, run your performance tests, then terminate it.
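
The shape of a perf-test run is then just spin up, test, tear down. In this sketch, `provision()`, `run_load_tests()` and `terminate()` are hypothetical stand-ins for your real infrastructure scripts; the important part is the `try`/`finally`, which guarantees the expensive environment dies even if the tests blow up.

```python
# Stand-ins for real infrastructure scripts; the log lets us see the lifecycle.
lifecycle_log = []

def provision(profile, **versions):
    lifecycle_log.append(("provision", profile))
    return "env-001"  # pretend environment id

def run_load_tests(env_id):
    lifecycle_log.append(("test", env_id))

def terminate(env_id):
    lifecycle_log.append(("terminate", env_id))

def run_perf_suite():
    env_id = provision("prod-like", app="1.4.6", api="2.22.9", data="2016.10.19-1")
    try:
        run_load_tests(env_id)
    finally:
        # Tear down even if the tests fail - an idle production-sized
        # environment is the expensive failure mode.
        terminate(env_id)

run_perf_suite()
print(lifecycle_log)
```

The same pattern works for any short-lived environment: the teardown belongs in the script, not in somebody's memory.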

This is so much better and more cost-effective than maintaining 24/7 hardware for all these oddball environments and arguing about why a particular version was deployed somewhere ahead of an important client demo. Plus, by switching your transient environments off when you're done with them, you avoid the build-up of legacy software, or well-meaning people installing critical services on a QA server because that happens to be the box they've got access to.

That's not to say there aren't new challenges with this approach. Some of the problems you'll have to solve include:

  • Billing models - costs are less predictable than a static batch of servers, albeit more attributable.
    • You need to make sure environments are tagged so it's obvious where the money is going.
    • There are cultural issues too; people don't always like having a cost directly assigned to their client demo.
    • In public cloud, make use of reserved instance capacity for your baseline demand, and spot pricing for things which are less time-sensitive such as performance testing.
  • You need to be very disciplined at making sure unused environments are terminated. It's easy to rack up large bills by leaving things running when they're not being used.
  • You need a good relationship between stakeholders and developers. This approach won't work unless people are used to saying, "I'm going to demo x to client y today, what URL should I use?"
    • Having a plan of when things are needed allows you to reduce environment lifespans from days to hours, making things far more cost-effective.
  • A certain amount of extra effort is required to make sure your infrastructure scripts and deployment packages are truly generic, and can create environments from scratch.
  • Be aware of resource limitations (such as core count limits), especially using private clouds or on-premise hardware. Don't unnecessarily proliferate environments or leave them running when they're not needed.
  • You need to understand the difference between "on", "off" and "terminated". Often users will want their data to persist between uses of the environment, but don't actually need it running all the time. On the other hand, when you're finished with an environment you don't want to pay for its storage forever - terminate it!
  • Shared APIs and co-ordination in large organisations can become a challenge. Consider having a self-service dashboard so people can bring up environments themselves on demand.
    • Serverless designs using PaaS tools and services like AWS Lambda can also help, by allowing you to have long-lived environments at low cost.
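
Several of the points above - termination discipline, tagging, planned lifespans - can be automated together. As a sketch (the inventory and field names are hypothetical; in practice the list would come from your cloud provider's API, keyed off the tags applied at creation time), a scheduled reaper just terminates anything past its expiry:

```python
from datetime import datetime, timezone

# Hypothetical inventory, as it might be reconstructed from environment tags.
environments = [
    {"name": "demo-clientX", "owner": "marketing",
     "expires": datetime(2016, 10, 19, 17, tzinfo=timezone.utc)},
    {"name": "perf-run-42", "owner": "qa",
     "expires": datetime(2016, 10, 20, 9, tzinfo=timezone.utc)},
]

def expired(envs, now):
    """Return environments whose TTL has passed - candidates for termination."""
    return [e for e in envs if e["expires"] <= now]

now = datetime(2016, 10, 19, 18, tzinfo=timezone.utc)
for env in expired(environments, now):
    # In real life this would call your termination script; the owner tag
    # tells you who to notify and where the bill goes.
    print(f"terminating {env['name']} (owner: {env['owner']})")
```

Run something like this on a schedule and "I forgot to switch it off" stops being a line item.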

Although this seems like a long list, it's still a hell of a lot better than trying to thrash around in an undersized box full of cruft from three years ago, or moderating yet another argument about who changed whose code or data. Many of these are problems you only have to solve once, and can then ignore while you get on with releasing software. The post-environment world is here, and it's a good place to be.