ECS and containers with long bootstrap times

I've been playing with Amazon's EC2 Container Service (ECS) recently, and it's proving an impressive piece of kit. It's not quite one-click, but as a set of tools for putting Docker containers on machine instances, registering them with load balancers, and automatically handling connection draining and migration when you update your code, it makes things a lot easier than traditional infrastructure.

There are some fun new things to learn, though.

We encountered a strange problem where one of our containers would launch successfully, process a bunch of input files it needed in order to bring up a web service, then as soon as the service came up, shut down and start the whole process again. Even more oddly, an instance which had been running the same container for several days worked fine - and running the container in a regular virtual machine didn't cause trouble either.

It turned out to be a seemingly innocuous load balancer configuration change: we'd made the health check more frequent. Unfortunately, with ECS, when a container fails more than the threshold number of consecutive health checks, the load balancer marks it unhealthy and ECS responds by stopping the container and starting a fresh one. And the container is registered with the load balancer the moment it starts running. This is fine for most containers, but if your container takes a long time to get from 'docker start' to actually serving requests, it's going to fail a few health checks while it gets going. In our case, it hit the threshold number of failed checks just before it came online... at which point ECS restarted the container to do the same thing again.
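To make the arithmetic concrete, here is a rough sketch in Python. It assumes the container is given up on after roughly (check interval x unhealthy threshold) seconds from registration, which ignores check timeouts and exactly when the first check fires. The function names and the numbers (a 120 second bootstrap, a threshold of 5 checks) are made up for illustration, not our real settings.

```python
def seconds_until_unhealthy(interval_s, unhealthy_threshold):
    """Approximate time from registration until the container is
    declared unhealthy, assuming every check fails during bootstrap."""
    return interval_s * unhealthy_threshold


def survives_startup(startup_s, interval_s, unhealthy_threshold):
    """True if the container starts serving before the threshold is hit."""
    return startup_s < seconds_until_unhealthy(interval_s, unhealthy_threshold)


# Hypothetical numbers: the container needs 120 seconds to bootstrap.
STARTUP_S = 120

# A check every 30 seconds with a threshold of 5 leaves a 150 second window.
print(survives_startup(STARTUP_S, interval_s=30, unhealthy_threshold=5))

# Tightening the interval to 10 seconds shrinks the window to 50 seconds,
# so the container is killed before it ever comes up.
print(survives_startup(STARTUP_S, interval_s=10, unhealthy_threshold=5))
```

The point is that shortening the interval without raising the threshold quietly shrinks the startup window, even though nothing about the container itself has changed.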

The lesson there is twofold: make sure your containers start swiftly, and make sure your load balancer is lenient enough about failed health checks that even a slow startup has enough slack to complete.
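One way to turn "lenient enough" into numbers is to work backwards from the startup time. The sketch below is only that, a sketch: the helper names, the 1.5x safety margin and the example target are my own assumptions. The dictionary keys follow the shape of the classic Elastic Load Balancing health check settings (Target, Interval, Timeout, UnhealthyThreshold, HealthyThreshold), and the 2 to 10 range for the unhealthy threshold is what I believe the classic load balancer allows, so check the current documentation before relying on it.

```python
import math

# Assumed limits for the classic load balancer's unhealthy threshold.
MIN_THRESHOLD = 2
MAX_THRESHOLD = 10


def min_unhealthy_threshold(startup_s, interval_s, margin=1.5):
    """Smallest threshold whose window covers the startup time plus a margin."""
    needed = math.ceil(startup_s * margin / interval_s)
    return max(MIN_THRESHOLD, needed)


def health_check_settings(startup_s, interval_s, target, margin=1.5):
    """Build a health check configuration that tolerates a slow startup.

    Raises ValueError if no allowed threshold is lenient enough, in which
    case the interval needs to be longer (or the container faster).
    """
    threshold = min_unhealthy_threshold(startup_s, interval_s, margin)
    if threshold > MAX_THRESHOLD:
        raise ValueError(
            "interval of %ds is too short for a %ds startup" % (interval_s, startup_s)
        )
    return {
        "Target": target,
        "Interval": interval_s,
        "Timeout": min(5, interval_s - 1),  # keep the timeout below the interval
        "UnhealthyThreshold": threshold,
        "HealthyThreshold": 2,
    }


# Hypothetical example: a 120 second bootstrap checked every 30 seconds.
print(health_check_settings(120, 30, "HTTP:8080/health"))
```

With those made-up numbers the threshold comes out at 6, giving a 180 second window. Asking for a 10 second interval with the same startup time raises an error instead, which is roughly the situation we had walked into.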