How to spend next to nothing on AWS

Here's something you don't hear often from a head of engineering: my AWS bill is tiny. I wish I could share the actual figure, because it looks like a typo. To give you an idea: there's one place where I could save another 20% with a little engineering work, and I don't do it because it would take over a year to break even on the cost of that work.

But... how?

The answer comes down to a surprisingly small number of simple things:

  • Avoiding things which are expensive to not use
  • Consolidating always-on resources
  • Good old-fashioned optimisation

Expensive to not use?

The classic AWS cost footgun is leaving instances running, especially in today's AI-enhanced world where those might be beefy GPU-equipped instance classes with equally beefy prices. But AWS is full of traps like this: services where 24/7 availability means you need 750 hours of a $0.10/hour resource, and suddenly you're in for $75 per month, per environment or even per project, to occasionally run one container in response to an HTTP request or whatever.
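The arithmetic on that always-on baseline is worth internalising. A quick sketch, using the figures above (750 hours is the month-length AWS free-tier allowances assume):

```go
package main

import "fmt"

func main() {
	// An always-on resource billed per hour: ~750 hours in a month,
	// the figure AWS free-tier allowances use.
	const hoursPerMonth = 750.0
	const ratePerHour = 0.10 // $/hour for the hypothetical resource

	monthly := hoursPerMonth * ratePerHour
	fmt.Printf("$%.2f per month, per environment\n", monthly) // $75.00 per month, per environment
}
```

Multiply that by every environment and every project that keeps one of these warm, and "not doing very much" gets expensive fast.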

This is fine if you're actually using these things, but if you're not, it can easily mount up to a lot of money spent on not doing very much.

There are some services which are very cheap to not use:

  • SNS
  • SQS
  • Lambda
  • API Gateway
  • CloudFront

And I know, these are the ones you get warned off because at high volumes they can become catastrophically expensive compared to running your own. But the trick is to remember that not everyone has those high volumes, and you have the data on whether you do. Don't pass up running your entire business on something which doesn't even trigger the free tier warning email until the 20th of the month, just because it might start getting more expensive than running everything on instances once you have 1000x the volume.

(This, so long as you got the unit economics right, is a nice problem to have and therefore fine to defer until you actually have it)

The other thing with the SNS/SQS/Lambda architecture is you can push the idea of "only compute when you need to compute" hard. At Bezos we've built the entire platform around these services, so while I know it can process a sudden spike of 5,000 orders without a problem, when there aren't any products or orders to update it's costing me $0.
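The shape of such a queue-driven handler can be sketched as below. In production this would be registered with `lambda.Start` and receive `events.SQSEvent` from the aws-lambda-go package; here the message type is a stand-in so the example is self-contained, and `processOrder` and the `orderUpdate` fields are hypothetical placeholders:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// sqsMessage is a stand-in for events.SQSMessage from aws-lambda-go.
type sqsMessage struct {
	Body string
}

// orderUpdate is a hypothetical message shape for this sketch.
type orderUpdate struct {
	OrderID string `json:"order_id"`
	Status  string `json:"status"`
}

// handleBatch processes one SQS batch. When the queue is empty no
// Lambda runs and this costs nothing; a spike of 5,000 orders just
// means more concurrent invocations of the same function.
func handleBatch(msgs []sqsMessage) error {
	for _, m := range msgs {
		var o orderUpdate
		if err := json.Unmarshal([]byte(m.Body), &o); err != nil {
			return err // failed batches return to the queue for retry
		}
		processOrder(o)
	}
	return nil
}

// processOrder stands in for the real business logic.
func processOrder(o orderUpdate) {
	fmt.Printf("order %s -> %s\n", o.OrderID, o.Status)
}

func main() {
	_ = handleBatch([]sqsMessage{{Body: `{"order_id":"A1","status":"shipped"}`}})
}
```

The pay-per-use property falls out of the architecture rather than the code: nothing here runs until SQS has messages to deliver.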

Even where you do need good old-fashioned EC2 instances, consider bringing them up and down on demand if you can tolerate the cold start time. For example our build servers have an EventBridge-driven Lambda function which switches them off if no builds have run in the last few minutes.

Consolidating Resources

This is a good start, but there are still some things which will need always-on services, or where your particular use case is going to make the serverless version become unsustainably expensive by scaling along a dimension which isn't directly driving revenues.

For us, these always-on services are:

  • RDS
  • ElastiCache Redis
  • NAT gateways

What we've done here is consolidate them. Each environment has one decent-sized database cluster and one decent-sized Redis cluster. This goes against conventional wisdom, which is to distribute these things so the load from one system doesn't affect another, and I'll admit there's a tradeoff here: we're spending less on databases, but we need to monitor more carefully and make sure all connections are tagged with an application name so we know where the load is coming from.
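For a Postgres engine, that tagging is one parameter in the connection string (this is an assumption for illustration; other engines have equivalent connection attributes, and `buildDSN` and the host/database names here are hypothetical):

```go
package main

import "fmt"

// buildDSN returns a Postgres connection string that tags every
// connection with the owning application, so views like
// pg_stat_activity show which service the load comes from.
// (Assumes a Postgres engine; host and db names are placeholders.)
func buildDSN(app, host, db string) string {
	return fmt.Sprintf(
		"host=%s dbname=%s application_name=%s sslmode=require",
		host, db, app,
	)
}

func main() {
	fmt.Println(buildDSN("orders-api", "shared-cluster.internal", "platform"))
}
```

With every service setting its own name, a spike on the shared cluster points straight at the culprit instead of triggering an investigation.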

(There's also the consideration that operating a Lambda-based platform results in a lot of database connections - managing this and reducing the amount of dependency on the database as a single point of failure is part of the work we have to do as an engineering team)

The other thing here is to strike the right balance between using the correct tool for the job and not proliferating services, as these three items account for over 50% of our monthly bill. For us, most of the things we need fall into the area of either a relational database or a fast key-value store, so this setup works well.


Good old-fashioned optimisation

The real secret, though, is optimisation.

In particular, database optimisation. While it's common practice to let an ORM handle the dirty work and hope it does a reasonable job of writing optimal queries, we hand-tweak all of our common SQL queries and constantly monitor the database to see which queries are most expensive in compute time and whether more can be shaved off them.
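As a sketch of what that monitoring looks like, assuming a Postgres engine with the `pg_stat_statements` extension enabled (the column names below are the PostgreSQL 13+ ones):

```go
package main

import "fmt"

// topQueries ranks statements by total execution time so the most
// expensive queries surface first. Assumes Postgres with the
// pg_stat_statements extension enabled.
const topQueries = `
SELECT query,
       calls,
       total_exec_time,
       mean_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;`

func main() {
	// In practice you'd open a *sql.DB via database/sql with a
	// Postgres driver, run topQueries on a schedule, and look at
	// whichever statement's share of total_exec_time is growing.
	fmt.Println(topQueries)
}
```
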

Our database load has gone down since our last major platform overhaul, despite order volumes growing more than 4x in that time. This is great, because it pays off in two places:

  1. We can run a smaller, cheaper RDS cluster
  2. We spend less on Lambda functions waiting for queries to execute

When it comes to those Lambda functions, we aim to write functions which fit in the basic 128MB class without being bottlenecked by CPU or network bandwidth. Working in Go helps here. It means giving up some of the richness of framework and ecosystem that more standard web languages offer, but being in control of what's going on and writing blazing-fast code with minimal requirements is a worthwhile trade-off.
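A rough way to sanity-check locally whether a handler's working set fits the 128MB class (a sketch only; Lambda's own "Max Memory Used" report in the invocation logs is the real ground truth, and `heapAfter` is a hypothetical helper):

```go
package main

import (
	"fmt"
	"runtime"
)

// heapAfter runs fn and returns the live heap size in bytes once a
// GC has settled. It's only a rough local proxy for the "Max Memory
// Used" figure Lambda reports after each invocation.
func heapAfter(fn func()) uint64 {
	fn()
	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapAlloc
}

func main() {
	heap := heapAfter(func() {
		// Stand-in for one handler invocation's allocations.
		buf := make([]byte, 1<<20)
		_ = buf
	})
	const limit = 128 << 20 // the basic 128MB Lambda class
	fmt.Printf("live heap: %d bytes (limit %d)\n", heap, limit)
}
```
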

The end result is where I want to be: a small team getting a lot done, not distracted by rounds of cost control or worried that we've built a beautiful reference implementation that can never be profitable at our current scale. (What we have, by contrast, more than pays for itself before we even finish the first day of the month.)