A Real-World Scaling Challenge

This week I've been solving scaling challenges. Not the kind of CV-embellishing "at global scale" stuff, but some simple and honest problems I actually have. It's all bundled up into one scaling problem, which breaks down roughly across the following areas:

  • Stuff we deliberately kept simple and scrappy to save time.
  • Dumb stuff Kimber did from not understanding the problem or the platform.
  • Constraints we hacked our way around, but now need to clean up.

Against the backdrop of aspirational tech blogs and presentations given by 1000+ person companies, this may seem an almost cowboy-like admission of throwing things together in between bouts of rootin' and/or tootin'. But this isn't that context. Let's go up a level and talk about scaling as it happens in the real world.

Scaling In The Real World

Most of us aren't working on "this needs to handle millions of concurrent connections" problems every day. I have a standing theory that the two scaling problems most commonly faced by software engineers are:

  1. 0 to 1

  2. 1 to 2

To put it into words, "making it work the first time" and "making it work when two things happen concurrently".

The biggest obstacle to "making it work the first time" is the classic one: you don't have a product until you've built and deployed it. One of the easiest ways to make getting from 0 to 1 hard is sitting in front of your IDE going, "how am I going to make this handle 10,000 requests at a time from 10 million unique users?" Scaling from 0 to 1 involves building something, deploying it, making sure it has connectivity and makes the right network requests, and so on. The most resilient, global-scale design still has a capacity of 0 if it only exists on a drawing board.

Once you have something which works, you will then encounter the next scaling problem: "what if two things happen at the same time?" This might sound like a joke, but here are some real scenarios where I've hit exactly this problem:

  • A multi-stage form saves the current answers to a database. There's no unique session or user key involved.
  • A queue-driven system assumes SQS always has exactly one message to process.
  • A queue-driven system has only a single consumer thread and discards messages if that consumer is not available.

Guess what happened in these scenarios when the second user attempted to use the form at the same time, two messages arrived within SQS's batching window, and a second message arrived while the first was already being processed? 1 to 2 is a very real scaling problem.
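
The second scenario is worth spelling out, because it's such an easy trap: a handler that only ever looks at the first record in the event works perfectly in every single-message test, then silently drops data the first time SQS delivers a batch of two. A minimal sketch (process_message is a hypothetical stand-in for whatever the handler actually does):

```python
# A minimal sketch of the "assumes exactly one message" bug and its fix.
# process_message is a hypothetical stand-in for the real handling logic.

def process_message(body: str) -> None:
    ...  # parse the message, update the database, call the warehouse API, etc.

# Broken: fine in testing, silently drops records once SQS delivers a batch of two.
def handler_broken(event, context):
    process_message(event["Records"][0]["body"])

# Fixed: handle every record in the batch, however many arrive.
def handler(event, context):
    for record in event["Records"]:
        process_message(record["body"])
```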

It's also worth noting that the two queue-based systems followed the standard high-level design for handling many thousands of messages. It was the low-level configuration details that tripped them up. Concentrating on "100,000" when you really should be thinking about and testing for "2" can be harmful.

Once you've solved 1 to 2 as a scaling problem, you're usually good until somewhere around 50-100, depending on how computationally expensive the individual operations are. In my experience, once you get to that point you'll discover there's a scaling factor or bottleneck you didn't expect which only becomes obvious with the increased load. This holds true even if you engineer in plenty of autoscaling.

Of course, engineering in all of that autoscaling takes time and effort, which is wasted if you get to 100 concurrent requests and find that while your compute is sitting at minimal cluster size with 5% CPU usage, a database query you thought inconsequential has maxed out disk I/O.

Scaling Rules Of Thumb

All of this gets me to my rules of thumb on scaling in the real world:

  • A good scaling improvement will buy you an extra order of magnitude (e.g. fixing 10->100 will most likely get you to 1000)
  • It's not worth trying to engineer more than those 2 orders of magnitude at a time
  • 0->1 and 1->2 should be considered orders of magnitude in and of themselves

These rules are most effective applied iteratively, when you need the next scaling increment. (It's possible you might need to do 2-3 loops immediately if building a high-volume system from scratch, but as mentioned above this is rare for most of us. It's more common to be extending an existing high-volume system, or building something new that not many people will use.)

So what did this mean for me in my day job?

Scaling Products at Bezos

At Bezos, we need product definitions to be up-to-date in every warehouse we might send orders to. If not, warehouses might pick the wrong products, put them in the wrong packaging or miss important packing instructions - all of which mean disappointed customers and frustrated sellers.

At first, we had a very simple approach, based on an AWS Lambda function which would run on a cron-type schedule, performing the following (sketched in code after the list):

  • Get the full list of products from Bezos' own system.
  • For each product, get the definition from each warehouse the seller's orders might be sent to.
  • Update/create the products as necessary.
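
In code, the whole thing was not much more sophisticated than the loop below. This is a simplified sketch: the classes and the warehouses_for helper are hypothetical stand-ins for our real API clients, not our production code.

```python
# A simplified sketch of the original cron-style sync lambda. The classes and
# warehouses_for are hypothetical stand-ins, not our production code.

class ProductCatalogue:
    def list_all_products(self) -> list[dict]:
        return []  # real version calls Bezos' own product API

class WarehouseClient:
    def get_product(self, sku: str) -> dict | None:
        return None  # real version calls the warehouse's WMS API

    def upsert_product(self, product: dict) -> None:
        pass  # real version creates/updates the product in the WMS

def warehouses_for(product: dict) -> list[WarehouseClient]:
    return []  # which warehouses might this seller's orders be sent to?

def handler(event, context):
    catalogue = ProductCatalogue()
    for product in catalogue.list_all_products():        # full list, every run
        for warehouse in warehouses_for(product):
            existing = warehouse.get_product(product["sku"])
            if existing != product:                       # only touch what differs
                warehouse.upsert_product(product)
```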

What a nightmare, right? I once comprehensively failed a tech interview for mentioning this kind of solution off-hand as a theoretical possibility; the interviewer was so horrified I could even consider it a valid concept that they spent the rest of the interview showing me all the ways it wouldn't work if, say, a million users suddenly turned up.

Points to consider here:

  • We didn't have a million users.
  • This was our production system for our first 15 months or so of operations.
  • It worked fine.
  • I mean that. Other, theoretically better-designed systems caused us more hassle.
  • It was quick to build and easy to maintain.

The point is we solved 0->1 and 1->2, which gave us a working system at a critical early startup point where the most useful thing you can do is get systems working and deployed.

For transparency, here are the problems we encountered over those first 15 months:

  • We synchronised a whole bunch of irrelevant information to the warehouse, which involved up to 20-30 network calls per product to obtain and update.
  • We added support for a warehouse management system (WMS) that charged per API request, which made frequent synchronisation expensive.
  • Some WMSs returned data in a way which was hard to map to ours, causing us to repeatedly attempt updates on the same product wherever it appeared to differ from our definition.

Note that while our approach worsened these, building a more scalable system would not have prevented any of them. It might have let us live with the irrelevant-information problem by making "slow syncs" less impactful, but since fixing that took a grand total of 10 minutes to update tests, delete code and deploy, we were still ahead even after that.

Of course, after 15 months we hit what you're all waiting for:

  • Getting the full product list on each run took so long that the lambda function we use for product synchronisation could no longer finish before hitting its timeout.

This happened at around 2,000 live products in the system. When we built it, we barely had 20 products - that solution had quite happily served two order-of-magnitude increases (10->100 and 100->1000), and while there had been workarounds along the way, they were for unforeseen problems or things we didn't understand, not for the obvious scaling concerns.

One lesson here: originally our lambda execution was so quick that we didn't think to add alerts for timeouts or for execution time drifting significantly above its baseline. If you're going to defer your scaling work, you do need to think about the observability side, so you know when it's time to take action.
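
For what it's worth, even a single duration alarm would have flagged this months earlier. A sketch using boto3's CloudWatch client (the function name, threshold and topic ARN are illustrative values, not ours; Lambda reports its Duration metric in milliseconds):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alert when the sync lambda's average duration creeps towards its timeout.
# Function name, threshold and topic ARN are illustrative values.
cloudwatch.put_metric_alarm(
    AlarmName="product-sync-duration-high",
    Namespace="AWS/Lambda",
    MetricName="Duration",                 # reported in milliseconds
    Dimensions=[{"Name": "FunctionName", "Value": "product-sync"}],
    Statistic="Average",
    Period=300,                            # 5-minute windows
    EvaluationPeriods=3,                   # sustained growth, not a one-off spike
    Threshold=600_000,                     # 10 minutes, well before a 15-minute timeout
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:alerts"],  # your alerting topic
)
```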

Scaling Round 2: Schedules

The benefit of deferring scaling the system beyond the ~1,000 mark is that we learnt some useful properties:

  • The systems we interact with to get product data are almost all polling-based; there are no event sourcing mechanisms.
  • We have a fast API call for recently updated products.
  • We have a paged API for getting a complete list of products.
  • The paged API has a low rate limit for multiple repeated requests, for reasons outside of our control.
  • We suffer from drift if we don't check all products at some point (warehouses inadvertently edit details of products)
  • Warehouses do not necessarily have recently-updated or product listing APIs.
  • Getting a single product from a WMS is quick for all of our warehouse partners.
  • Most of the time is spent getting the list of products.
  • The positive user experience of changes appearing instantaneously in downstream tools is important; our users didn't "fire and forget" like we expected.
  • We don't need to automate the deletion of products, as this is rare.
  • The number of products per warehouse is under less scaling pressure (as we add more sellers and products, we also add more warehouses)
  • More than one process needs to obtain product information!

That last one surprised us - but we realised over time that we didn't only need to send products to a WMS. We also needed to keep a log of updates and current definitions, feed data to internal consistency checkers and put suitably transformed product data into our AI/ML pipelines. Our initial design followed a microservice-like approach, where each of these facets requested its own product list each time. This sounds like a scaling nightmare, but in reality it made little difference to how many products the system could handle; nothing ran at high frequency because we didn't realise at first how important any of this was.

Knowing this, we separated the problem into two core responsibilities:

  • Get lists of products which need to be checked (we call these "schedules")
  • Process a single product in a single way

This led to a solution where the product schedules could be assembled once, put on a topic, and the individual products would then fan out to consumers which would individually update the warehouse definitions, populate reporting databases, feed the consistency checkers and so on.
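
A rough sketch of the publishing side of that (the topic name, message format and client setup are illustrative, not our real code; each consumer subscribes its own queue to the topic and works through the same stream of products):

```python
import json
import boto3

sns = boto3.client("sns")
SCHEDULE_TOPIC_ARN = "arn:aws:sns:eu-west-1:123456789012:product-schedule"  # illustrative

def publish_schedule(product_ids: list[str], reason: str) -> None:
    """Publish one message per product; each consumer's queue is subscribed
    to the topic and picks up the same stream of products to process."""
    for product_id in product_ids:
        sns.publish(
            TopicArn=SCHEDULE_TOPIC_ARN,
            Message=json.dumps({"product_id": product_id, "reason": reason}),
        )
```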

Within the schedules we have multiple different "rings". Bear in mind this is ultimately a polling-based system; we're just hiding that clunkiness from the processors. Rings are how we poll in a smart way, to maximise responsiveness without creating colossal system load. To start with, we identified two rings:

  1. Recently updated products
  2. All products

Why do we separate these? Well, recently updated products is a fast and cheap API call, which we can make frequently. This means we can publish schedules of newly updated products within a minute of them being changed. (For our users, this is effectively "instant", given the time it takes to navigate the WMS.) Listing all products is much slower, but we can process one page of the paged endpoint every few minutes, so there's a slow but constant feed of things to check. This is needed in case the data isn't correct in the warehouse or wasn't picked up by the recent-updates API. We limit this to one complete crawl of the product list every 12 hours, as such defects are rare and usually minor.
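
Stripped of the infrastructure, the two rings are just two scheduled jobs. A sketch under assumed intervals (the API helpers, the cursor persistence and publish_schedule are hypothetical stand-ins for our real code):

```python
# A sketch of the two scheduling "rings". The API calls, the cursor helpers
# and publish_schedule are hypothetical stand-ins for our real code.

def recently_updated_product_ids(since_minutes: int) -> list[str]:
    return []  # fast, cheap "recently updated" API call

def list_product_page(page: int) -> tuple[list[str], int]:
    return [], page + 1  # slow, rate-limited paged listing API

def publish_schedule(product_ids: list[str], reason: str) -> None:
    pass  # puts one message per product on the schedule topic

def load_page_cursor() -> int:
    return 0   # hypothetical: read the crawl position from storage

def save_page_cursor(page: int) -> None:
    pass       # hypothetical: persist the crawl position for the next run

def recent_updates_ring(event, context):
    """Runs every minute: publish anything changed in the last couple of minutes."""
    publish_schedule(recently_updated_product_ids(since_minutes=2), reason="recent-update")

def full_crawl_ring(event, context):
    """Runs every few minutes: work through one page of the full product list,
    wrapping around so a complete crawl finishes within ~12 hours."""
    page = load_page_cursor()
    product_ids, next_page = list_product_page(page)
    publish_schedule(product_ids, reason="full-crawl")
    save_page_cursor(next_page)
```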

Finally, there's a little bit of configuration in the schedulers so that we can run lower frequencies for warehouses which can't handle the volume of API calls (or can, but would get a large bill for doing so). This would be painful if we had to repeat it across multiple downstream processors and keep it up to date, but because the scheduling is only in one place it's easy.
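
Because the cadence lives in one place, that per-warehouse tuning can be as simple as a lookup table. A sketch (the warehouse names and intervals are invented for illustration):

```python
# Illustrative per-warehouse overrides for the full-crawl ring; anything not
# listed falls back to the default cadence.
DEFAULT_CRAWL_INTERVAL_MINUTES = 5

CRAWL_INTERVAL_OVERRIDES = {
    "pay-per-call-wms": 30,   # charges per API request, so crawl sparingly
    "rate-limited-wms": 15,   # low rate limit on its product endpoints
}

def crawl_interval_for(warehouse: str) -> int:
    return CRAWL_INTERVAL_OVERRIDES.get(warehouse, DEFAULT_CRAWL_INTERVAL_MINUTES)
```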

What's important to note is the scaling problems we haven't tried to address:

  • At some point we will have more products than we can crawl in 12 hours. The current solution gets us to ~15,000 products (the next order of magnitude), and if we're prepared to alter the scheduling a little we'll be able to handle 100k. There's a rough sanity check of that figure after this list.
  • Getting recently updated products is currently fast because there are not many. This may follow a similar path to our original assumptions about all products as we have more sellers and more products being updated at a time.
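
The ~15,000 figure falls out of back-of-the-envelope arithmetic. The page size and per-page interval below are assumed numbers chosen for illustration, not our actual configuration:

```python
# Back-of-the-envelope capacity of the full-crawl ring, under assumed numbers.
CRAWL_WINDOW_HOURS = 12
MINUTES_PER_PAGE = 5        # assumption: one page processed every 5 minutes
PRODUCTS_PER_PAGE = 100     # assumption: page size of the paged API

pages_per_window = (CRAWL_WINDOW_HOURS * 60) // MINUTES_PER_PAGE  # 144 pages
max_products = pages_per_window * PRODUCTS_PER_PAGE               # 14,400 ≈ 15,000
print(max_products)
```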

There are possibilities here. I could start sharding Bezos' system according to a particular dimension (e.g. seller or destination warehouses), I could only check the products which have orders waiting to be fulfilled, or any number of other solutions. But that would be solving scaling problems I don't have.

Why would that be bad? Well, look at the list of things we discovered. If we'd tried to build a 10k or 100k system from the off, we'd have overestimated the importance of solving the warehouse side of things, underestimated how important "instant"-feeling sync is for our users, failed to realise just how slow the list endpoint is when there are a lot of products, and spent a lot of time worrying about product deletion and transferring unnecessary data while ignoring the real issue of products getting inadvertently edited in the WMS. Our "100k" system would still have choked at 2,000 products, because we simply didn't understand the real scaling bottlenecks until we got close to them.

Similarly, if we try to build a "1m" or "10m" system now, the odds are we'll discover something unexpected that causes it to hit its limits somewhere on the road to 100k. We'd end up spending more time to get something that takes us no further, and which would then be more complicated to tear apart and fix when it breaks down.

In the hard-headed commercial sense, I'm focusing on what sustains growth now so I can move on to the next growth-sustaining thing. If I spend all my time on future scaling problems, then I'm not building the things which will grow us to the point those problems become reality.

Scaling Problems: Fractal and Low-Level

In getting the "1k -> 10k" solution to production, I ended up encountering a bunch of 0->1 and 1->2 problems. Here are the highlights:

  • Publishing to SNS/SQS requires a JSON roundtrip, and consumers might consume more than one entity type. Building a nice library that allowed our code to do this with arbitrary entities (and have them survive the trip!) was a bigger task than I thought, and my first approach wasn't clean and needed rewriting. This is a 0->1 problem - "first time we've tried to push arbitrary data over a topic/queue structure". There's a sketch of the envelope idea after this list.
  • All the queues, topics and necessary permissions are a big chunk of infrastructure code which takes a while to deploy. Again, a 0->1 problem, and one I'm glad I didn't have earlier on when I'd have been spending time building and debugging something I didn't need yet, with a painfully slow feedback loop.
  • SQS is surprisingly intolerant of driving a lambda which can throttle, at least with default settings. This was a classic 1->2 problem: one message processes fine, but when we send a second it ends up on the dead letter queue. The quick workaround is to increase the batch size so more messages are consumed from the queue rather than hanging around, and to increase the SQS visibility timeout and max receive count (the relevant settings are sketched below). Increasing lambda concurrency to 5 helps a lot if the downstream systems can cope with that level of parallelisation, as 5 is the default number of concurrent polling operations Lambda makes against an SQS queue. An SQS FIFO queue looks like a more robust solution, as it prevents messages bouncing back to SQS and racking up the receive count while all provisioned lambdas are busy.
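
For the first of those, the core idea is a small envelope that records the entity type alongside the serialised payload, so a consumer knows what to rebuild on the other side. A minimal sketch (our real library does more, and these names are illustrative):

```python
import json
from dataclasses import dataclass, asdict

# Registry of entity types that may travel over the topic/queue.
@dataclass
class Product:
    sku: str
    name: str

ENTITY_TYPES = {"Product": Product}

def to_message(entity) -> str:
    """Wrap an entity in an envelope recording its type."""
    return json.dumps({"type": type(entity).__name__, "payload": asdict(entity)})

def from_message(message: str):
    """Rebuild the original entity from the envelope."""
    envelope = json.loads(message)
    return ENTITY_TYPES[envelope["type"]](**envelope["payload"])
```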

What's worth noting is that while queues-and-consumers is a classic whiteboard exercise that we think of as a "clean" solution, implementing it in the real world requires quite a bit of low-level configuration effort, plus careful monitoring of things like dead letter queues to catch dropped messages. A danger of doing this kind of building for scale too early is that we wouldn't have had enough volume to spot those problems, and they would have crept in over time rather than being obvious from the start.
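
The SQS/Lambda tuning from that last bullet boils down to a handful of settings on the queue and the event source mapping. A hedged sketch with boto3 (the queue URL, ARNs, names and numbers are all illustrative; tune them against your own workload and the current AWS documentation):

```python
import json
import boto3

sqs = boto3.client("sqs")
lam = boto3.client("lambda")

# Queue-side settings: keep messages invisible long enough for a slow
# invocation, and allow a few bounces before they hit the dead letter queue.
sqs.set_queue_attributes(
    QueueUrl="https://sqs.eu-west-1.amazonaws.com/123456789012/product-updates",  # illustrative
    Attributes={
        "VisibilityTimeout": "900",  # comfortably longer than the lambda timeout
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": "arn:aws:sqs:eu-west-1:123456789012:product-updates-dlq",
            "maxReceiveCount": "5",
        }),
    },
)

# Consume bigger batches per invocation rather than leaving messages to bounce.
lam.update_event_source_mapping(
    UUID="replace-with-your-event-source-mapping-uuid",
    BatchSize=10,
)

# Give the function enough reserved concurrency to match SQS's default pollers.
lam.put_function_concurrency(
    FunctionName="product-updater",  # illustrative function name
    ReservedConcurrentExecutions=5,
)
```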

Conclusion

I wanted to write my thoughts here as an example of the kind of real-world problems you see outside of aspirational presentations and interview exercises. If I had to list out some kind of "what I'd like people to take from this", it would be:

  • Solve the problems you have now, not notional future ones.
  • 0->1 and 1->2 are the most common hard scaling problems.
  • You learn most about where and how a system needs to scale by operating it.
  • If you try to solve too much, too early, you won't get the right solution.
  • Even the cleanest design has low-level implementation detail, often unexpected.

Oh, and it's great when people completely outside the engineering team tell us, "I love the new product syncer, it's so fast". That kind of feedback is validation that you're on the right track!