Cloud, PaaS, solution size and lock-in
I have a bit of a counter-intuitive opinion on cloud services and the Platform-as-a-Service (PaaS) world. For me, something like AWS isn't about "big". Yes, I know plenty of companies use elastic scaling to run huge, demand-based infrastructures along the Netflix model. But for the rest of us, we're not building Netflix. We're building simple CRUD-based apps that barely trouble the hardware they sit upon. I've got websites where even peak throughput isn't anywhere near the number of requests I could serve from a dual-core laptop running a debug build.
Cloud for me is about quick bootstrap times for projects with small requirements. I'd like to be in the position where I'm using the infrastructure's scalability, but usually by the time I've got enough instances to have my application balanced across multiple availability zones and resilient, I've got way more compute capacity than I'm ever going to need, even for peak loads. Most of the time even the smallest burstable instances are ticking along at 1% CPU - the only thing I'd really use a scaling group for is to replace an instance on the fly if it suddenly falls over.
What I am likely to do, though, is build applications which may or may not be successful. It's not always easy to predict what will resonate with the market before your first trial release, especially in business-to-consumer propositions. So if I have an option that lets me deliver 2 or 3 applications in the time I'd normally take to produce one, I'm going to take up that option. It's doubled or tripled my chances of success for no significant disadvantage. Because here's the thing - up front, I don't really care about PaaS lock-in.
The standing argument here is, "but what if you're pushing 10,000 requests a second through this and Amazon suddenly double their charges?" What this misses is the context: having to back out of a PaaS solution like Kinesis, or SQS, or Azure Service Bus because it's not cost-effective is a nice problem to have. It means you've got something successful. People are using it, and as a result you should have enough income to invest in a proper, gold-standard solution to the problem.
The reality is you're far more likely to be sending maybe a few dozen requests per hour (maybe even only per day) through this and a doubling of the price is the difference between $0.01 and $0.02 per month. At this point any increase is going to be utterly dwarfed by the extra cost of rolling your own solution, especially if it needs instances to run on or someone to support it.
The other problem with rolling your own is you're not going to get it right on the first try. Take message queues as an example. I personally love Kafka, but there are plenty of war stories about newbies inadvertently creating two ZooKeeper nodes with the same ID, or letting their consumers lose track of their position in the queue, and taking down production systems as a result. Things get somewhat worse when you start talking about systems which don't offer the same robust guarantees as Kafka, and they rapidly become a horrible mire when you build your own queuing system while thinking "consensus protocol" is something that happens in parliament.
The other reason I don't care too much about lock-in is because I don't want to design my systems in that way. This is the dependency inversion principle at work, with a dash of Liskov: my software doesn't couple itself to SQS or Kafka or whatever else. It simply asks, "give me something that has queuing semantics" and I only have one small corner which takes that abstraction and makes a concrete association with the underlying system. Hell, most of the time I'll start out with a rudimentary in-memory implementation, so I can get things up and running quickly and understand what I actually need from my queue, key-value store and whatever else. Sometimes this attitude will involve a little bit of extra work to make sure there's a decent amount of buffering between my code and my external services, but if it helps me back out of things later it's worthwhile.
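As a sketch of what that "give me something that has queuing semantics" abstraction might look like (the names here are illustrative, not from any real codebase): the application depends only on the abstract interface, and the rudimentary in-memory implementation is enough to get things running before any SQS- or Kafka-backed adapter exists.

```python
from abc import ABC, abstractmethod
from collections import deque
from typing import Optional


class MessageQueue(ABC):
    """The only contract the rest of the application sees:
    'something that has queuing semantics'."""

    @abstractmethod
    def send(self, message: str) -> None: ...

    @abstractmethod
    def receive(self) -> Optional[str]: ...


class InMemoryQueue(MessageQueue):
    """Rudimentary in-memory implementation: enough to get the app
    up and running and discover what we actually need from a queue."""

    def __init__(self) -> None:
        self._messages = deque()

    def send(self, message: str) -> None:
        self._messages.append(message)

    def receive(self) -> Optional[str]:
        # Non-blocking: hand back the oldest message, or None if empty.
        return self._messages.popleft() if self._messages else None


# Application code couples only to the abstraction; swapping in an
# SQS- or Kafka-backed implementation later touches one small corner.
def enqueue_welcome(queue: MessageQueue, email: str) -> None:
    queue.send(f"welcome-email:{email}")
```

The concrete association with the underlying system lives in whatever factory or configuration wires up the `MessageQueue` instance, so backing out of a provider means replacing one class, not hunting through the codebase.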
There is a caveat here which is that while backing out of ephemeral PaaS (queues, e-mail services, etc.) is easy, backing out of persistent PaaS (databases and file stores) is trickier. It's not impossible, but you need to factor in being able to get your data out of the thing you just put it into. Again, a little bit of upfront thinking here can save you a lot of pain later.
The final thing to mention here is cost. Too many people forget this. PaaS typically has a per-use or per-call billing structure. If your numbers don't stack up for a single customer, they're not going to get any smarter when you have a million customers. It's not like traditional servers where you get beyond a certain point and every additional customer is pure profit (plus or minus bandwidth overage) - your numbers need to work. This means doing the maths and figuring out that someone pays £2.99 a month, and they'll have 20 interactions with this which generates 100 messages on that and half a dozen e-mails over there... and at the end of it we still take a decent profit even if they go a bit mad and triple their activity.
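The back-of-envelope version of that calculation might look like this - the unit prices below are made up for illustration, so substitute your provider's real rates:

```python
# Hypothetical per-unit PaaS prices (illustrative, not real rates).
PRICE_PER_MESSAGE = 0.0000004   # e.g. roughly $0.40 per million queue messages
PRICE_PER_EMAIL = 0.0001        # e.g. roughly $0.10 per thousand e-mails


def monthly_cost_per_customer(interactions: int,
                              messages_per_interaction: int,
                              emails: int) -> float:
    """PaaS spend attributable to a single customer per month."""
    messages = interactions * messages_per_interaction
    return messages * PRICE_PER_MESSAGE + emails * PRICE_PER_EMAIL


subscription = 2.99  # what the customer pays per month

# 20 interactions generating ~100 messages, plus half a dozen e-mails.
cost = monthly_cost_per_customer(interactions=20,
                                 messages_per_interaction=5,
                                 emails=6)

# Sanity check: still comfortably profitable even if the customer
# goes a bit mad and triples their activity.
assert 3 * cost < subscription
```

The point isn't the specific numbers - it's that per-use billing means this check has to pass for one customer, because a million customers just multiplies both sides.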
So, in conclusion:
- PaaS, for me, is about the small end - being able to bootstrap quickly so my company can iterate through ideas.
- Design software so you can swap bits of it out at will. This is good practice generally, but it's critical for cloud services where the chances you'll want to back out are higher.
- You can defer doing the hard work. Effectively, this is the "nice problem to have" - if you're spending enough on a PaaS message queue that you need to roll your own, that means you're sending a lot of messages!
- You've got to get the numbers right. This isn't using spare capacity on a server that's already paid for - everything you do is costing you money.
Get this right and you'll find you have a better time with cloud services - and remember that a lot of this is down to your particular scenario. If it's not saving you time, money or both, then it's not worthwhile.