FWIW, I came to a similar conclusion: blue print, slides. That does mean you are responsible for your own backups and disaster recovery, but like you, I found that to be the right trade-off. I figure that a machine becoming inaccessible for any reason is a rare event, and I can redeploy a customer to a new machine with their data in a matter of minutes. If that were to become a problem, I could reduce that to seconds, but so far it hasn’t been one.
I gather that you are considering one app per customer, each with one machine. I went with one app, with one machine per group of events, where a customer may have multiple events. When a request arrives at the wrong machine, I either replay it or reverse proxy it to the right machine.
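To make the replay option concrete, here is a minimal sketch of the routing decision. It is not my actual code: the `EVENT_REGION` table, the event names, and the `route` function are hypothetical, and the 409 status is an arbitrary choice for the sketch. The only Fly.io-specific part is the `fly-replay` response header, which asks the Fly proxy to replay the request elsewhere.

```python
from typing import Dict, Optional, Tuple

# Hypothetical table: which region's machine hosts each event.
EVENT_REGION: Dict[str, str] = {
    "boston-showcase": "bos",
    "paris-showcase": "cdg",
}


def route(event: str, current_region: str) -> Tuple[int, Dict[str, str]]:
    """Return (status, headers) for a request addressed to `event`
    that landed on the machine running in `current_region`."""
    target: Optional[str] = EVENT_REGION.get(event)
    if target is None:
        # Unknown event: nothing to serve and nowhere to replay to.
        return 404, {}
    if target == current_region:
        # Already on the right machine: handle the request locally.
        return 200, {}
    # Wrong machine: ask the Fly proxy to replay the request in the
    # region that hosts this event.
    return 409, {"fly-replay": f"region={target}"}
```

In a real app, `current_region` would likely come from the `FLY_REGION` environment variable that Fly sets on each machine. Reverse proxying is the fallback for requests that can't simply be replayed.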
I limit myself to one machine per region with the following simple but effective code: showcase/bin/deploy at 1f9eeb3a857e5797fab3b7164307e7c514e89e24 · rubys/showcase · GitHub
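The idea behind that limit can be sketched in a few lines. This is an illustration, not the contents of the linked script: `extra_machines` is a hypothetical helper, and it assumes you already have a list of machine records with `id` and `region` fields (for example, parsed from a JSON machine listing).

```python
from typing import Dict, List


def extra_machines(machines: List[Dict[str, str]]) -> List[str]:
    """Return the ids of machines beyond the first one seen in each region."""
    seen_regions = set()
    extras: List[str] = []
    for machine in machines:
        if machine["region"] in seen_regions:
            # This region already has a machine, so this one is surplus.
            extras.append(machine["id"])
        else:
            seen_regions.add(machine["region"])
    return extras
```

A deploy step could then destroy, or simply refuse to create, anything this returns, which keeps each region at exactly one machine.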
In my case, it is one machine per region. That’s because I’m an order of magnitude smaller than you, and I host a small number of geographically close customers (each with potentially multiple events) together on one machine, with one running instance of my app per event.
But you could do the same thing with Dynamic Machine Metadata. Or, if you go with one app per customer, you won’t have to worry about metadata or even regions.