Downtime started after I deployed, and the app stayed down through repeated re-deploys.
Throughout the downtime I periodically tried re-deploying or changing regions. Moving from sea to lax eventually brought it back, but that required manual intervention.
We’ll get more details once we’ve completely recovered. Everything should be good today, but we’re still digging to make sure we’ve fully diagnosed the problem. At the moment, we’re primarily focusing communications on customers with premium email support.
This specific failure would be hard to prevent. Our gossip-based service discovery had issues propagating information after deploys. Apps that weren’t deployed continued to work fine, but some percentage of deploys corrupted the deployed app’s service data.
The new Machines-based apps we’re shipping will be more resilient to this kind of problem, since deploys won’t churn service discovery data, but they’re not a complete fix.