The application was only running in the iad region, which is probably why it went down.
That being said, it would be nice to have advance notice of “scheduled maintenance” before an application actually goes down.
What also worries me is that the issue on the status page started at 00:07 CST, but the increased latency actually started more than an hour earlier, at 22:50 CST.
Can you confirm that simply scaling the application to 3 or more instances across different regions would have avoided this issue? Will the fly proxy automatically stop routing requests to instances that are down or showing increased latency?
We do our best to post information about maintenance events scheduled by our upstream datacenter providers whenever we think there will be any potential impact, and we actively monitor several status feeds to make sure we’re covered.
This particular maintenance event slipped through unnoticed until I was alerted to a performance issue in our API, and I updated the status page with information about the ongoing maintenance as soon as I discovered it. I agree that advance notice would have been better; apologies for the inconvenience.
Yes, scaling an application across different regions should help avoid issues like this one that impact a single region. The fly proxy will do a good job of routing requests to instances that are passing health checks and are not overloaded with requests.
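
To make the routing idea above concrete, here is a toy Python model of it. This is only a sketch and not the fly proxy's actual implementation: the `Instance` fields, the `soft_limit` value, and the "prefer the client's region, then the least loaded instance" tie-break are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instance:
    region: str      # e.g. "iad"
    healthy: bool    # result of the most recent health check (assumed field)
    in_flight: int   # requests currently being served (assumed field)
    soft_limit: int  # hypothetical per-instance concurrency limit


def pick_instance(instances: List[Instance], client_region: str) -> Optional[Instance]:
    """Pick an instance that passes health checks and is not overloaded.

    Returns None when no instance qualifies, which is the situation a
    single-region app is in while its only region is degraded.
    """
    candidates = [
        i for i in instances
        if i.healthy and i.in_flight < i.soft_limit
    ]
    if not candidates:
        return None
    # Assumed tie-break: same-region instances first, then fewest in-flight requests.
    return min(candidates, key=lambda i: (i.region != client_region, i.in_flight))


if __name__ == "__main__":
    multi_region = [
        Instance("iad", healthy=False, in_flight=0, soft_limit=20),
        Instance("ord", healthy=True, in_flight=5, soft_limit=20),
        Instance("lhr", healthy=True, in_flight=2, soft_limit=20),
    ]
    single_region = [Instance("iad", healthy=False, in_flight=0, soft_limit=20)]

    print(pick_instance(multi_region, "iad"))   # a healthy instance outside iad
    print(pick_instance(single_region, "iad"))  # None: nothing left to route to
```

In this model, an app with instances in three regions still has somewhere to send traffic when iad fails its health checks, while the single-region app has no candidates left.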