We’ve updated our status page, but want to make sure everyone is aware of an upcoming maintenance window.
One of our upstream providers in the SEA region will be performing emergency maintenance today, Monday, March 20th, 2023 at 22:45 UTC. The maintenance window is 5 hours. We are adding server capacity elsewhere in the SEA region, but Machines and Apps with volumes will not be automatically migrated and will experience a service interruption during this window.
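For anyone who wants to line the window up against their own logs, here is a minimal Python sketch of the arithmetic. It assumes the month is March (inferred from the 2023-03-20 timestamp quoted later in this thread) and takes the full 5-hour window at face value; the names `WINDOW_START`, `WINDOW_END`, and `in_window` are illustrative only, not part of any tooling.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the month is March, inferred from the 2023-03-20 log
# timestamp quoted later in the thread.
WINDOW_START = datetime(2023, 3, 20, 22, 45, tzinfo=timezone.utc)
WINDOW_LENGTH = timedelta(hours=5)  # the communicated worst case
WINDOW_END = WINDOW_START + WINDOW_LENGTH


def in_window(ts: datetime) -> bool:
    """Return True if ts falls inside the announced window.

    The start is inclusive and the end is exclusive.
    """
    return WINDOW_START <= ts < WINDOW_END


print(WINDOW_END.isoformat())  # 2023-03-21T03:45:00+00:00
```

So the worst case runs until 03:45 UTC on the 21st. Errors logged before 22:45 UTC on the 20th fall outside the announced window.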
Based on discussion with the provider, we’re hopeful the communicated 5-hour window is overly cautious. It’s likely we’re only going to lose this subset of hosts for less than an hour. Whether it’s 5 minutes or 5 hours, we know this sucks and we’re taking steps to make it less painful.
Right now we’re adding capacity in a separate SEA datacenter so that we can drain applications from the affected servers and bring them back on new servers. We don’t have a way to do this automatically for Machines and Apps that use volumes yet. Around 30% of the applications in this region have volumes attached and are likely to be impacted.
Since this is the forum, we can speculate a little. We got a heads up about emergency power maintenance a few hours ago. There are some interesting details here: first, the generators aren’t going to kick in. Whatever they need to fix is between the generators and the actual hardware.
Given the huge disruption (this is not a tiny facility, this is wildly disruptive for everyone using it), we think they failed an IR test or some other diagnostic that made them think they’re at high risk of a fire. A five-hour maintenance window is probably preferable to a fire, all things considered.
Thanks for the update. If my only Postgres volumes for a very low traffic app were in the SEA region and my Phoenix app failed over to SJC but none of my volumes can be connected to, would it help to try to restore in a different region from a snapshot or would the snapshots also be in SEA?
Realtime edit: My app stopped throwing Postgrex connection errors at 2023-03-20T21:29:44Z so maybe it is fixed?
Just saw the status flash in the dashboard: “We’re addressing an incident that affects one or more of your apps.” Is this generic messaging, or does it really apply to some of the apps? AFAIK I’m only using the ORD region and was surprised to see SEA. The same status update is listed on the app page, which has nothing to do with the SEA region.
You’re right, that language is a bit confusing. Incidents on our public status page (https://status.flyio.net/) are appearing in the UI now.
The maintenance incident you’re referring to was specific to one region. If you’re in ORD only, then you’re not affected, and that message is probably causing more confusion than it’s solving. We’ll iterate on some ideas to make that messaging clearer, especially for regional issues.