It has now started working. According to my monitoring, it was down for about 35 minutes, roughly starting at the time I deployed a new version. I can’t see anything relevant in logs.
Any suggestions of what went on, or how to investigate further would be greatly appreciated - obviously this is quite concerning!
This was likely caused by our new state propagation infrastructure. It’s still early days with these changes, but it looks better since a few days ago.
In future do you have suggestions for debugging an issue like this, or checking status of fly services? It’s quite alarming having a production outage with no way to diagnose, or a way to confirm it’s an external problem.