I have two public-facing apps running on Fly.io (web apps plus REST APIs for mobile apps).
Yesterday, one of them started going down, answering fewer and fewer calls. Then the second went down the same way, for about 30 minutes. Meanwhile, the Fly.io dashboard was just as unstable as my apps. For a while, restarting machines from the dashboard (when I could reach the Machines page) seemed to help a bit.
Then, all of a sudden, everything came back online. Good. However, no issue was ever mentioned on the status page or in an email. All I got was an email today about upcoming maintenance in CDG (where each of my apps runs one instance), which sounds like a crazy coincidence, but who knows.
The most concerning part is that the exact same scenario just happened again.
May I please have feedback about the two major outages happening in just two days (and affecting the Fly.io dashboard too, at least when reaching it from France!), to reassure me it won’t happen again tomorrow?
My apps run in the CDG and SJC regions. I have been using Fly.io for years and had never experienced anything like this.
PS: My app logs didn’t show any errors. There were simply far fewer log lines all of a sudden, because requests stopped reaching the servers while they were down. The only information I could get came from client-side logs, which reported “OS Error: Connection reset by peer” along with a port number that kept increasing with each error (e.g. 51225, then 51244).
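
In case it helps with diagnosis, here is a minimal sketch of an external probe that could be run from outside Fly.io during the next incident, to get timestamped evidence of which regions are unreachable. It only tests whether a TCP connection can be opened, so it would not catch resets that happen later during TLS or HTTP. The function name `probe` and the hostname in the usage note are hypothetical. It also records the local port, on the assumption that the increasing port number in my client logs is simply the client's ephemeral source port (the OS picks a new one for each connection) rather than a symptom in itself.

```python
import socket
import time


def probe(host, port, timeout=5.0):
    """Attempt one TCP connection and report what happened.

    Returns a dict with the outcome, the local (ephemeral) port used
    by this client, and any OS-level error message.
    """
    started = time.time()
    result = {
        "timestamp": started,
        "host": host,
        "port": port,
        "ok": False,
        "local_port": None,
        "error": None,
    }
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            # getsockname() returns the client-side address; for IPv4 and
            # IPv6 the second element is the local port chosen by the OS.
            result["local_port"] = sock.getsockname()[1]
            result["ok"] = True
    except OSError as exc:
        # Covers refused connections, resets, timeouts and DNS failures.
        result["error"] = f"{type(exc).__name__}: {exc}"
    result["elapsed_s"] = round(time.time() - started, 3)
    return result
```

Calling something like `probe("my-app.example", 443)` in a loop every few seconds, and saving the returned dicts, should give a timeline that can be compared against the status page.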

