For the past few months we’ve been getting intermittent downtime.
We have 3 or so services sitting behind an nginx proxy on Fly. Every now and then, they all become unavailable at the same time, which leads me to believe the problem has something to do with the proxy on Fly.
I don’t know enough about what the metrics mean, but is there anything obvious in these stats that looks like it could point to an issue?
Can you tell me more about what kind of downtime you’re seeing? Are you running an automated test that’s showing downtime, or seeing something else?
One way to detect whether an issue is on our end or your end is to also monitor debug.fly.dev. If it’s just one app throwing errors, it’s probably not our infrastructure causing it.
Also, how many VMs are you running? Does fly status --all show any failed VMs? Occasionally errors like this are caused by a single VM crashing and restarting.
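To make the debug.fly.dev comparison concrete, here is a minimal sketch of a side-by-side check in Python. It is only an illustration, not an official Fly tool: APP_URL is a hypothetical placeholder for your own app, the function names are made up for this example, and treating any response below 500 as "up" is an assumption you may want to tighten.

```python
import urllib.error
import urllib.request

REFERENCE_URL = "https://debug.fly.dev"  # reference endpoint mentioned above
APP_URL = "https://your-app.example"     # hypothetical placeholder: use your own app's URL


def probe(url, timeout=5.0):
    """Return True if the URL answers with a status below 500 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx responses; only 5xx counts as "down" here (assumption).
        return err.code < 500
    except OSError:
        # Covers URLError, DNS failures, refused connections, and timeouts.
        return False


def classify(app_ok, reference_ok):
    """Turn the two probe results into a rough hint about where the fault lies."""
    if app_ok and reference_ok:
        return "ok"
    if reference_ok:
        return "app only"        # probably not the platform
    if app_ok:
        return "reference only"  # the reference endpoint itself had trouble
    return "both"                # platform issue, or the monitor's own network


def check_once(app_url=APP_URL, reference_url=REFERENCE_URL, fetch=probe):
    """Probe both URLs once and classify the result."""
    return classify(fetch(app_url), fetch(reference_url))
```

Calling check_once() on a timer (say, every 30 seconds) and logging the result with a timestamp gives you a record to compare against the outage windows. A "both" result is ambiguous, since the machine running the check may have lost its own connection, so it helps to run the check from somewhere outside Fly.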