SSH Connection issues. 11h and counting ... + Uptime Metrics

johannes · June 23, 2025, 7:55pm

Hey there,

Since this morning, I haven’t been able to connect to most of my apps. I’m seeing a variety of errors throughout the day, for example:

fly ssh console -a xxx
Connecting to tunnel 🌏Error: tunnel unavailable: Error contacting Fly.io API when probing "xxx": timed out (context deadline exceeded)

fly wireguard create
Error: Post https://api.fly.io/graphql: EOF

fly ssh console -a xxx
Error: ssh: can't build tunnel for patch-work: server returned a non-200 status code: 503

fly ssh console -a xxx
Connecting to xxxxxxx... complete
Error: error connecting to SSH server: dial: connect tcp [xxxxx]:22: operation timed out

Error: ssh: can't build tunnel for xxx: Post "https://api.fly.io/graphql": read tcp [xxx:64749->[xxxx]:443: read: no route to host

Are all of these issues related to the ongoing GraphQL API problems?

Additionally, I’d like to better understand how uptime is defined on the status page. It currently shows values like 100% for June and 99.98% for May. However, I’ve personally experienced several hours of downtime every month. For example, I haven’t been able to access one of my apps for over 11 hours today, which on its own puts this month’s uptime for that app at roughly 98.47% or lower.
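For reference, the arithmetic behind that figure can be sketched as below. It assumes a 30-day month (June) and counts the full 11 hours as downtime; the function names are illustrative only and have nothing to do with how the status page computes its numbers.

```python
def uptime_percent(downtime_hours: float, days_in_month: int) -> float:
    """Return uptime as a percentage of the month's total hours."""
    total_hours = days_in_month * 24
    if not 0 <= downtime_hours <= total_hours:
        raise ValueError("downtime must be between 0 and the month's length")
    return 100.0 * (1.0 - downtime_hours / total_hours)


def allowed_downtime_minutes(uptime_pct: float, days_in_month: int) -> float:
    """Return how many minutes of downtime a given uptime percentage permits."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1.0 - uptime_pct / 100.0)


# 11 hours out of June's 720 hours.
print(f"{uptime_percent(11, 30):.2f}%")  # prints 98.47%

# 99.98% over May's 31 days leaves room for only about 9 minutes.
print(f"{allowed_downtime_minutes(99.98, 31):.1f} min")  # prints 8.9 min
```

Put differently, a 99.98% month leaves room for less than ten minutes of downtime, so an outage lasting hours cannot be reflected in that number.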

Could you clarify how uptime is calculated for Fly.io deployments? Does it include user-facing service accessibility, or is it measured differently?

Thanks in advance.

halfer · June 23, 2025, 8:06pm

What region are you in? Are you still having problems right now? I ran fly ssh console against a London machine within the last two minutes and it connected fine. I am wondering if there is an issue with your machine that has upset its networking stack.

(Yes, it could also be Fly, but I imagine an 11-hour networking outage would have alerted engineers if it were widespread.)

Could you spin up a new machine in the same region and see if you can get a console on that?

johannes · June 23, 2025, 8:14pm

The one that’s been unreachable since this morning is in the AMS region.

Just tried again and got a new error:
Error: ssh: can't build tunnel for xxx: Post "https://api.fly.io/graphql": read tcp [xxx:64749->[xxxx]:443: read: no route to host

Yes, I can create new apps in AMS, but I cannot access this existing app.

halfer · June 23, 2025, 8:25pm

Righto. Is your app alive at all? If the OS has completely crashed, then it could be that it cannot start up an SSH listener. Can you reboot it via the web console?

mayailurus · June 23, 2025, 10:26pm

My understanding is that those numbers are just artifacts of the third-party status portal software—and not something that Fly.io intends you to take very seriously. (I don’t speak for Fly.io at all, but this is based on what they’ve said themselves in the past.)

Basically, what they do want from the status page is a real-time messaging box, with some ability to view previous messages. These are written in the heat of the moment—while people are still heads-down in debugging and bringing systems back to life—and no one is going to completely nail the exact wording, beginning and ending times, particular list of affected subcomponents, etc., under such circumstances.

For now, the only way to get an idea of overall platform reliability is qualitatively, by reading the prose in the Infrastructure Log. In contrast to the status page messages, the log is written in the calm and reflection of a week or more’s distance after the event, having collated everyone’s views, examined internal logs and monitoring feeds, thought extensively about who would have been affected and how, and then assembled a consensus plan for avoiding such problems in the future.

Some day, this will be complemented by an actual, credible numerical graph, based on automatically collected and reported, objective system measurements. The new (and excellent) capacity API is a small step in that direction. (Although obviously some generous soul would need to set up a Machine to poll it every 10 minutes, so there’s readily available history on tap.)
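As a rough idea of what such a poller could look like, here is a minimal sketch. It deliberately does not call the capacity API itself: `fetch` is a hypothetical callable that you would write to perform the actual request, and `record_samples`, the file path, and the record layout are all assumptions made for illustration.

```python
import json
import time
from typing import Callable


def record_samples(
    fetch: Callable[[], dict],
    path: str,
    count: int,
    interval_s: float = 600.0,  # 10 minutes between polls
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call fetch() `count` times, appending each result to a JSON Lines file."""
    written = 0
    with open(path, "a", encoding="utf-8") as out:
        for i in range(count):
            try:
                sample = {"ts": time.time(), "ok": True, "data": fetch()}
            except Exception as exc:
                # A failed poll is itself a data point worth keeping.
                sample = {"ts": time.time(), "ok": False, "error": str(exc)}
            out.write(json.dumps(sample) + "\n")
            out.flush()
            written += 1
            if i < count - 1:
                sleep(interval_s)
    return written
```

Appending one JSON object per line keeps the history easy to inspect later, and recording failed polls alongside successful ones means gaps in availability show up in the data instead of disappearing from it.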

No one is super-happy with the status quo, but hopefully this clears things up a bit!

johannes · June 23, 2025, 11:48pm

Yes, the app is live. No errors there at all; only SSH is not working.

@mayailurus thank you for this detailed response and for sharing the infra log. I wasn’t aware of this before.

johannes · June 25, 2025, 10:53am

Still can’t connect via SSH. Tried from multiple devices and VPN connections.
Are SSH issues Fly’s responsibility or mine?

Connecting to xxx:2... complete
Error: error connecting to SSH server: dial: connect tcp [xxx:2]:22: operation timed out

fly doctor -a xxxx
Testing authentication token... PASSED
Testing flyctl agent... PASSED
Testing local Docker instance... Nope
Pinging WireGuard gateway (give us a sec)... PASSED
Testing WireGuard DNS... PASSED
Testing WireGuard Flaps... PASSED

App specific checks for xxxx:
Checking that app has ip addresses allocated... PASSED
Checking A record for xxxx.fly.dev... PASSED
Checking AAAA record for xxxx.fly.dev... PASSED

Build checks for xxxx:
Checking docker context size (this may take little bit)... PASSED (17 MB)
Checking for .dockerignore... PASSED

What I have tried so far:

flyctl wireguard list
flyctl wireguard reset
flyctl wireguard remove

fly wireguard websockets disable
fly wireguard websockets enable

flyctl auth logout
flyctl auth login

fly agent restart
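One more check that may help narrow down the "operation timed out" error on port 22: a timeout (packets silently dropped or no route) is a different symptom from a refused connection (host reachable, nothing listening). The sketch below tells the two apart. It is only a generic TCP probe, not part of flyctl, and it assumes the machine running it has an OS-level route to the app’s private address, for example through a WireGuard peer set up with fly wireguard create; `probe_tcp` is a made-up name.

```python
import socket


def probe_tcp(host: str, port: int = 22, timeout: float = 5.0) -> str:
    """Classify a TCP connection attempt as open, refused, timeout, or error."""
    try:
        # create_connection resolves the host and handles IPv4 and IPv6.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except (socket.timeout, TimeoutError):
        # No answer at all within the timeout.
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. "no route to host" or a DNS failure.
        return f"error: {exc}"


# Example call (replace the placeholder with the address flyctl printed):
# print(probe_tcp("<private IPv6 address>", 22))
```

A result of "timeout" or a "no route to host" error points at the network path or the tunnel, while "refused" would suggest the machine is reachable but its SSH listener is not running.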

halfer · June 25, 2025, 9:42pm

I’d regard this as sufficiently like an “accounts problem” that you could email Fly on their billing alias.

As for whose responsibility SSH is: it’s Fly’s; it is baked into the platform.
