Since this morning, I haven’t been able to connect to most of my apps. I’m seeing a variety of errors throughout the day, for example:
fly ssh console -a xxx
Connecting to tunnel 🌏Error: tunnel unavailable: Error contacting Fly.io API when probing "xxx": timed out (context deadline exceeded)
fly wireguard create
Error: Post https://api.fly.io/graphql: EOF
fly ssh console -a xxx
Error: ssh: can't build tunnel for patch-work: server returned a non-200 status code: 503
fly ssh console -a xxx
Connecting to xxxxxxx... complete
Error: error connecting to SSH server: dial: connect tcp [xxxxx]:22: operation timed out
Error: ssh: can't build tunnel for xxx: Post "https://api.fly.io/graphql": read tcp [xxx:64749->[xxxx]:443: read: no route to host
Are all of these issues related to the ongoing GraphQL API problems?
Additionally, I’d like to better understand how uptime is defined on the status page. It currently shows values like 100% for June and 99.98% for May. However, I’ve personally experienced several hours of downtime every month; for example, I haven’t been able to access one of my apps for over 11 hours today, which on its own puts that app’s uptime for the month below 98.47%.
Could you clarify how uptime is calculated for Fly.io deployments? Does it include user-facing service accessibility, or is it measured differently?
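For reference, the 98.47% figure above assumes a 30-day month and counts only today's 11 hours. A minimal sketch of that arithmetic follows; the function names are illustrative, and nothing here reflects how the status page actually computes its numbers.

```python
def uptime_percent(downtime_hours: float, days_in_month: int) -> float:
    """Return uptime as a percentage of the month, given total hours of downtime."""
    total_hours = days_in_month * 24
    return 100.0 * (1.0 - downtime_hours / total_hours)


def downtime_budget_minutes(uptime_pct: float, days_in_month: int) -> float:
    """Return how many minutes of downtime a given uptime percentage allows."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1.0 - uptime_pct / 100.0)


# 11 hours down in a 30-day month (720 hours)
print(round(uptime_percent(11, 30), 2))              # 98.47
# Downtime implied by 99.98% over a 31-day month, in minutes
print(round(downtime_budget_minutes(99.98, 31), 1))  # 8.9
```

In other words, a reported 99.98% for May would correspond to roughly nine minutes of downtime in total, which is hard to square with several hours of inaccessibility for a single app.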
What region are you in? Are you still having problems right now? I was able to run fly ssh console into a London machine within the last two minutes, so I am wondering if there is an issue with your machine that has upset the networking stack.
(Yes, it could also be Fly, but I imagine an 11-hour networking outage would have alerted engineers if it was widespread.)
Could you spin up a new machine in the same region and see if you can get a console on that?
The one that’s been unreachable since this morning is in the AMS region.
Just tried again and got a new error: Error: ssh: can't build tunnel for xxx: Post "https://api.fly.io/graphql": read tcp [xxx:64749->[xxxx]:443: read: no route to host
Yes, I can create new apps in AMS, but I cannot access this existing app.
Righto. Is your app alive at all? If the OS has completely crashed, then it could be that it cannot start up an SSH listener. Can you reboot it via the web console?
My understanding is that those numbers are just artifacts of the third-party status portal software—and not something that Fly.io intends you to take very seriously. (I don’t speak for Fly.io at all, but this is based on what they’ve said themselves in the past.)
Basically, what they do want from the status page is a real-time messaging box, with some ability to view previous messages. These are written in the heat of the moment—while people are still heads-down in debugging and bringing systems back to life—and no one is going to completely nail the exact wording, beginning and ending times, particular list of affected subcomponents, etc., under such circumstances.
For now, the only way to get an idea of overall platform reliability is qualitatively, by reading the prose in the Infrastructure Log. In contrast to the status page messages described above, the log is written with the calm and reflection of a week or more’s distance from the event, after collating everyone’s views, examining internal logs and monitoring feeds, thinking extensively about who would have been affected and how, and then assembling a consensus plan for avoiding such problems in the future.
Some day, this will be complemented by an actual, credible numerical graph, based on automatically collected and reported, objective system measurements. The new (and excellent) capacity API is a small step in that direction. (Although obviously some generous soul would need to set up a Machine to poll it every 10 minutes, so there’s readily available history on tap.)
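To make the polling idea concrete, here is a rough sketch of what such a Machine might run. It is only an illustration: the `fetch` callable is a stand-in for whatever request the capacity API actually requires (no endpoint or response format is assumed here), and `record_sample`, `poll_forever`, and the JSON Lines history file are hypothetical choices.

```python
import json
import time
from typing import Callable, Optional


def record_sample(fetch: Callable[[], dict], path: str,
                  now: Optional[float] = None) -> dict:
    """Fetch one reading and append it, with a timestamp, to a JSON Lines file."""
    entry = {"ts": time.time() if now is None else now, "data": fetch()}
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry


def poll_forever(fetch: Callable[[], dict], path: str,
                 interval_seconds: int = 600) -> None:
    """Record a sample every interval_seconds (600 s = 10 minutes) until stopped."""
    while True:
        try:
            record_sample(fetch, path)
        except Exception as exc:
            # A failed poll is logged and skipped so the loop keeps running.
            print(f"poll failed: {exc}")
        time.sleep(interval_seconds)
```

In practice `fetch` would wrap an HTTP call to the capacity API and return its parsed response. Keeping one timestamped record per line means the history can be read back or graphed later without any special tooling.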
No one is super-happy with the status quo, but hopefully this clears things up a bit!
Still can’t connect via SSH. I’ve tried from multiple devices and VPN connections.
Are SSH issues Fly’s responsibility or mine?
Connecting to xxx:2... complete
Error: error connecting to SSH server: dial: connect tcp [xxx:2]:22: operation timed out
fly doctor -a xxxx
Testing authentication token... PASSED
Testing flyctl agent... PASSED
Testing local Docker instance... Nope
Pinging WireGuard gateway (give us a sec)... PASSED
Testing WireGuard DNS... PASSED
Testing WireGuard Flaps... PASSED
App specific checks for xxxx:
Checking that app has ip addresses allocated... PASSED
Checking A record for xxxx.fly.dev... PASSED
Checking AAAA record for xxxx.fly.dev... PASSED
Build checks for xxxx:
Checking docker context size (this may take little bit)... PASSED (17 MB)
Checking for .dockerignore... PASSED