fly.io site is currently inaccessible...

simoncocking · November 26, 2024, 6:41am

$ until fly ssh console -a myapp --pty --machine machine-id; do sleep 1; done
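
The one-liner above retries forever at a fixed one-second interval. A bounded variant might look like the sketch below. It assumes a POSIX shell; the `retry` helper and the `RETRY_INITIAL_DELAY` / `RETRY_MAX_DELAY` variables are made up for illustration and are not part of flyctl.

```shell
# retry MAX_ATTEMPTS COMMAND [ARGS...]
# Runs COMMAND until it succeeds, or gives up after MAX_ATTEMPTS tries.
# RETRY_INITIAL_DELAY and RETRY_MAX_DELAY (seconds) are hypothetical knobs
# invented for this sketch; they are not flyctl settings.
retry() {
  retry_max=$1
  shift
  retry_attempt=1
  retry_delay=${RETRY_INITIAL_DELAY:-1}
  until "$@"; do
    if [ "$retry_attempt" -ge "$retry_max" ]; then
      echo "retry: giving up after $retry_attempt attempts" >&2
      return 1
    fi
    sleep "$retry_delay"
    retry_attempt=$((retry_attempt + 1))
    # Double the wait each time, but never beyond the cap.
    retry_delay=$((retry_delay * 2))
    if [ "$retry_delay" -gt "${RETRY_MAX_DELAY:-30}" ]; then
      retry_delay=${RETRY_MAX_DELAY:-30}
    fi
  done
}
```

With that defined, the same command could be run as: retry 20 fly ssh console -a myapp --pty --machine machine-id. Backing off between attempts also avoids hammering an API that is already struggling.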

henrycatalinismith · November 26, 2024, 7:03am

Wow this looks like it was a rough night for the team at Fly. It was shaky when I went to bed and now looking again after waking up, you’ve been working on this the whole time I’ve been asleep.

This incident ruined my evening a little yesterday, too. I mistook the side effects of the platform incident for operational issues with my individual site and spent a good hour or so attempting various fixes before I decided to check the Fly status page. I even restored the database from a backup at one point.

Based on that experience, one little idea I’d love to contribute to the post-mortem discussions is the possibility of having the fly CLI print that same helpful “We’re addressing an incident. Please check the status page for more details” message that the dashboard shows during an incident. For many of us the CLI is the dashboard after all.

Good luck getting everything back in working order. Looks to have been a nasty one.

kylemclaren · November 26, 2024, 7:16am

Ordinarily it should. See this PR:

github.com/superfly/flyctl

Show warning for active incidents

superfly:master ← superfly:aschiavo/statuspage

opened 08:53AM - 03 Jun 24 UTC

aschiavo

+219 -0

### Change Summary

This change adds a background task that queries status.flyio….net for unresolved incidents for every command. If any incident is returned, it will show a warning to the user. The warning includes instructions to either check statuspage or execute the specific command for listing the incidents, which provides extended information.

How:

- Adding a background task that is enqueued from a common preparer so it's executed for every command. It can be disabled by setting `FLY_NO_INCIDENTS_CHECK` env var to a truthy value. The check is ignored for the more specific `incidents list` command.
- Adding an `incidents list` command which will provide detailed information about the unresolved incidents.

### Documentation

- [ ] Fresh Produce
- [ ] In superfly/docs, or asked for help from docs team
- [x] n/a

But if the flyctl command itself is timing out or otherwise erroring I could see how that can fail.

rodolfo · November 26, 2024, 7:18am

I moved to fly.io 3 weeks ago, and there are so many good things about the platform, but this whole event is one of those things that is hard to come back from. The company I am consulting for had some reservations about having their infra here based on the previous outages, and it is going to be very hard to convince them otherwise.

Danecki · November 26, 2024, 7:22am

I also cannot deploy anything. Every attempt fails with this message:

Error: server returned a non-200 status code: 504

thunderbolt.sanchez · November 26, 2024, 7:25am

For real. Got an account maybe a couple months ago…kicked the tires. Then about 10 days ago started in earnest to build my infra here. Bad timing.

Stumbled on this a few weeks ago when I was deciding on whether or not to take the plunge. Looks like it’s still a struggle with the ‘Status paging’ bullet. And that post by the CEO was over 1.5 years ago. I’m thinking things are more beta than release.

We’re all in the dark…our infra’s been down the same way all day despite whatever words are on the update page. And now, I guess there’s at least a few like me that are just watching out of morbid curiosity, watching to see if it can hit the 24-hour mark.

ACPixel · November 26, 2024, 7:44am

Things seem to be slowly coming back online on my end (knock on wood).

thunderbolt.sanchez · November 26, 2024, 8:06am

Same. I can refresh deployments/images.

simoncocking · November 26, 2024, 8:20am

The API does appear to be responding normally, but our FLAME runners are still fubar. Instances appear to start but are unresponsive, timing out after 30 seconds.

henrycatalinismith · November 26, 2024, 8:22am

Hmmm. Status page says everything’s fine now. My app was still down from last night’s failed deploy so I’ve tried to redeploy it. That didn’t work, so I tried scaling it down to 0 instances and back up again. That didn’t work so I’ve scaled back down to 0 to try to get a clean slate to restore from.

Now my app is in a state where fly machine list says “No machines are available on this app.” Meanwhile, if I run fly logs, I can see my application’s usual background log chatter running actively on a machine ID that fly machine status claims doesn’t exist anymore.

Maybe the fire is out so to speak but it seems there is some cleanup work ahead.

mingrammer · November 26, 2024, 8:29am

I can deploy now too.

hosty · November 26, 2024, 8:31am

I am still getting failed builds and unable to access my app. I get a failed build after 33s.

hosty · November 26, 2024, 8:35am

For what it’s worth, my deployment is now giving a 502 when visiting instead of 504.

simoncocking · November 26, 2024, 8:36am

Only 302 more visits to get a 200

hosty · November 26, 2024, 8:52am

I hope they are still looking at it and will update the status page. It’s pretty unacceptable of them to placate us that way. Although some issues may be resolved, they should check the active community forums and open issues before marking something as resolved.

fredwu · November 26, 2024, 9:19am

I sincerely hope they will also update their uptime: Fly.io Status - Uptime History

Still showing “No downtime recorded on this day.” as of now.

scottyeung · November 26, 2024, 9:52am

CPU load maxed out once I redeployed again

antondyad · November 26, 2024, 10:35am

This is more of a “+1” / “same here” reply.
We’re having inexplicable issues with our apps too, and we’ve tried different regions.

Even fly machines list can return different results 1 sec apart (sometimes “no machines”, sometimes the correct list).

tomc · November 26, 2024, 10:51am

Another +1 here. We’re also still experiencing issues with at least one of our apps (an internal service that appears to be unreachable by other apps despite Fly saying the app is running).

We’ve been running production apps on Fly for just over a year now and love many things about the platform when it works, but the severity and duration of this incident are making us rethink using Fly going forward. Each of our apps runs at least 2 machines to offer some level of redundancy in case of failure, but that’s irrelevant if the entire cloud service is down.

matthewsinclair · November 26, 2024, 10:52am

I have the same issue since last night. An old app that I have not touched for a while is working. But the app I am currently working on and which I have been trying to deploy since last night will not come up and stay up. There’s a long Slack discussion here: Slack
