Service Interruption: Can't Destroy Machine, Deploy, or Restart

Hey team, the health checks on our Fly app were failing this morning, so I logged in to diagnose. I can see the machine is in the “stopped” state, and there is a little banner saying:

Service Interruption 3 hours ago
We are performing emergency maintenance on a host some of your apps instances are running on.

and no other information. I can’t see an issue on the status page, and I’m not sure where else to look for resolution steps or an ETA.

I’m unable to restart or destroy the stopped machine instances (the commands time out), and trying to re-deploy the app throws an error:

Error: found 1 machines that are unmanaged. `fly deploy` only updates machines with fly_platform_version=v2 in their metadata. Use `fly machine list` to list machines and `fly machine update --metadata fly_platform_version=v2 <machine id>` to update individual machines with the metadata. Once done, `fly deploy` will update machines with the metadata based on your fly.toml app configuration
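
For reference, the fix the error is pointing at would look roughly like this (the machine ID and app name below are placeholders):

# list machines and note the unmanaged machine's ID
fly machine list --app my-app
# tag it so `fly deploy` will manage it
fly machine update --metadata fly_platform_version=v2 1234567890abcd --app my-app
# then re-run the deploy
fly deploy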

Not sure what else to try at this point short of re-creating a new app or putting it up on another host. Any info you can provide?

3 Likes

We also experienced this. One of our machines seems to have gone into zombie mode; it’s unreachable for all fly machine commands, reporting “Error: could not get machine.”

I’ve been able to restore availability to our app by using fly scale to allocate more VMs. However, the bad VM continues to exist in an indeterminate state and can’t be destroyed or removed from the account. fly machine list shows invalid data for the VM, such as a creation date of “1970-01-01T00:00:00Z”.
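
In case it helps anyone else, the stopgap was roughly this (app name and count are placeholders; I’m going from memory on the exact invocation):

# add a VM so the app has healthy capacity again
fly scale count 2 --app my-app
# the zombie still shows up here with bogus data
fly machine list --app my-app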

I would appreciate any advice on how to remediate this.

We’re seeing the same issue in our app, as of nearly 14 hours ago - and ours is for the database container, so it’s not as simple as scaling up to restore access :frowning:

We have a paid plan so I emailed support several hours ago, but no reply as of yet.

For the record, our app is hosted in syd, so maybe one or two hosts are having issues there?

I’m also getting this - just like @mfwgenerics, I worked around this by scaling to create a new machine, but still have the original machine in a state where it can’t be destroyed:

Error: could not get machine [machine ID]: failed to get VM [machine ID]: unavailable: dial tcp [ipv6]:3593: connect: connection refused

My staging environment is in the same state, but I’m not adding more VMs there - VMs I’m surely going to be billed for - until I know I can clean up the zombies.

Just like OP, I see the same error in the dashboard about emergency maintenance. That’s been there for 15 hours, with no other information.

This is the second time I’ve had this kind of issue with Fly, where my service just goes down, Fly reports everything healthy, and there’s literally no information and nothing I can really do other than wait and hope it comes back up sometime (hours later, probably).

I appreciate the convenience that Fly offers, but these kinds of problems erode my trust in this platform completely. Heroku had its faults, but I was never left scouring a forum trying to get my service back up - if a host was unhealthy, my dyno would be automatically moved, no worries. I’m running a small-scale, golden-path Rails app with Postgres; I can’t imagine trying to fix these kinds of problems on a more complex app.

2 Likes

Adding some more information here since I’m also surprised that this is still ongoing 12 hours later with no response.

  • We had four machines (app + Postgres for staging and production) running yesterday, and three of the four (including both databases) are still down and can’t be accessed. I can replicate the issues others have mentioned here.
  • This is our company’s external API app, so the issue broke all of our integrations.
  • Our team ended up setting up a new project in Fly to spin up an instance to keep us going, which took a couple of hours (backfilling environment variables and configuration etc.; not a bad test of our DR ability). Rough commands are sketched after this list.
  • There is no way I can find to get the data from the db machines. Thank goodness this isn’t our main production db, and we were able to reconstruct the data we needed there.
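
Rough sketch of the rebuild, with placeholder app names and secrets (our real setup involved more configuration, and I may be misremembering exact flags):

# create the replacement app without deploying yet
fly launch --name api-replacement --region syd --no-deploy
# backfill secrets (values here are placeholders)
fly secrets set DATABASE_URL=postgres://... SECRET_TOKEN=changeme --app api-replacement
# deploy from the existing Dockerfile / fly.toml
fly deploy --app api-replacement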

Very keen to hear what’s happening with this, and why there’s been no further info or updates after so many hours.

As an aside, it’s kind of a kick in the teeth to see the status page for our organization reporting no incidents - the same page that lists our apps as under maintenance and inaccessible!

1 Like

Confirming my deployment is in syd too. I’m still seeing the zombie VM and observing failing CLI commands against the machine.

We have syd deployments for all our apps too.

I’m feeling very lucky that none of our paid production apps or databases are currently affected (only our development environment is), but I’m also really surprised that the issue has been ongoing for 17 hours now with no status page update, no notifications (beyond betterstack letting us know it was down), and only one note on the app with not much info as to what’s going on.

It really worries me what would happen if it were one of our paid production instances that was affected - the data we’re working with can’t simply be ‘recovered’ later; it’d just get dropped until service resumed or we migrated to another region to get things running again.

Keen to know what’s wrong and what’s being done about it.

2 Likes

The message has now been updated:

Service Interruption (20 hours ago)
We are continuing to investigate an infrastructure related issue on this host.

Still no incidents listed on the status page for the SYD region though :thinking:

1 Like

Not sure if it’s connected, but a Redis app of ours in lhr fell into suspended status overnight, which killed an important demo.

machine is a zombie…

machine [id] was found and is currently in a stopped state, attempting to kill…
Error: could not kill machine [id]: failed to kill VM [id]: failed_precondition: machine not in known state for signaling, stopped

I got a response from support a few hours ago -

Unfortunately this host managed to get into an extremely poor state, and a fix is taking longer than expected. We have a team continuing to work on it, but no estimated resolution time to share right now. As soon as we have an update we will let you know.

So I guess we just wait…

1 Like

Same issue here for me, on a host in syd. It’s completely broken a pg cluster.

The absence of any proactive status updates on this issue has been really poor.

Thank you for sharing that update, surprised there is no status update from Fly yet though :cold_sweat:

I can appreciate that the issue might be taking up a lot of time and they want to focus on fixing it first - but even just a message from the staff here earlier would put me at ease about our production apps that are still running.

We worked out we could create a new Postgres cluster from one of the snapshots of the currently-down app - so our app is back up and running.

(We had to create it with a different name, and then when we tried to make another one with the previous name, flyctl put the cluster on the same currently-down host! Oops)
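
For anyone in the same position, the restore looked roughly like this (volume/snapshot IDs and app names are placeholders, and I’m going from memory on the exact flags):

# find the volume behind the down Postgres app
fly volumes list --app old-pg-app
# list its snapshots and pick a recent one
fly volumes snapshots list vol_xxxxxxxxxxxx
# create a new cluster from that snapshot, under a different name
fly postgres create --name new-pg-app --region syd --snapshot-id vs_xxxxxxxxxxxx
# re-point the app at the new cluster
fly postgres attach new-pg-app --app my-app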

Also having this issue. Scale worked for the Phoenix server, but the Postgres server is also dead.

And I can’t even restore the Postgres one:

Error: failed to create volume: Couldn't allocate volume, not enough compute capacity left in yyz

There’s a known incident listed on the status page for YYZ that might be related: “Fly.io Status - We are undergoing emergency vendor hardware replacement in YYZ region.”

Still crickets for the down host in SYD though :frowning:

Yes, mine has been fixed.

Really weird how radio-silent it’s been :thinking: we’re coming up on 48 hours now.

Just a note that the status update for me now states that the service interruption was resolved 7 hours ago:

Service Interruption resolved 7 hours ago
We are continuing to investigate an infrastructure related issue on this host.

I still had to manually restart the machine to bring my app back up, but I’ve been able to actually interact with machines now, so I guess it is resolved.
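
For reference, the restart was just the following (machine ID and app name are placeholders):

# restart the machine now that the host is reachable again
fly machine restart 1234567890abcd --app my-app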

Never heard anything from Fly, just complete silence. No status page updates either. I’m sympathetic to the problems I imagine Fly has scaling their service and support, but my takeaway from this experience is that if something outside my own control happens with Fly, there’s nothing I can do to find out what’s going on, when it will be resolved, or whether there’s anything I can do myself to fix it. It sounds like even the paid email support has a multi-hour response time, and even then it’s just going to be a “we’re working on it”. I can’t recommend Fly professionally with that kind of experience, and I’m not sure I can even tolerate it for personal apps.

3 Likes

Update: my bad VM has finally been restored after a couple of days.

I am concerned about the lack of clarity and communication around what happened, but I’m happy to put this situation down to growing pains on Fly’s part. I think I’ll be sticking to non-critical, non-stateful workloads for the near term though. :sweat_smile:

1 Like