emergency maintenance for 3 days?

Hi, one of my apps has been showing “We are performing emergency maintenance on a host some of your apps instances are running on. Apps may be unavailable until the maintenance is completed. - Service Interruption 2 days ago”

What can I do to get this app back online? This app has been up and running fine for 5-6 months without issue until 2 days ago.

If I try to restart it with fly apps restart, I get:

Error: server returned a non-200 status code: 503.

I tried redeploying it, which didn’t go any better:

Error: found 1 machines that are unmanaged. `fly deploy` only updates machines with fly_platform_version=v2 in their metadata. Use `fly machine list` to list machines and `fly machine update --metadata fly_platform_version=v2 <machine id>` to update individual machines with the metadata. Once done, `fly deploy` will update machines with the metadata based on your fly.toml app configuration

Which makes no sense since I’m on v2 and I can’t even get the machine status:

fly machine status 7811694b954928
Error: could not get machine 7811694b954928: failed to get VM 7811694b954928: invalid machine ID, ‘7811694b954928’

This happens even though that is the ID pulled from flyctl machine list.

What can I do here?

Thanks

Have the same problem

cc @kurt

Same here…


Today I see that they’ve added:

We have a registry that only keeps storage in the iad region, which we think will work around the issue you are seeing. Try it out with:

FLY_REGISTRY_HOST=registry-iad.fly.io fly deploy

Use this if you are doing docker push directly to the registry:

FLY_REGISTRY_HOST=registry-iad.fly.io fly auth docker
docker push registry-iad.fly.io/torem-app:latest

We are working to fix the core issue. We’ll report back once that’s done. In the meantime, we hope the above unblocks deployments.

However, neither of these works for me. This app is in the AMS region, and I’m getting the error:

Error: failed to fetch an image or build from source: error connecting to docker: failed building options: failed probing "personal": context deadline exceeded

and

FLY_REGISTRY_HOST=registry-iad.fly.io fly auth docker
docker push registry-iad.fly.io/torem-app:latest
Error: failed authenticating with registry-iad.fly.io: Error saving credentials: error storing credentials - err: exit status 1, out: `Post "http://ipc/registry/credstore-updated": dial unix backend.sock: connect: no such file or directory`

And now just simply:

Error: failed to update VM 7811694b954928: unknown: deploys to this host are temporarily disabled

Hey @holden, unfortunately your app only had one Machine with no standby Machines. The Registry issue is different to the underlying host issue affecting your app (see your Personalized Status page for details on that). The team is working on the host issue, but there is currently no ETA to share.

A fresh deploy should work to bring up a Machine on a working host. It seems a fix for the Registry issue has been implemented, but I would run the deploy with LOG_LEVEL=debug prepended, which should give you some more useful debugging info if it fails again.
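
The suggestion above can be sketched as a small script that sets the debug level for a single deploy and keeps a copy of the output. The DRY_RUN switch and the LOG_FILE name are assumptions for illustration, not flyctl settings; with DRY_RUN=1 (the default here) the script only prints the command it would run.

```shell
#!/usr/bin/env bash
# Sketch: run one deploy with debug logging and keep a copy of the output.
# DRY_RUN and LOG_FILE are illustrative names, not flyctl settings.
DRY_RUN="${DRY_RUN:-1}"
LOG_FILE="${LOG_FILE:-deploy-debug.log}"

debug_deploy() {
  if [ "$DRY_RUN" = "1" ]; then
    # Only show what would be executed.
    echo "LOG_LEVEL=debug fly deploy"
  else
    # LOG_LEVEL applies to this one command; tee saves stdout and stderr.
    LOG_LEVEL=debug fly deploy 2>&1 | tee "$LOG_FILE"
  fi
}

debug_deploy
```

Running it with DRY_RUN=0 from the directory that holds fly.toml performs the real deploy and leaves the debug output in deploy-debug.log, which is easier to search or share than terminal scrollback.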

I ran the deploy with LOG_LEVEL=debug, but I can’t say it gave me much more.

There’s also not much on the personalized status page; it just says:

  1. 2023-09-01 23:36:17 UTC: We are performing emergency maintenance on a host some of your apps instances are running on. Apps may be unavailable until the maintenance is completed.

DEBUG {
  "query": "\n# @genqlient\nmutation MachinesUpdateRelease ($input: UpdateReleaseInput!) {\n\tupdateRelease(input: $input) {\n\t\trelease {\n\t\t\tid\n\t\t}\n\t}\n}\n",
  "variables": {
    "input": {
      "clientMutationId": "",
      "releaseId": "VRl7314XXwvlkHbGLJkk51Z2",
      "status": "failed"
    }
  },
  "operationName": "MachinesUpdateRelease"
}

DEBUG {0x140011f98c0}
DEBUG <-- 200 https://api.fly.io/graphql (362.34ms)

DEBUG {
  "data": {
    "updateRelease": {
      "release": {
        "id": "VRl7314XXwvlkHbGLJkk51Z2"
      }
    }
  }
}

DEBUG Task manager done
Error: failed to update VM 7811694b954928: unknown: deploys to this host are temporarily disabled, please try again later or check the status page: https://status.flyio.net

This is a very simple app: it’s a Docker image of Ghost, and the Dockerfile is basically one line:

FROM ghost:5.60.0-alpine

In this case, I think that the easiest route would be deploying to a new app. Additionally, I would recommend that you ensure the app has a standby Machine for resiliency (though that is the default for new apps).
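
A minimal sketch of that route, assuming the existing fly.toml and Dockerfile are reused: create an app under a new name, deploy to it, and scale to two Machines. NEW_APP is a placeholder name and DRY_RUN is an illustrative switch, neither comes from this thread. Note that `fly scale count 2` adds a second running Machine, which may not be the same thing as the standby Machine mentioned above.

```shell
#!/usr/bin/env bash
# Sketch: recreate the app under a new name and keep two Machines.
# NEW_APP is a placeholder; DRY_RUN is an illustrative switch, not a flyctl setting.
NEW_APP="${NEW_APP:-my-ghost-blog-2}"
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"   # print the command instead of running it
  else
    "$@"
  fi
}

redeploy_as_new_app() {
  run fly apps create "$NEW_APP" &&          # register the new app name
  run fly deploy --app "$NEW_APP" &&         # deploy with the local fly.toml, overriding its app name
  run fly scale count 2 --app "$NEW_APP" &&  # ask for two Machines instead of one
  run fly machine list --app "$NEW_APP"      # check that both Machines exist
}

redeploy_as_new_app
```

If the app mounts a volume, each Machine needs its own volume and data is not replicated between them, so a second Machine is not a drop-in fix for apps that keep state on disk.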

Are Machines in the same region created on different physical hosts?

Our app has also been broken in Amsterdam for almost 24 hours now:

2023-09-05T16:13:27.348 runner[286560eae70de8] ams [info] machine exited with exit code 0, not restarting

could not find a good candidate within 90 attempts at load balancing. last error: no known healthy instances found for route tcp/443. (hint: is your app shutdown? is there an ongoing deployment with a volume or using the 'immediate' strategy? has your app's instances all reached their hard limit?)

And trying to redeploy the app fails, too:

$ fly deploy
==> Verifying app config
Validating /Users/foo/dev/app/remix/fly.toml
Platform: machines
✓ Configuration is valid
--> Verified app config
WARN DATABASE_URL may be a potentially sensitive environment variable. Consider setting it as a secret, and removing it from the [env] section: https://fly.io/docs/reference/secrets/

==> Building image
Waiting for remote builder fly-builder-falling-smoke-8492... 🌎
WARN The running flyctl agent (v0.1.81) is older than the current flyctl (v0.1.83).
WARN The out-of-date agent will be shut down along with existing wireguard connections. The new agent will start automatically as needed.
WARN Failed to start remote builder heartbeat: failed building options: agent: failed to start

Error: failed to fetch an image or build from source: error connecting to docker: failed building options: agent: failed to start
The agent failed to start with the following error log:



A copy of this log has been saved at /Users/foo/.fly/agent-logs/339000168.log
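
Since the warnings above point at an out-of-date agent that failed to start, one thing that could be tried is restarting the agent by hand before deploying again. This is only a sketch: nothing in this thread confirms it resolves the error, and DRY_RUN is an illustrative switch rather than a flyctl setting.

```shell
#!/usr/bin/env bash
# Sketch: restart the flyctl agent, run flyctl's checks, then retry the deploy.
# DRY_RUN is an illustrative switch, not a flyctl setting.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"   # print the command instead of running it
  else
    "$@"
  fi
}

retry_after_agent_restart() {
  run fly agent restart &&  # restart the background agent daemon
  run fly doctor &&         # run flyctl's built-in environment checks
  run fly deploy            # retry the deploy only if the checks passed
}

retry_after_agent_restart
```

The steps are chained with && so that a failing agent restart or doctor check stops the script before another deploy attempt.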

In the end I did what @kylemclaren recommended and just deleted the app completely and redeployed a fresh app. It’s all working now.
