Unable to deploy, local deployment gives 504 error, machine stuck at replacing

Not really sure what’s going on here. I’ve been using Fly for nearly 12 months, and have had zero problems deploying during that time with an existing project. I’ve created a new org and a new app, and the first deployment goes through, but any subsequent attempts to deploy result in the app becoming suspended, with the machine stuck in the “replacing” state.

The logs in the dashboard for that app are littered with

2023-07-13T09:57:32.638 runner[6e82d379c43dd8] lhr [info] Pulling container image registry.fly.io/shiftsync-api:deployment-01H57AAAXV82QMMYMB7GPHZZE8

There was also one instance of this error

2023-07-13T09:48:24.983 runner[6e82d379c43dd8] lhr [error] Pulling image failed, retrying...

And then it just ended up giving me all the same “Pulling container image” logs again.

My deployments are done via a GitHub Action, and the logs for those show:

image size: 391 MB
  [1/1] Updating 6e82d379c43dd8 [app]
Updating existing machines in 'shiftsync-api' with rolling strategy
Error: failed to update VM 6e82d379c43dd8: request returned non-2xx status, 504

Error: Process completed with exit code 1.

I get the same thing if I try and deploy from my PC as well.

I’ve tried using the --local-only and --remote-only flags, but neither seems to make a difference. When doing it remote only, I have also tried deleting the builder app and letting it get recreated, but this also makes no difference.

In an act of desperation, I destroyed the app and created a new one, and the deploy went through, making me think it had just been an odd case with the previous app I’d created, but now once again, any subsequent deployments are failing with all of the above again.

I’ve tried running LOG_LEVEL=debug fly deploy --local-only, but honestly, I’m not sure what I’m looking for there. I had seen someone say it might tell me what request is giving the 504 error, but there’s nothing I can see that includes that.
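One way to narrow that debug output down is to save it to a file and search it for the status code. This is only a sketch: it assumes nothing about the debug format beyond strings already seen in this thread (the 504 status and the api.machines.dev host), and `find_504_lines` and `deploy-debug.log` are hypothetical names chosen for illustration.

```shell
#!/usr/bin/env bash
# Sketch: print the lines of a saved deploy log that mention a 504 status
# or the Machines API host. The function name and log file name are
# hypothetical; the exact flyctl debug output format is not assumed.
find_504_lines() {
  local logfile="$1"
  # -n prefixes each match with its line number; -E enables extended regex.
  # "504" must stand alone, so a value such as 15040 is not reported.
  grep -nE '(^|[^0-9])504([^0-9]|$)|api\.machines\.dev' "$logfile"
}

# Usage (commented out so sourcing this file does not trigger a deploy):
# LOG_LEVEL=debug fly deploy --local-only 2>&1 | tee deploy-debug.log
# find_504_lines deploy-debug.log
```

If a request URL shows up next to the 504, that is probably the call worth quoting in a support request.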

This is becoming problematic for me because while it’s in this state, the app is not accessible, and it doesn’t seem like I can even roll back.

Is there anything else I can do or try?

I’m seeing a similar problem.

--> Pushing image done
image: registry.fly.io/my-app
image size: 556 MB

Watch your app at https://fly.io/apps/my-app/monitoring

Running my-app release_command: bin/rails fly:release
  Updating release_command machine e784e679b50108
  Waiting for e784e679b50108 to have state: stopped
Error: release command failed - aborting deployment. error running release_command machine: timeout reached waiting for machine to stopped failed to wait for VM e784e679b50108 in stopped state: Get "https://api.machines.dev/v1/apps/my-app/machines/e784e679b50108/wait?instance_id=01H576V6RQ2T1ED17SXP9PYXEK&state=stopped&timeout=60": net/http: request canceled
note: you can change this timeout with the --wait-timeout flag

It’s not my release command (which just runs DB migrations, and this release has no new migrations), and it has been happening consistently since our last successful deploy yesterday at 16:42 GMT.

Hello,

I’m having what I think might be the same problem with my (old, Nomad-based) Postgres cluster. In my case the logs for the app display this:

2023-07-13T12:39:13Z app[a1a6916e] ams [info]keeper   | 2023-07-13T12:39:13.180Z	FATAL	cmd/keeper.go:2118	cannot create keeper: cannot create store: cannot create kv store: Unexpected response code: 500 (No cluster leader)
2023-07-13T12:39:13Z app[a1a6916e] ams [info]keeper   | exit status 1
2023-07-13T12:39:13Z app[a1a6916e] ams [info]keeper   | restarting in 5s [attempt 1]
2023-07-13T12:39:13Z app[a1a6916e] ams [info]sentinel | 2023-07-13T12:39:13.448Z	FATAL	cmd/sentinel.go:2030	cannot create sentinel: cannot create store: cannot create kv store: Unexpected response code: 500 (No cluster leader)
2023-07-13T12:39:13Z app[a1a6916e] ams [info]sentinel | exit status 1
2023-07-13T12:39:13Z app[a1a6916e] ams [info]sentinel | restarting in 3s [attempt 1]
2023-07-13T12:39:13Z app[a1a6916e] ams [info]panic: error checking stolon status: cannot create kv store: Unexpected response code: 500 (No cluster leader)
2023-07-13T12:39:13Z app[a1a6916e] ams [info]: exit status 1
2023-07-13T12:39:13Z app[a1a6916e] ams [info]goroutine 9 [running]:
2023-07-13T12:39:13Z app[a1a6916e] ams [info]main.main.func2(0xc0000ce000, 0xc000082a80)
2023-07-13T12:39:13Z app[a1a6916e] ams [info]	/go/src/github.com/fly-examples/postgres-ha/cmd/start/main.go:81 +0x72c
2023-07-13T12:39:13Z app[a1a6916e] ams [info]created by main.main
2023-07-13T12:39:13Z app[a1a6916e] ams [info]	/go/src/github.com/fly-examples/postgres-ha/cmd/start/main.go:72 +0x43b
2023-07-13T12:39:14Z app[a1a6916e] ams [info] INFO Main child exited normally with code: 2
2023-07-13T12:39:14Z app[a1a6916e] ams [info] WARN Reaped child process with pid: 260 and signal: SIGKILL, core dumped? false
2023-07-13T12:39:14Z app[a1a6916e] ams [info] WARN Reaped child process with pid: 263 and signal: SIGKILL, core dumped? false

Trying to restart it gives the same output.

I tried restoring the snapshot to a different volume with fly postgres create --snapshot-id, and also trying to create a new volume on a different machine from a snapshot (fly volumes create pg_data --snapshot-id vs... --size 10), but the result is the same:

Error: failed creating volume: server returned a non-200 status code: 504

I also tried upgrading the app to v2, but it still gives me an error.

I don’t know what to do. I would at least need to be able to access these snapshots to migrate the data to a different instance.

I’ve been facing the same issue for the past several days. From what I saw, container pulls ran very slowly. Sometimes a deployment ended up succeeding after a good 10-15 minute wait; other times it just couldn’t start at all.

Seeing the same issue in LHR over the past 48 hours. Constant timeouts after 60 secs on a relatively small (1.1 GB) image.

Hi All,

We identified an issue affecting two hosts in lhr. This was causing machine updates to take longer than usual and/or time out. We believe the issue is now resolved. If you retry your deployments, they should succeed.

Apologies for the inconvenience. We’ve added additional logging on these hosts that will help catch and prevent these issues in the future.
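For anyone who wants to script the retry in CI, a minimal sketch of a generic wrapper follows. It is not an official recommendation from this thread: the `retry` helper, the attempt count, and the delay are all assumptions, and retrying only helps when the failure is transient on the host side.

```shell
#!/usr/bin/env bash
# Sketch: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
# The helper name and its arguments are hypothetical, not part of flyctl.
retry() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  while true; do
    # Stop as soon as the command exits with status 0.
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
}

# Usage (commented out so sourcing this file does not trigger a deploy):
# retry 3 30 fly deploy --remote-only
```

A fixed delay keeps the sketch short; a longer or increasing delay may be kinder to a host that is already struggling.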

Hi Sam,

I’m deploying in ‘SIN’, and I frequently get this error when deploying through a GitHub Action:

Error: failed to update VM e2865d47bd2e86: request returned non-2xx status, 504

I am getting the same issue when deploying from the fly CLI: request returned non-2xx status, 504

I’m unable to restart the machine, as it won’t shut down.
