Deploy stalls out, GitHub Action never completes push to production

I’ve created a fullstack Remix app which has been deployed for a few weeks now at goods.fly.dev. However, in the past few days I haven’t been able to deploy new changes. I push a change and then the GitHub Action fails (I’ve reached out to GitHub as well to ask about this). I don’t know what could have changed to make the deploy not go through.

My fly logs look like this and indicate an issue with healthchecks:


$ fly logs -a goods
Waiting for logs...
2023-01-28T22:20:36.633 app[7f061b8d] ord [info] GET /healthcheck 200 - - 5.955 ms
2023-01-28T22:20:46.642 app[7f061b8d] ord [info] HEAD / 200 - - 3.860 ms
2023-01-28T22:20:46.644 app[7f061b8d] ord [info] GET /healthcheck 200 - - 7.167 ms
2023-01-28T22:21:06.660 app[7f061b8d] ord [info] HEAD / 200 - - 2.968 ms
2023-01-28T22:21:06.662 app[7f061b8d] ord [info] GET /healthcheck 200 - - 6.081 ms
2023-01-28T22:21:16.668 app[7f061b8d] ord [info] HEAD / 200 - - 2.329 ms
2023-01-28T22:21:16.669 app[7f061b8d] ord [info] GET /healthcheck 200 - - 5.577 ms
2023-01-28T22:21:26.676 app[7f061b8d] ord [info] HEAD / 200 - - 2.813 ms
2023-01-28T22:21:26.677 app[7f061b8d] ord [info] GET /healthcheck 200 - - 5.518 ms
2023-01-28T22:21:36.685 app[7f061b8d] ord [info] HEAD / 200 - - 3.011 ms
2023-01-28T22:21:36.686 app[7f061b8d] ord [info] GET /healthcheck 200 - - 6.036 ms
2023-01-28T22:21:46.692 app[7f061b8d] ord [info] HEAD / 200 - - 2.317 ms

Any ideas for troubleshooting or fixing this?

Hi,

I’m not sure it is an issue with healthchecks: that log shows fast responses with a 200 status code, which is what you want to see. 200 means all is well.

The question is at what point it fails. If you click on the GitHub Action on their page, you can see its log output. It shows a lot of detail and should list the steps it takes (npm install, fly deploy, etc.). Presumably one of them is failing, and as a result the whole deploy is. Once you see which command is failing, that should indicate how to fix it.

Hey Greg, thanks for the reply. I did find it odd that the response code was 200. I can’t figure out why it just keeps looping back like it does.

Unfortunately the GitHub Action log is inconclusive. It looks like everything goes well, but then it never makes it to the next step, times out after a few hours, and is cancelled.

This action ran for almost six hours before stopping, and it gives no reason for the cancellation. But maybe you’re right and GitHub needs to look into this for me.

Here’s the log:

Run superfly/flyctl-actions@1.3
/usr/bin/docker run --name c523cbaf5f61e41f1a06f704bc1613efb_ee483e --label 49859c --workdir /github/workspace --rm -e "FLY_API_TOKEN" -e "INPUT_ARGS" -e "HOME" -e "GITHUB_JOB" -e "GITHUB_REF" -e "GITHUB_SHA" -e "GITHUB_REPOSITORY" -e "GITHUB_REPOSITORY_OWNER" -e "GITHUB_REPOSITORY_OWNER_ID" -e "GITHUB_RUN_ID" -e "GITHUB_RUN_NUMBER" -e "GITHUB_RETENTION_DAYS" -e "GITHUB_RUN_ATTEMPT" -e "GITHUB_REPOSITORY_ID" -e "GITHUB_ACTOR_ID" -e "GITHUB_ACTOR" -e "GITHUB_TRIGGERING_ACTOR" -e "GITHUB_WORKFLOW" -e "GITHUB_HEAD_REF" -e "GITHUB_BASE_REF" -e "GITHUB_EVENT_NAME" -e "GITHUB_SERVER_URL" -e "GITHUB_API_URL" -e "GITHUB_GRAPHQL_URL" -e "GITHUB_REF_NAME" -e "GITHUB_REF_PROTECTED" -e "GITHUB_REF_TYPE" -e "GITHUB_WORKFLOW_REF" -e "GITHUB_WORKFLOW_SHA" -e "GITHUB_WORKSPACE" -e "GITHUB_ACTION" -e "GITHUB_EVENT_PATH" -e "GITHUB_ACTION_REPOSITORY" -e "GITHUB_ACTION_REF" -e "GITHUB_PATH" -e "GITHUB_ENV" -e "GITHUB_STEP_SUMMARY" -e "GITHUB_STATE" -e "GITHUB_OUTPUT" -e "RUNNER_OS" -e "RUNNER_ARCH" -e "RUNNER_NAME" -e "RUNNER_TOOL_CACHE" -e "RUNNER_TEMP" -e "RUNNER_WORKSPACE" -e "ACTIONS_RUNTIME_URL" -e "ACTIONS_RUNTIME_TOKEN" -e "ACTIONS_CACHE_URL" -e GITHUB_ACTIONS=true -e CI=true -v "/var/run/docker.sock":"/var/run/docker.sock" -v "/home/runner/work/_temp/_github_home":"/github/home" -v "/home/runner/work/_temp/_github_workflow":"/github/workflow" -v "/home/runner/work/_temp/_runner_file_commands":"/github/file_commands" -v "/home/runner/work/goods/goods":"/github/workspace" 49859c:523cbaf5f61e41f1a06f704bc1613efb deploy --image registry.fly.io/goods:main-c1587b56d7f5b5e87a91a67a9154d28faf070c90
==> Verifying app config
--> Verified app config
==> Building image
Searching for image 'registry.fly.io/goods:main-c1587b56d7f5b5e87a91a67a9154d28faf070c90' remotely...
image found: img_0lq7472nm7op6x35
==> Creating release
--> release v11 created
Logs: https://fly.io/apps/goods/monitoring

--> You can detach the terminal anytime without stopping the deployment
==> Monitoring deployment

v11 is being deployed
Error: The operation was canceled.

Hi,

No problem. Ah, that log from GitHub is interesting. Yep, it shows where the deploy is failing, but not why. Strange!

One tip: you probably don’t want to leave jobs running until they hit the maximum (6 hours) and time out, especially if you are paying by the minute for an Action. I’d recommend adding a timeout to your job. If a deploy takes, say, 15 minutes, it’s probably safe to assume it has gone wrong and won’t complete (deploys should take seconds or minutes, not hours). For example:

jobs:
  example:
    timeout-minutes: 15
    runs-on: ubuntu-latest
    steps:

See Workflow syntax for GitHub Actions - GitHub Docs

As for the healthcheck, that looks like it is being called by Fly itself, independent of the deploy: the timings show a call roughly every 10 seconds, which suggests an automated process. You can confirm that by looking at your fly.toml file. I’d assume you have a healthcheck defined in there, set to run every 10 seconds. If not, perhaps you have some kind of uptime bot (Cloudflare, Pingdom or some such). Either way, I don’t think it is related to the deployment timing out.
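
One way to double-check the cadence is to measure the gaps between the /healthcheck lines in the log excerpt from the first post. This is only a minimal Python sketch, assuming the line format shown there (timestamp as the first field, request path later on the line). `healthcheck_gaps` is an illustrative helper name, not part of any Fly tooling:

```python
from datetime import datetime

# The GET /healthcheck lines copied from the `fly logs` excerpt earlier in the thread.
LOG_LINES = [
    "2023-01-28T22:20:36.633 app[7f061b8d] ord [info] GET /healthcheck 200 - - 5.955 ms",
    "2023-01-28T22:20:46.642 app[7f061b8d] ord [info] HEAD / 200 - - 3.860 ms",
    "2023-01-28T22:20:46.644 app[7f061b8d] ord [info] GET /healthcheck 200 - - 7.167 ms",
    "2023-01-28T22:21:06.662 app[7f061b8d] ord [info] GET /healthcheck 200 - - 6.081 ms",
    "2023-01-28T22:21:16.669 app[7f061b8d] ord [info] GET /healthcheck 200 - - 5.577 ms",
    "2023-01-28T22:21:26.677 app[7f061b8d] ord [info] GET /healthcheck 200 - - 5.518 ms",
    "2023-01-28T22:21:36.686 app[7f061b8d] ord [info] GET /healthcheck 200 - - 6.036 ms",
]


def healthcheck_gaps(lines, path="/healthcheck"):
    """Return the gaps, in whole seconds, between consecutive requests to `path`."""
    stamps = [
        # Assumption: the first whitespace-separated field is the timestamp.
        datetime.strptime(line.split()[0], "%Y-%m-%dT%H:%M:%S.%f")
        for line in lines
        if f" {path} " in line
    ]
    return [round((later - earlier).total_seconds())
            for earlier, later in zip(stamps, stamps[1:])]


gaps = healthcheck_gaps(LOG_LINES)
print(gaps)  # [10, 20, 10, 10, 10]
```

Every gap is a multiple of roughly 10 seconds, which is what you would expect from a fixed-interval check rather than from real traffic.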

Thanks for the tip, Greg! I added that line to my deploy config. Also, you’re right in that the fly.toml is configured to run the health check every 10s. Here’s my fly.toml. I tried just removing the healthcheck block and pushing that but there was no change. I don’t have a good enough understanding of this stuff to really diagnose the problem.

app = "goods"
kill_signal = "SIGINT"
kill_timeout = 5
processes = [ ]

[experimental]
allowed_public_ports = [ ]
auto_rollback = true
cmd = "start.sh"
entrypoint = "sh"

[mounts]
source = "data"
destination = "/data"

[[services]]
internal_port = 8080
processes = [ "app" ]
protocol = "tcp"
script_checks = [ ]

  [services.concurrency]
  hard_limit = 25
  soft_limit = 20
  type = "connections"

  [[services.ports]]
  handlers = [ "http" ]
  port = 80
  force_https = true

  [[services.ports]]
  handlers = [ "tls", "http" ]
  port = 443

  [[services.tcp_checks]]
  grace_period = "1s"
  interval = "15s"
  restart_limit = 0
  timeout = "2s"

  [[services.http_checks]]
  interval = "10s"
  grace_period = "5s"
  method = "get"
  path = "/healthcheck"
  protocol = "http"
  timeout = "2s"
  tls_skip_verify = false
  headers = { }

Hi,

No problem.

Ah, yep, that is the healthcheck and that’s why you get that showing in the logs. That’s fine, you want that to be there to check all is well.

The only thing I can suggest is seeing whether you can get more debugging info into the GitHub log. Currently it just says “deploying” and nothing else, which isn’t very helpful. Maybe change:

run: flyctl deploy --remote-only

to

run: LOG_LEVEL=debug flyctl deploy --remote-only

(or the equivalent in your action). I’m not sure that syntax is correct (you may need to specify the variable differently), but the idea is to tell the Fly CLI you want more debug data, which it won’t show by default. At worst it will fail, but that’s fine as it’s not working anyway. It may show why it’s getting stuck and timing out.
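
If you would rather try the same idea from a local script than from the workflow file, here is a rough Python sketch. It sets `LOG_LEVEL=debug` in the child process environment and gives up after a fixed time instead of hanging for hours. `run_with_debug` and the 15-minute default are illustrative assumptions, not anything provided by flyctl:

```python
import os
import subprocess
import sys


def run_with_debug(cmd, timeout_s=900):
    """Run `cmd` with LOG_LEVEL=debug set; return None if it exceeds `timeout_s`."""
    # Copy the current environment and add the debug variable for the child only.
    env = dict(os.environ, LOG_LEVEL="debug")
    try:
        return subprocess.run(
            cmd, env=env, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # The child is killed by subprocess.run before this exception is raised.
        return None


# Hypothetical usage (needs flyctl installed and authenticated):
#   result = run_with_debug(["flyctl", "deploy", "--remote-only"])
#   if result is None:
#       sys.exit("deploy did not finish within the timeout")
#   print(result.stdout, result.stderr)
```

A `None` return means the deploy was still running when the timeout hit, which is the same stuck state the six-hour Action run ended in.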

I ran that command and got some additional info. This would be repeatedly logged to the console:

DEBUG --> POST https://api.fly.io/graphql

{
  "query": "query ($appName: String!, $deploymentId: ID!, $evaluationId: String!) { app(name: $appName) { deploymentStatus(id: $deploymentId, evaluationId: $evaluationId) { id inProgress status successful description version desiredCount placedCount healthyCount unhealthyCount allocations { id idShort status region desiredStatus version healthy failed canary restarts checks { status serviceName } } } } }",
  "variables": {
    "appName": "goods",
    "deploymentId": "1a38eaf1-1d4c-f648-5bfa-328595a21420",
    "evaluationId": "4046b771-cd2d-b68f-6a75-fe53d8e33895"
  }
}

DEBUG {}
DEBUG <-- 200 https://api.fly.io/graphql (81.19ms)

{
  "data": {
    "app": {
      "deploymentStatus": {
        "id": "1a38eaf1-1d4c-f648-5bfa-328595a21420",
        "inProgress": true,
        "status": "running",
        "successful": false,
        "description": "Deployment is running",
        "version": 1,
        "desiredCount": 1,
        "placedCount": 0,
        "healthyCount": 0,
        "unhealthyCount": 0,
        "allocations": []
      }
    }
  }
}
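
For anyone reading that response later, here is a small Python sketch of how the counters could be read. The field names come straight from the response above. The interpretation is an assumption on my part, not documented Fly behaviour: a desired instance with `placedCount` 0 and an empty `allocations` list would mean nothing has been placed on a host yet. `summarize` is an illustrative name only:

```python
import json

# The deploymentStatus payload copied from the debug output above.
RESPONSE = """
{
  "data": {
    "app": {
      "deploymentStatus": {
        "id": "1a38eaf1-1d4c-f648-5bfa-328595a21420",
        "inProgress": true,
        "status": "running",
        "successful": false,
        "description": "Deployment is running",
        "version": 1,
        "desiredCount": 1,
        "placedCount": 0,
        "healthyCount": 0,
        "unhealthyCount": 0,
        "allocations": []
      }
    }
  }
}
"""


def summarize(status):
    """Give a rough, unofficial reading of a deploymentStatus object."""
    if status["successful"]:
        return "deployment succeeded"
    if status["desiredCount"] > 0 and status["placedCount"] == 0 and not status["allocations"]:
        # Assumption: nothing has been scheduled onto a host yet.
        return "no instances placed yet"
    if status["unhealthyCount"] > 0:
        return "instances placed but failing checks"
    return "in progress"


status = json.loads(RESPONSE)["data"]["app"]["deploymentStatus"]
print(summarize(status))  # no instances placed yet
```

Read that way, the deploy is not failing a health check. It never gets as far as starting an instance.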

I also destroyed my app and re-deployed with the same result - I was hoping that would work. At this point I think I need the Fly team to look into this issue. I’ve wasted several hours on this already :sweat:

Hey Fly team, if anyone can help me with this, it would be greatly appreciated. GitHub support reached out and told me they can’t do anything to help me:

I also noticed that the step failing was running on a third-party action https://github.com/superfly/flyctl-actions and we are not the maintainers of the action. I’ll recommend you open an issue in the third-party repository.

Anything I can try to troubleshoot this issue?

Still struggling with this and not seeing a way forward. One interesting thing I noticed is the Monitoring dashboard is totally blank during a deploy. No logs ever come. App is stuck in Pending state.

I’ve been having issues with an existing Remix app, and after trying everything else, I was finally able to deploy again by using the --local-only flag on fly deploy. If you look at Deploying to Fly via GitHub Action failing - #19 by michael, it appears there are some issues with deploying Remix apps right now.

Thanks for the reply, emiljt. Unfortunately, I tried that and got the same result. “Deployment is running” but never completes or shows any logs. I’m glad it worked for you though. I’ll check out that thread again and see if any of the fixes discussed there work.

Hi @grahamhagenah, it looks like your app was affected by a bug on our end that was causing a small handful of apps with volumes on two specific hosts (one in ord, one in iad) to get stuck and fail to deploy. It took a few days to track down the root cause of this issue, but we pushed a fix a few hours ago that should have unblocked your app’s deployment. Sorry for the trouble, I appreciate the report and all the debugging effort you put into this!

Thank you for the update, wjordan! It’s a relief to know I’m not just foolish or crazy.

Hi @wjordan ! I’m having this same issue but my machine is in sjc.
I have one app in sea and the deployment works fine.
I have one app in sjc, and there the GitHub Action hangs on “deploying”. A deployment usually takes 1-2 minutes to finish; I’ve already restarted it a couple of times, and each time it ran for more than 8 minutes before I cancelled it.

Any insights?
This used to work literally two days ago … If I push the same code to the app in sea region it works.