Fly deploy fails waiting for health checks suddenly - no configuration changes since last deploy

cooper · March 12, 2024, 1:32am

fly deploy has suddenly started failing, even though I haven’t made any configuration changes.

Rails 7.1 application using LiteFS. Yesterday deploys were working fine; today, with only minor code changes, deploys are failing.

In fly logs, the server starts up just fine, but no /up call is ever made. If I manually restart the machines, the /up calls are made. If I then try to deploy again, the deploy still fails.

I see no errors that would indicate the server started improperly. The startup appears the same between restarting the machines successfully and the failed deploy.

I’ve tried fly deploy --local-only and fly deploy --remote-only without success. Both hang in the same spot.

The error shown:

-------
 ✖ [1/3] Machine {machine_Id} [app] update failed: timeout reached waiting for health checks to pass for machine {machine_Id}: failed to get VM {machine_Id}: Get…
   [2/3] Waiting for job
   [3/3] Waiting for job
-------
Checking DNS configuration for critical.fly.dev
Error: timeout reached waiting for health checks to pass for machine {machine_Id}: failed to get VM {machine_Id}: Get "https://api.machines.dev/v1/apps/{appname}/machines/{machine_Id}": net/http: request canceled
Your machine never reached the state "%s".

You can try increasing the timeout with the --wait-timeout flag

Is there anything else I can try? I’d really like to continue using fly.io. When it works, it’s fast and fantastic, but when it’s not working I’m lost compared to AWS.

More info:

# fly.toml app configuration file generated for {app_name} on 2024-02-29T20:17:11-06:00
#
# See https://fly.io/docs/reference/configuration/ for information about how to use this file.
#

app = 'critical'
primary_region = 'ord'
console_command = '/rails/bin/rails console'

[build]

[[mounts]]
  source = 'data'
  destination = '/data'

[http_service]
  internal_port = 3000
  force_https = true
  auto_stop_machines = false
  auto_start_machines = true
  min_machines_running = 0
  processes = ['app']

[checks]
  [checks.status]
    port = 3000
    type = 'http'
    interval = '10s'
    timeout = '2s'
    grace_period = '5s'
    method = 'GET'
    path = '/up'
    protocol = 'http'
    tls_skip_verify = false

    [checks.status.headers]
      X-Forwarded-Proto = 'https'

[[vm]]
  memory = '1gb'
  cpu_kind = 'shared'
  cpus = 1

[[statics]]
  guest_path = '/rails/public'
  url_prefix = '/'

2024-03-12T01:24:52Z app[48ed17eb2ed728] ord [info]waiting for signal or subprocess to exit
2024-03-12T01:24:52Z app[48ed17eb2ed728] ord [info]level=INFO msg="connected to cluster, ready"
2024-03-12T01:24:52Z app[48ed17eb2ed728] ord [info]level=INFO msg="proxy server listening on: http://localhost:3000"
2024-03-12T01:24:54Z app[48ed17eb2ed728] ord [info]W, [2024-03-12T01:24:54.156940 #320]  WARN -- : You are running SQLite in production, this is generally not recommended. You can disable this warning by setting "config.active_record.sqlite3_production_warning=false".
2024-03-12T01:24:54Z app[48ed17eb2ed728] ord [info]=> Booting Puma
2024-03-12T01:24:54Z app[48ed17eb2ed728] ord [info]=> Rails 7.1.3.2 application starting in production
2024-03-12T01:24:54Z app[48ed17eb2ed728] ord [info]=> Run `bin/rails server --help` for more startup options
2024-03-12T01:24:56Z app[48ed17eb2ed728] ord [info]W, [2024-03-12T01:24:55.062531 #314]  WARN -- : You are running SQLite in production, this is generally not recommended. You can disable this warning by setting "config.active_record.sqlite3_production_warning=false".
2024-03-12T01:24:56Z app[48ed17eb2ed728] ord [info]Puma starting in single mode...
2024-03-12T01:24:56Z app[48ed17eb2ed728] ord [info]* Puma version: 6.4.2 (ruby 3.2.1-p31) ("The Eagle of Durango")
2024-03-12T01:24:56Z app[48ed17eb2ed728] ord [info]*  Min threads: 5
2024-03-12T01:24:56Z app[48ed17eb2ed728] ord [info]*  Max threads: 5
2024-03-12T01:24:56Z app[48ed17eb2ed728] ord [info]*  Environment: production
2024-03-12T01:24:56Z app[48ed17eb2ed728] ord [info]*          PID: 314
2024-03-12T01:24:56Z app[48ed17eb2ed728] ord [info]* Listening on http://0.0.0.0:3001
2024-03-12T01:24:56Z app[48ed17eb2ed728] ord [info]Use Ctrl-C to stop

mxs2019 · March 12, 2024, 3:43am

I’m having the same issue. Were you able to figure this out?

cooper · March 12, 2024, 4:01am

I haven’t figured it out yet. Upgraded to paid support and waiting on a response (it’s after hours now, I’m sure).

aschiavo · March 12, 2024, 8:59am

Hi @cooper! Is it possible that your app is listening on a different port than the one expected by the fly.toml definition?

From the config file, it looks like the app is expected to be listening on port 3000 to reply to the /up health check. In the app logs, however, it seems that Puma is listening on port 3001 and the LiteFS HTTP proxy is listening on port 3000.

Please let me know if this helps to resolve your issue, thanks!
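
To compare the ports reported in the logs with internal_port in fly.toml, the listening ports can be pulled out of saved log output. This is a minimal shell sketch, not an official tool: the `listening_ports` function name and the `saved-logs.txt` file are hypothetical, and the pattern only covers the two "listening on" line shapes that appear in the logs above.

```shell
# Hypothetical helper: read log lines on stdin and print the port of every
# "listening on <url>" message, one port per line.
listening_ports() {
  # First grep keeps only the "listening on ... :PORT" fragment (case-insensitive),
  # second grep keeps the trailing digits, i.e. the port number.
  grep -ioE 'listening on:? https?://[^:]+:[0-9]+' | grep -oE '[0-9]+$'
}

# Example usage, assuming the log output was saved to a file first:
#   listening_ports < saved-logs.txt
```

Run against the two "listening on" lines in the logs above, this prints 3000 (the LiteFS proxy) and 3001 (Puma). Which of the two the health check should target depends on how the proxy is set up.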

Common-Ground · March 12, 2024, 10:15am

Hi all, encountering the same issues here.
Health checks go red when a deploy starts, and the deploy then fails when the health check timeout is reached.

No recent configuration change, only trying to deploy a new docker image.

Also, why does the configured port look like this? 8_880
My configuration is definitely set up for port number 8880.

Common-Ground · March 12, 2024, 11:00am

@cooper have you heard from support yet?

It looks like the machine health checks are going red for the machine being updated.
It then never becomes healthy again, and the process times out after 5m.

All ports are properly configured since the health checks are green when starting the update.

Fly team, what’s up???

cooper · March 12, 2024, 12:01pm

I’m using the default dockerfile that is built from the fly github repo ‘dockerfile-rails’. This sets the PORT ENV as 3001 (presumably for litefs), and the EXPOSE as 3000.

This has been working prior to yesterday.

Looks like others may be experiencing the same issue. It’s almost as if the fly proxy isn’t pinging the health check properly right now. I see the server start up just fine, but don’t see any good/failed attempts by the fly proxy to hit /up in the logs.

rubys · March 12, 2024, 12:04pm

Reproduction instructions, on a machine that has Rails 7.1 and Postgres installed, with a bash-like shell:

rm -rf demo
mkdir demo
cd demo
rails new . --database=postgresql
echo 'Rails.application.routes.draw { root "rails/welcome#index" }' >> config/routes.rb
fly launch --yes --name demo-$USER-$RANDOM
fly deploy
fly apps open
sleep 5
fly deploy

The first deploy succeeds. The second fails:

Updating existing machines in 'demo-rubys-23452' with rolling strategy

-------
 ✖ [1/2] Machine 48e5541f7201e8 [app] update failed: timeout reached waiting for health checks to pass for mach…
   [2/2] Waiting for job
-------
Checking DNS configuration for demo-rubys-23452.fly.dev
Error: timeout reached waiting for health checks to pass for machine 48e5541f7201e8: failed to get VM 48e5541f7201e8: Get "https://api.machines.dev/v1/apps/demo-rubys-23452/machines/48e5541f7201e8": net/http: request canceled
Your machine never reached the state "%s".

You can try increasing the timeout with the --wait-timeout flag

Common-Ground · March 12, 2024, 12:06pm

@cooper I found exactly the same issue.
I can scale up machines and that works, but it seems deployments cannot ping the right health check port.

I also emailed support. I cannot deploy any app right now, and it’s a major issue for my team.

aschiavo · March 12, 2024, 12:24pm

Hi @cooper, my apologies for the incorrect advice. I’ll continue investigating and update here with my findings.

rubys · March 12, 2024, 3:14pm

A fix has been applied; please retry deployments and let us know how it goes.

cooper · March 12, 2024, 3:34pm

I can confirm the issue is resolved. Thanks for the fix.

Common-Ground · March 12, 2024, 3:41pm

Thanks Fly team for the swift update

jamal · March 13, 2024, 7:55pm

Hi @rubys, I am still facing this issue when initially deploying a new app via flyctl deploy immediately after flyctl launch. The logs show my app started successfully, but the healthchecks are not firing. If I run flyctl deploy again, after the initial failure, then the health checks will function properly and the command will succeed.

The problem for me is that I have these commands running in Github Actions to automate PR preview deployments and I have some initial app bootstrapping that needs to run only once when the app is created. Since the deploy doesn’t succeed, my app ends up in a broken state which I have to resolve manually.

My PR deployments were working fine up until a new PR was created yesterday.
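
Since a second flyctl deploy succeeds after the initial failure, one possible stopgap for the CI job is to retry the deploy and run the one-time bootstrapping only after a deploy has succeeded. This is a minimal sketch under that assumption, not a fix for the underlying issue: the `retry` function and the `bootstrap_app` placeholder are hypothetical names, and the attempt count and delay are arbitrary.

```shell
# Hypothetical helper: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
retry() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    # Success: stop retrying and report success to the caller.
    if "$@"; then
      return 0
    fi
    # Out of attempts: report failure so the CI step fails visibly.
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example usage in a CI step (bootstrap_app stands in for the one-time setup):
#   retry 2 10 flyctl deploy && bootstrap_app
```

Chaining the bootstrapping with `&&` keeps it from running against an app whose deploy never succeeded.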

aschiavo · March 14, 2024, 12:41pm

Hi @jamal! Would it be possible to get the application name to help troubleshoot your issue? If you prefer to keep this information confidential, you can email support@fly.io. Thanks!

jamal · March 14, 2024, 3:47pm

Hi @aschiavo, I just created a new PR to reproduce the issue with a newly created app. The initial deployment has just failed, so you can see the healthchecks are not firing. The app name is whisker-ocr-pr-30.

aschiavo · March 14, 2024, 3:56pm

Thanks @jamal ! I’ll do some research and let you know.

stephentgrammer · March 14, 2024, 4:08pm

Same issue for us. It was fixed 2 days ago, but now it is failing again.

aschiavo · March 14, 2024, 4:18pm

Hi @stephentgrammer! Are you facing the issue with fly deploy immediately after fly launch, or the original issue from this topic, where fly deploy fails on an existing app? Thanks!

stephentgrammer · March 14, 2024, 4:33pm

fly deploy on an existing app! thanks.