Fly deploy suddenly fails even though no configuration changes were made.
This is a Rails 7.1 application using LiteFS. Yesterday deploys were working fine; today, with only minor code changes, deploys are failing.
In fly logs, the server starts up just fine, but no /up call is ever made. If I manually restart the machines, the /up calls are made. If I then try to deploy again, the deploy fails the same way.
I see no errors that would indicate the server started improperly, and the startup looks identical between a successful manual restart and a failed deploy.
I’ve tried fly deploy --local-only and fly deploy --remote-only without success. Both hang in the same spot.
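For reference, the check state can also be inspected directly while the deploy is stuck; something along these lines (app name taken from the config below, machine ID is a placeholder, and exact flags can vary between flyctl versions):
# Show health-check status for each machine in the app
fly checks list -a critical
# Inspect the machine the deploy is stuck on
fly machine status <machine_id> -a critical
# Tail logs for just that machine to see whether /up is ever requested
fly logs -a critical --instance <machine_id>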
The error shown:
-------
✖ [1/3] Machine {machine_Id} [app] update failed: timeout reached waiting for health checks to pass for machine {machine_Id}: failed to get VM {machine_Id}: Get…
[2/3] Waiting for job
[3/3] Waiting for job
-------
Checking DNS configuration for critical.fly.dev
Error: timeout reached waiting for health checks to pass for machine {machine_Id}: failed to get VM {machine_Id}: Get "https://api.machines.dev/v1/apps/{appname}/machines/{machine_Id}": net/http: request canceled
Your machine never reached the state "%s".
You can try increasing the timeout with the --wait-timeout flag
Is there anything else I can try? I’d really like to continue using fly.io. When it works, it’s fast and fantastic, but when it’s not working, I feel far more lost than I would on AWS.
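For reference, the --wait-timeout suggestion from the error message would look like this; whether the flag takes plain seconds or a duration string depends on the flyctl version, and 300/5m are just example values:
fly deploy --wait-timeout 300   # older flyctl: seconds
fly deploy --wait-timeout 5m    # newer flyctl: duration string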
More info:
# fly.toml app configuration file generated for {app_name} on 2024-02-29T20:17:11-06:00
#
# See https://fly.io/docs/reference/configuration/ for information about how to use this file.
#
app = 'critical'
primary_region = 'ord'
console_command = '/rails/bin/rails console'
[build]

[[mounts]]
  source = 'data'
  destination = '/data'

[http_service]
  internal_port = 3000
  force_https = true
  auto_stop_machines = false
  auto_start_machines = true
  min_machines_running = 0
  processes = ['app']

[checks]
  [checks.status]
    port = 3000
    type = 'http'
    interval = '10s'
    timeout = '2s'
    grace_period = '5s'
    method = 'GET'
    path = '/up'
    protocol = 'http'
    tls_skip_verify = false
    [checks.status.headers]
      X-Forwarded-Proto = 'https'

[[vm]]
  memory = '1gb'
  cpu_kind = 'shared'
  cpus = 1

[[statics]]
  guest_path = '/rails/public'
  url_prefix = '/'
2024-03-12T01:24:52Z app[48ed17eb2ed728] ord [info]waiting for signal or subprocess to exit
2024-03-12T01:24:52Z app[48ed17eb2ed728] ord [info]level=INFO msg="connected to cluster, ready"
2024-03-12T01:24:52Z app[48ed17eb2ed728] ord [info]level=INFO msg="proxy server listening on: http://localhost:3000"
2024-03-12T01:24:54Z app[48ed17eb2ed728] ord [info]W, [2024-03-12T01:24:54.156940 #320] WARN -- : You are running SQLite in production, this is generally not recommended. You can disable this warning by setting "config.active_record.sqlite3_production_warning=false".
2024-03-12T01:24:54Z app[48ed17eb2ed728] ord [info]=> Booting Puma
2024-03-12T01:24:54Z app[48ed17eb2ed728] ord [info]=> Rails 7.1.3.2 application starting in production
2024-03-12T01:24:54Z app[48ed17eb2ed728] ord [info]=> Run `bin/rails server --help` for more startup options
2024-03-12T01:24:56Z app[48ed17eb2ed728] ord [info]W, [2024-03-12T01:24:55.062531 #314] WARN -- : You are running SQLite in production, this is generally not recommended. You can disable this warning by setting "config.active_record.sqlite3_production_warning=false".
2024-03-12T01:24:56Z app[48ed17eb2ed728] ord [info]Puma starting in single mode...
2024-03-12T01:24:56Z app[48ed17eb2ed728] ord [info]* Puma version: 6.4.2 (ruby 3.2.1-p31) ("The Eagle of Durango")
2024-03-12T01:24:56Z app[48ed17eb2ed728] ord [info]* Min threads: 5
2024-03-12T01:24:56Z app[48ed17eb2ed728] ord [info]* Max threads: 5
2024-03-12T01:24:56Z app[48ed17eb2ed728] ord [info]* Environment: production
2024-03-12T01:24:56Z app[48ed17eb2ed728] ord [info]* PID: 314
2024-03-12T01:24:56Z app[48ed17eb2ed728] ord [info]* Listening on http://0.0.0.0:3001
2024-03-12T01:24:56Z app[48ed17eb2ed728] ord [info]Use Ctrl-C to stop
Hi @cooper! Could it be that your app is listening on a different port than the one expected by the fly.toml definition?
From the config file, the app is expected to be listening on port 3000 to answer the /up health check, but in the app logs Puma is listening on port 3001 while the LiteFS HTTP proxy is the one listening on port 3000.
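To double-check the wiring, you could probe both ports from inside the machine, something like this (a sketch; it assumes curl is available in the image and uses the ports shown in your logs):
# Hit the LiteFS HTTP proxy, the port fly.toml's check targets
fly ssh console -a critical -C "curl -si http://localhost:3000/up"
# Hit Puma directly on its own port
fly ssh console -a critical -C "curl -si http://localhost:3001/up"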
Please let me know if this helps resolve your issue. Thanks!
I’m using the default Dockerfile generated from the fly GitHub repo ‘dockerfile-rails’. It sets the PORT env var to 3001 (presumably for LiteFS) and EXPOSE to 3000.
This was working up until yesterday.
Looks like others may be experiencing the same issue. It’s almost as if the fly proxy isn’t pinging the health check properly right now. I can see the server start up just fine, but I don’t see any attempts, successful or failed, by the fly proxy to hit /up in the logs.
Updating existing machines in 'demo-rubys-23452' with rolling strategy
-------
✖ [1/2] Machine 48e5541f7201e8 [app] update failed: timeout reached waiting for health checks to pass for mach…
[2/2] Waiting for job
-------
Checking DNS configuration for demo-rubys-23452.fly.dev
Error: timeout reached waiting for health checks to pass for machine 48e5541f7201e8: failed to get VM 48e5541f7201e8: Get "https://api.machines.dev/v1/apps/demo-rubys-23452/machines/48e5541f7201e8": net/http: request canceled
Your machine never reached the state "%s".
You can try increasing the timeout with the --wait-timeout flag
Hi @rubys, I am still facing this issue when deploying a new app via flyctl deploy immediately after flyctl launch. The logs show my app started successfully, but the health checks never fire. If I run flyctl deploy again after the initial failure, the health checks work properly and the command succeeds.
The problem for me is that I run these commands from GitHub Actions to automate PR preview deployments, and some initial app bootstrapping needs to run exactly once when the app is created. Since the deploy doesn’t succeed, my app ends up in a broken state that I have to resolve manually.
My PR deployments were working fine up until a new PR was created yesterday.
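A possible stopgap on my end is to have the workflow retry the deploy once, since the second attempt reliably passes the checks (a rough sketch of the CI step, not something I want to keep long-term):
# Retry once if the first attempt times out waiting on health checks
flyctl deploy --remote-only || flyctl deploy --remote-only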
Hi @jamal! Would it be possible to get the application name to help troubleshoot your issue? If you prefer to keep this information confidential, you can email support@fly.io. Thanks!
Hi @aschiavo, I just created a new PR to reproduce the issue with a newly created app. The initial deployment has just failed, so you can see the health checks are not firing. The app name is whisker-ocr-pr-30.
Hi @stephentgrammer! Are you facing the issue with fly deploy immediately after fly launch, or the original issue in this topic, where fly deploy fails on an existing app? Thanks!