health-check failing during blue-green deployment


When I try to deploy using the blue-green strategy, I get the following error:

 ✔ release_command 1857434c1406e8 completed successfully
Updating existing machines in 'prod' with bluegreen strategy

Verifying if app can be safely deployed 

Creating green machines
  Created machine 7811e4eb593e98 [app]
  Created machine d8d9242a2e6328 [app]

Waiting for all green machines to start
  Machine 7811e4eb593e98 [app] - started
  Machine d8d9242a2e6328 [app] - started

Waiting for all green machines to be healthy
  Machine 7811e4eb593e98 [app] - 0/1 passing
  Machine d8d9242a2e6328 [app] - 0/1 passing
Deployment failed after error: could not get all green machines to be healthy: wait timeout

Rolling back failed deployment
Checking DNS configuration for
Error: could not get all green machines to be healthy: wait timeout
Your machine never reached the state "%s".

You can try increasing the timeout with the --wait-timeout flag

Here is my fly.toml configuration:

app = "prod"
primary_region = "fra"
kill_signal = "SIGTERM"

[deploy]
  release_command = "/app/bin/migrate"
  strategy = "bluegreen"

[env]
  DNS_CLUSTER_QUERY = "prod.internal"
  PHX_HOST = "website.url"
  PORT = "8080"
  RELEASE_COOKIE = "cookie"

[http_service]
  internal_port = 8080
  force_https = true
  auto_stop_machines = false
  auto_start_machines = true
  min_machines_running = 2
  processes = ["app"]

  [http_service.concurrency]
    type = "connections"
    hard_limit = 1000
    soft_limit = 1000

  [[http_service.checks]]
    interval = "5s"
    grace_period = "20s"
    method = "GET"
    path = "/check"
    protocol = "http"
    port = 8080
    timeout = "5s"

Any idea why it is failing?

When I run fly checks list, it shows:

Health Checks for prod
  NAME                      | STATUS   | MACHINE        | LAST UPDATED | OUTPUT                       
  bg_deployments_http       | critical | 7811e4eb593e98 | 4m40s ago    | connect: connection refused  
  servicecheck-00-http-8080 | warning  | 7811e4eb593e98 | 4m44s ago    | waiting for status update    
  bg_deployments_http       | critical | d8d9242a2e6328 | 4m43s ago    | connect: connection refused  
  servicecheck-00-http-8080 | warning  | d8d9242a2e6328 | 3m58s ago    | waiting for status update    

Also, if I check the logs of the started machines, they seem to be up and running:

fly logs -a prod -i e286013f95d5d8

Waiting for logs...

2024-05-02T12:29:32.100 runner[e286013f95d5d8] fra [info] Pulling container image

2024-05-02T12:29:33.199 runner[e286013f95d5d8] fra [info] Successfully prepared image (1.098938948s)

2024-05-02T12:29:34.022 runner[e286013f95d5d8] fra [info] Configuring firecracker

2024-05-02T12:29:34.552 app[e286013f95d5d8] fra [info] [ 0.156665] PCI: Fatal: No config space access function found

2024-05-02T12:29:34.800 app[e286013f95d5d8] fra [info] INFO Starting init (commit: c1e2693b)...

2024-05-02T12:29:34.868 app[e286013f95d5d8] fra [info] INFO Preparing to run: `/app/bin/server` as nobody

2024-05-02T12:29:34.881 app[e286013f95d5d8] fra [info] INFO [fly api proxy] listening at /.fly/api

2024-05-02T12:29:34.901 app[e286013f95d5d8] fra [info] 2024/05/02 12:29:34 INFO SSH listening listen_address=[fdaa:2:be18:a7b:caca:5ed2:149a:2]:22 dns_server=[fdaa::3]:53

2024-05-02T12:29:34.935 runner[e286013f95d5d8] fra [info] Machine created and started in 2.998s

2024-05-02T12:29:39.266 app[e286013f95d5d8] fra [info] 12:29:39.265 [info] no parent found, :ignore

2024-05-02T12:29:39.368 app[e286013f95d5d8] fra [info] 12:29:39.368 [info] Oban running in primary region. Activated.

2024-05-02T12:29:39.370 app[e286013f95d5d8] fra [info] 12:29:39.369 [info] Detected running on primary. No local replication to track.

2024-05-02T12:29:39.376 app[e286013f95d5d8] fra [info] 12:29:39.376 [info] Running NexusWeb.Endpoint with cowboy 2.10.0 at :::8080 (http)

2024-05-02T12:29:39.389 app[e286013f95d5d8] fra [info] 12:29:39.384 [info] Access NexusWeb.Endpoint at https://website.url

2024-05-02T12:29:39.390 app[e286013f95d5d8] fra [info] 12:29:39.389 [info] Discovered node :"prod-01HWWMFMGSQT7G2VRQZYDNXWDY@fdaa:2:be18:a7b:caca:14b:e986:2" in region fra

2024-05-02T12:29:39.886 app[e286013f95d5d8] fra [info] WARN Reaped child process with pid: 374 and signal: SIGUSR1, core dumped? false

2024-05-02T12:29:42.913 app[e286013f95d5d8] fra [info] 12:29:42.912 [info] tzdata release in place is from a file last modified Fri, 22 Oct 2021 02:20:47 GMT. Release file on server was last modified Thu, 01 Feb 2024 18:40:48 GMT.

2024-05-02T12:29:44.325 app[e286013f95d5d8] fra [info] 12:29:44.325 [info] Tzdata has updated the release from 2021e to 2024a

Any advice would be much appreciated. Thanks.

I believe it’s a Phoenix app. I have two guesses!

  1. Your health-check port should not be 8080. Try removing port = 8080 from the check section of your fly.toml. Our proxy verifies the “external connection” (meaning port 80 for http and 443 for https)
  2. Sometimes Plug.SSL can be :melting_face:. See: Phoenix http health checks - #2 by andykent

I believe there are also other examples here in the community of folks adding health checks to their Phoenix apps; feel free to search for more.
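For guess 2, one common workaround is to answer the health-check path before Plug.SSL runs, so the proxy’s plain-HTTP probe on the internal port gets a 200 instead of being redirected to HTTPS. A minimal sketch, assuming a standard Phoenix endpoint; the HealthCheck module name is illustrative, and /check matches the path in the fly.toml above:

```elixir
# A tiny plug that answers the health-check route directly.
# Add `plug HealthCheck` in your endpoint module *above* `plug Plug.SSL`
# (or before force_ssl takes effect), so the plain-HTTP probe is not
# answered with a 301 redirect, which fails the check.
defmodule HealthCheck do
  import Plug.Conn

  def init(opts), do: opts

  # Short-circuit requests to the health-check path with a 200.
  def call(%Plug.Conn{request_path: "/check"} = conn, _opts) do
    conn
    |> send_resp(200, "ok")
    |> halt()
  end

  # Everything else continues down the plug pipeline.
  def call(conn, _opts), do: conn
end
```

With something like this in place, the check can keep protocol = "http" and path = "/check", while force_https still applies to real traffic.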


This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.