Deploy Health Check fails for Elixir Phoenix app

I have a fairly straightforward Phoenix app that I am trying to deploy. The problem is likely simple but I need some rubber duck help. Here is the failure:

==> Failure #1
Instance
  ID            = <redacted>
  Task          =
  Version       = 24
  Region        = sea
  Desired       = run
  Status        = running
  Health Checks = 1 total, 1 critical
  Restarts      = 2
  Created       = 4m35s ago
Recent Events
TIMESTAMP            TYPE             MESSAGE
2021-09-26T14:47:32Z Received         Task received by client
2021-09-26T14:47:32Z Task Setup       Building Task Directory
2021-09-26T14:47:35Z Started          Task started by client
2021-09-26T14:49:22Z Restart Signaled healthcheck: check "f557ee1d070fbd1796f8c0685877b786" unhealthy
2021-09-26T14:49:31Z Terminated       Exit Code: 0
2021-09-26T14:49:31Z Restarting       Task restarting in 1.007520679s
2021-09-26T14:49:38Z Started          Task started by client
2021-09-26T14:51:24Z Restart Signaled healthcheck: check "f557ee1d070fbd1796f8c0685877b786" unhealthy
2021-09-26T14:51:34Z Terminated       Exit Code: 0
2021-09-26T14:51:34Z Restarting       Task restarting in 1.165986767s
2021-09-26T14:51:40Z Started          Task started by client
Recent Logs
2021-09-26T14:49:29Z [info] Health check status changed 'critical' => 'passing'
2021-09-26T14:49:36Z [info] Starting instance
2021-09-26T14:49:36Z [info] Configuring virtual machine
2021-09-26T14:49:36Z [info] Pulling container image
2021-09-26T14:49:37Z [info] Unpacking image
2021-09-26T14:49:37Z [info] Preparing kernel init
2021-09-26T14:49:37Z [info] Configuring firecracker
2021-09-26T14:49:37Z [info] Starting virtual machine
2021-09-26T14:49:38Z [info] Starting init (commit: 50ffe20)...
2021-09-26T14:49:38Z [info] Preparing to run: `bin/shrimp start` as nobody
2021-09-26T14:49:38Z [info] 2021/09/26 14:49:38 listening on [<redacted>]:22 (DNS: [fdaa::3]:53)
2021-09-26T14:49:39Z [info] Reaped child process with pid: 547, exit code: 0
2021-09-26T14:49:41Z [info] Reaped child process with pid: 568 and signal: SIGUSR1, core dumped? false
2021-09-26T14:51:24Z [info] Shutting down virtual machine
2021-09-26T14:51:24Z [info] Sending signal SIGTERM to main child process w/ PID 510
2021-09-26T14:51:25Z [info] Reaped child process with pid: 570 and signal: SIGUSR1, core dumped? false
2021-09-26T14:51:30Z [info] Health check status changed 'critical' => 'passing'
2021-09-26T14:51:39Z [info] Starting instance
2021-09-26T14:51:39Z [info] Configuring virtual machine
2021-09-26T14:51:39Z [info] Pulling container image
2021-09-26T14:51:39Z [info] Unpacking image
2021-09-26T14:51:39Z [info] Preparing kernel init
2021-09-26T14:51:40Z [info] Configuring firecracker
2021-09-26T14:51:40Z [info] Starting virtual machine
2021-09-26T14:51:40Z [info] Starting init (commit: 50ffe20)...
2021-09-26T14:51:40Z [info] Preparing to run: `bin/shrimp start` as nobody
2021-09-26T14:51:40Z [info] 2021/09/26 14:51:40 listening on [<redacted>]:22 (DNS: [fdaa::3]:53)
2021-09-26T14:51:41Z [info] Reaped child process with pid: 547, exit code: 0
2021-09-26T14:51:43Z [info] Reaped child process with pid: 568 and signal: SIGUSR1, core dumped? false
2021-09-26T14:51:55Z [error] Health check status changed 'warning' => 'critical'

From reviewing the Discourse topics, it seems this is most frequently caused by mismatched PORT settings between the Dockerfile, fly.toml, and the app’s config/runtime.exs, but all of those agree on PORT=4001.

Here is the fly.toml:

# fly.toml file generated for <redacted> on 2021-09-24T00:56:04Z

app = "<redacted>"

kill_signal = "SIGTERM"
kill_timeout = 5
processes = []

[deploy]
#  release_command = "/app/bin/shrimp eval Shrimp.Release.migrate"

[env]

[experimental]
  allowed_public_ports = []
  auto_rollback = true

[[services]]
  http_checks = []
  internal_port = 4001
  processes = ["app"]
  protocol = "tcp"
  script_checks = []

  [services.concurrency]
    hard_limit = 25
    soft_limit = 20
    type = "connections"

  [[services.ports]]
    handlers = ["http"]
    port = 80

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443

  [[services.tcp_checks]]
    grace_period = "30s"
    interval = "15s"
    restart_limit = 6
    timeout = "2s"

I commented out the release_command to see if I could just get the app to start, but it failed either way (commented or uncommented).
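For context, Shrimp.Release.migrate is meant to be the usual Phoenix release-migration helper, roughly like the sketch below (written from memory, not necessarily my exact file):

defmodule Shrimp.Release do
  # Rough sketch of the release task module that
  # `bin/shrimp eval Shrimp.Release.migrate` would invoke.
  @app :shrimp

  def migrate do
    load_app()

    for repo <- repos() do
      {:ok, _, _} = Ecto.Migrator.with_repo(repo, &Ecto.Migrator.run(&1, :up, all: true))
    end
  end

  defp repos, do: Application.fetch_env!(@app, :ecto_repos)

  defp load_app, do: Application.load(@app)
end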

I also ran LOG_LEVEL=debug fly deploy, but I won’t include the output here because it is overwhelming. I’m happy to grep through it for anything specific.

This app does use an :ets table, but even leaving that GenServer out of the supervision tree doesn’t prevent the SIGUSR1 signal from ending the child processes (a simplified sketch of that GenServer is below). Additionally, the app isn’t sending any exception traces to Sentry.
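For reference, the table-owning process is a small GenServer roughly like this (module and table names here are placeholders, not my real ones):

defmodule Shrimp.Cache do
  # Placeholder name; simplified sketch of the GenServer that owns the :ets table.
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    # The table lives and dies with this process.
    table = :ets.new(:shrimp_cache, [:named_table, :set, :public, read_concurrency: true])
    {:ok, table}
  end
end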

I had a prior deploy of another app with similar issues because the BEAM memory footprint was bumping up against the 256MB limit, so I scaled the memory on this VM to 2048MB, but that did not resolve the issue.

Lastly, I know it is still Sunday in most of the world, so please don’t look at this until normal work hours; this is just the time I have to post. Thanks in advance for any help. I’m happy to provide any other info.

Can you confirm your config/runtime.exs is configured correctly? It looks like the Endpoint is either not starting or is misconfigured.

Can you also paste in the contents of fly logs when running the deploy command?

Sure.

import Config

if config_env() == :prod do
  secret_key_base =
    System.get_env("SECRET_KEY_BASE") ||
      raise """
            environment variable SECRET_KEY_BASE is missing.
            You can generate one by calling: mix phx.gen.secret
      """
  lv_signing_salt =
    System.get_env("LV_SIGNING_SALT") ||
      raise """
            environment variable LV_SIGNING_SALT is missing.
            You can generate one by calling: mix phx.gen.secret
      """

  app_name =
    System.get_env("FLY_APP_NAME") ||
      raise "FLY_APP_NAME not available"

  config :elixir, :time_zone_database, Tzdata.TimeZoneDatabase

  config :shrimp, Shrimp.Endpoint,
    server: true,
    url: [host: "#{app_name}.fly.dev", port: 80],
    http: [
      port: String.to_integer(System.get_env("PORT") || "4001"),
      # IMPORTANT: support IPv6 Addresses
      transport_options: [socket_opts: [:inet6]]
    ],
    secret_key_base: secret_key_base,
    live_view: [signing_salt: lv_signing_salt]

  database_url =
    System.get_env("DATABASE_URL") ||
      raise """
            environment variable DATABASE_URL is missing.
            For example: ecto://USER:PASS@HOST/DATABASE
      """

  config :shrimp, Shrimp.Repo,
    url: database_url,
    # IMPORTANT: Or it won't find the DB server
    socket_options: [:inet6],
    pool_size: String.to_integer(System.get_env("POOL_SIZE") || "10")

  config :phoenix, :json_library, Jason

  config :ex_twilio,
    account_sid: System.get_env("TWILIO_ACCOUNT_SID"),
    auth_token: System.get_env("TWILIO_AUTH_TOKEN")

  dsn =
    System.get_env("SENTRY_DSN") ||
      raise """
            environment variable SENTRY_DSN is missing.
      """

  config :sentry,
    dsn: dsn,
    environment_name: :prod,
    enable_source_code_context: true,
    root_source_code_path: File.cwd!(),
    tags: %{
      env: "production"
    },
    included_environments: [:prod]

  config :logger,
    backends: [Sentry.LoggerBackend]
end

Here are the relevant lines from the recent deploy logs:

2021-09-27T16:19:35Z [info] Starting instance
2021-09-27T16:19:35Z [info] Configuring virtual machine
2021-09-27T16:19:35Z [info] Pulling container image
2021-09-27T16:19:36Z [info] Unpacking image
2021-09-27T16:19:36Z [info] Preparing kernel init
2021-09-27T16:19:36Z [info] Configuring firecracker
2021-09-27T16:19:36Z [info] Starting virtual machine
2021-09-27T16:19:36Z [info] Starting init (commit: 50ffe20)...
2021-09-27T16:19:36Z [info] Preparing to run: `bin/shrimp start` as nobody
2021-09-27T16:19:36Z [info] 2021/09/27 16:19:36 listening on [<redacted>]:22 (DNS: [fdaa::3]:53)
2021-09-27T16:19:37Z [info] Reaped child process with pid: 548, exit code: 0
2021-09-27T16:19:39Z [info] Reaped child process with pid: 569 and signal: SIGUSR1, core dumped? false
2021-09-27T16:19:48Z [error] Health check status changed 'warning' => 'critical'

I suspect that I might have the DATABASE_URL misconfigured now. I’ll report back if changing it resolves the problem.

I think I set the DATABASE_URL correctly, but I notice that I cannot detach this app from the DB cluster; running fly pg detach results in this:

Detaching...⡿ 
Error An unknown error occured.

If I run fly pg db list <cluster-name>, I do see this app’s configured DB listed alongside the DB for my other app on that cluster, and the app’s user appears in the cluster’s users list.

A Repo is configured for this app, but there are no migrations. I don’t think that should be an issue since the DB exists, but I thought I’d mention it anyway.

I knew that posting this problem would help me find the fix.

Posting led me to this thread with this answer by @Mark:

I realized almost instantly that my config/runtime.exs does have an error: the Endpoint is configured under the web context (ShrimpWeb), not the app context (Shrimp), so I changed the config to ShrimpWeb.Endpoint instead of Shrimp.Endpoint, and that fixed the issue.
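In other words, the endpoint block in config/runtime.exs now looks like this (same settings as above; only the module name changed):

config :shrimp, ShrimpWeb.Endpoint,
  server: true,
  url: [host: "#{app_name}.fly.dev", port: 80],
  http: [
    port: String.to_integer(System.get_env("PORT") || "4001"),
    # IMPORTANT: support IPv6 addresses
    transport_options: [socket_opts: [:inet6]]
  ],
  secret_key_base: secret_key_base,
  live_view: [signing_salt: lv_signing_salt]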

I’m sure Chris and Mark are already working on mistake-proofing these deploys. I’m happy to help if that’s useful; I’m good at breaking things. :wink:
