Hanging on "Configuring firecracker" in ORD

The fly logs keep hanging at this point:

2023-09-10T17:33:11.430 runner[328735da656718] ord [info] Pulling container image registry.fly.io/brick-drop-co:deployment-01HA02D2SCP8ESHVHDYF1T1EMV

2023-09-10T17:33:26.380 runner[328735da656718] ord [info] Successfully prepared image registry.fly.io/brick-drop-co:deployment-01HA02D2SCP8ESHVHDYF1T1EMV (14.950077281s)

2023-09-10T17:33:26.405 runner[328735da656718] ord [info] Setting up volume 'litefs'

2023-09-10T17:33:26.405 runner[328735da656718] ord [info] Opening encrypted volume

2023-09-10T17:33:26.786 runner[328735da656718] ord [info] Configuring firecracker

The fly console shows this and eventually times out:
[1/1] Waiting for 328735da656718 [web] to have state: started

This is one of two processes for the app. The other deploys just fine, which makes me think this is not an issue like what happened with sin and maa. Still, I tried the --local-build flag to see if that caused any improvement.

References:
lhr [info]Configuring firecracker Failing
[sin] Deploy failed, stuck in Configuring firecracker - #4 by jerome

The image runs locally just fine. It runs litefs, which I know it can find, along with the config file, so this also does not match the issue that was seen with command not found (error code 127).

Reference - Deploy stuck on 'Configuring firecracker'

As in the previously linked issue, the machine cannot be stopped at all, and can only be deleted with the --force flag.

Here is the fly.toml with some lines cut.

primary_region = "ord"

[build]
  strategy = "canary"

[env]
  # ENVs cut, but are not used by the app that is failing.

[processes]
  # this app hangs
  web = "litefs mount -config /etc/litefs.web.yml"
  # this app works fine for health checks and runs the app fine, but can't route to it, one problem at a time...
  dir = "litefs mount -config /etc/litefs.directus.yml"

[http_service]
  internal_port = 8080
  force_https = true
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 0
  processes = ["web"]

  [http_service.concurrency]
    type = "connections"
    hard_limit = 50
    soft_limit = 25

  [http_service.http_options.response.headers]
    X-Process-Group = "web"
    X-Frame-Options = "SAMEORIGIN"
    X-XSS-Protection = "1; mode=block"
    X-Content-Type-Options = "nosniff"
    Referrer-Policy = "strict-origin-when-cross-origin"
    Content-Security-Policy = "default-src 'self' 'unsafe-inline' 'unsafe-eval' data:; img-src * data:; font-src * data:; style-src * 'unsafe-inline'; script-src * 'unsafe-inline' 'unsafe-eval'; connect-src *; frame-src *; object-src *; media-src *; child-src *; form-action *; frame-ancestors *; block-all-mixed-content; upgrade-insecure-requests; manifest-src *; worker-src *; prefetch-src *;"

  [[http_service.checks]]
    grace_period = "10s"
    interval = "30s"
    method = "GET"
    timeout = "5s"
    path = "/"

[[services]]
  internal_port = 8054
  protocol = "tcp"
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 1
  processes = ["dir"]

  # TODO: remove this once we confirm VPN/tunnel
  [[services.ports]]
    handlers = ["tls", "http"]
    port = 3000
    force_https = false

    [services.ports.http_options.response.headers]
      X-Process-Group = "dir"

  [services.concurrency]
    type = "connections"
    hard_limit = 25
    soft_limit = 20

  [[services.http_checks]]
    interval = 10000
    grace_period = "5s"
    method = "get"
    path = "/admin/login"
    protocol = "http"
    timeout = 2000
    tls_skip_verify = false
    [services.http_checks.headers]

[checks]
  [checks.web]
    grace_period = "8s"
    interval = "15s"
    method = "get"
    path = "/"
    port = 8080
    timeout = "30s"
    type = "http"
    processes = ["web"]
  [checks.dir]
    grace_period = "8s"
    interval = "60s"
    method = "get"
    path = "/admin/login"
    port = 8054
    timeout = "60s"
    type = "http"
    processes = ["dir"]

[mounts]
  source = "litefs"
  destination = "/var/lib/litefs"
  processes = ["web", "dir"]

[metrics]
  port = 9091       # default for most prometheus clients
  path = "/metrics"

Any help will be appreciated. This has me stumped.

I have run the Go app locally just fine, and the base code for this app is 100% the same as a POC I had running correctly on fly.io last week.

When the deploy times out, it says this:

WARN failed to release lease for machine 328735da656718: lease not found

Error: timeout reached waiting for machine to started failed to wait for VM 328735da656718 in started state: Get "https://api.machines.dev/v1/apps/brick-drop-co/machines/328735da656718/wait?instance_id=01HA02F49A8NW6ZTQK6MX0PNT1&state=started&timeout=60": net/http: request canceled
You can increase the timeout with the --wait-timeout flag
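
The wait URL in that error can be rebuilt and called by hand, which may help to watch the state transition outside of a deploy. This is only a rough sketch: the host, path, and query parameters are copied from the error message above, while the function names and the bearer-token header are my assumptions, not something the logs confirm.

```python
import urllib.error
import urllib.parse
import urllib.request

# Host and path prefix copied from the URL in the deploy error above.
API_BASE = "https://api.machines.dev/v1"


def build_wait_url(app, machine_id, instance_id, state="started", timeout=60):
    """Rebuild the wait URL that appears in the deploy error."""
    query = urllib.parse.urlencode(
        {"instance_id": instance_id, "state": state, "timeout": timeout}
    )
    return f"{API_BASE}/apps/{app}/machines/{machine_id}/wait?{query}"


def wait_for_state(url, token):
    """Call the wait endpoint once and return (HTTP status, raw response body).

    Assumption: the API accepts the token as an Authorization bearer header.
    """
    request = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    try:
        with urllib.request.urlopen(request, timeout=90) as response:
            return response.status, response.read().decode()
    except urllib.error.HTTPError as err:
        # Non-2xx responses still carry a body worth reading.
        return err.code, err.read().decode()
```

Calling build_wait_url("brick-drop-co", "328735da656718", "01HA02F49A8NW6ZTQK6MX0PNT1") gives back the same URL as in the error, and passing it to wait_for_state along with an API token (for example one read from an environment variable) would show whether the machine ever leaves its current state.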

I tried to run fly m start <id> with LOG_LEVEL=DEBUG and got this:

DEBUG {
  "error": "failed_precondition: unable to start machine from current state: 'created'"
}

I also tried cloning an instance with LOG_LEVEL set. I am not sure it showed anything interesting apart from the error at the end, which is just the internal message for timing out.

DEBUG {
  "error": "deadline_exceeded: machine failed to reach desired state, started, currently created"
}
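
Both DEBUG payloads have the same shape: a status-style code, a colon, then a message. A minimal sketch that splits them apart (the helper name split_error is made up for illustration) makes it easier to compare the two and see that the machine is reported as created in both cases:

```python
import json


def split_error(payload):
    """Split a DEBUG error payload into (code, detail) at the first ': '."""
    message = json.loads(payload)["error"]
    code, _, detail = message.partition(": ")
    return code, detail


# The two payloads from the DEBUG output above.
start_error = split_error(
    '{"error": "failed_precondition: unable to start machine from current state: \'created\'"}'
)
clone_error = split_error(
    '{"error": "deadline_exceeded: machine failed to reach desired state, started, currently created"}'
)

for code, detail in (start_error, clone_error):
    print(f"{code:20} {detail}")
```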

I am not sure this is a solution so much as a workaround, and I would still love to know what is wrong.

I commented this out:

[http_service]
  internal_port = 8080
  force_https = true
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 1
  processes = ["web"]

  [http_service.concurrency]
    type = "connections"
    hard_limit = 50
    soft_limit = 25

  [http_service.http_options.response.headers]
    X-Process-Group = "web"
    X-Frame-Options = "SAMEORIGIN"
    X-XSS-Protection = "1; mode=block"
    X-Content-Type-Options = "nosniff"
    Referrer-Policy = "strict-origin-when-cross-origin"
    Content-Security-Policy = "default-src 'self' 'unsafe-inline' 'unsafe-eval' data:; img-src * data:; font-src * data:; style-src * 'unsafe-inline'; script-src * 'unsafe-inline' 'unsafe-eval'; connect-src *; frame-src *; object-src *; media-src *; child-src *; form-action *; frame-ancestors *; block-all-mixed-content; upgrade-insecure-requests; manifest-src *; worker-src *; prefetch-src *;"

  [[http_service.checks]]
    grace_period = "240s"
    interval = "120s"
    method = "GET"
    timeout = "10s"
    path = "/"

And replaced it with:

[[services]]
  internal_port = 8080
  protocol = "tcp"
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 1
  processes = ["web"]

  [[services.ports]]
    handlers = ["http"]
    port = 80
    force_https = true
    [services.ports.http_options.response.headers]
      X-Process-Group = "web"

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443
    [services.ports.http_options.response.headers]
      X-Process-Group = "web"

And now it works without any issues. I have no idea why.
