HTTP services/checks don't work as expected

I spent probably 5-6 hours trying to deploy a static site, even though I’ve already had two running for the past two years. I did a from-scratch install with fly launch and it went rather horrifically.

Initial attempt with fly launch

Dockerfile

FROM pierrezemb/gostatic
COPY fly_docker1/ /srv/http/
CMD ["-port", "8080", "-enable-health", "-enable-logging"]

fly.toml
Using fly launch gives:

app = "gurubani"
primary_region = "sea"

[http_service]
  internal_port = 8080
  force_https = true
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 0

But we should probably have some HTTP checks, so the final file looks like:

app = "gurubani"
primary_region = "sea"

[http_service]
  internal_port = 8080
  force_https = true
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 0

[[services.http_checks]]
    port = 8080
    interval = 10000
    grace_period = "5s"
    method = "get"
    path = "/health"
    protocol = "http"
    restart_limit = 0
    timeout = 2000
    tls_skip_verify = false

Result

The machine never returns healthy. I can manually hit the /health endpoint and see it return in the logs though.

2023-06-18T23:51:41Z app[6e82d4d9c7d738] sea [info]11:51PM DBG Returning Service Health
Updating existing machines in 'gurubani' with rolling strategy
  [1/1] Waiting for 6e82d4d9c7d738 [app] to become healthy: 0/1

Attempt 2

OK, fine, maybe something is wrong with the health check. Let’s take it out and keep the original generated file.

fly.toml

app = "gurubani"
primary_region = "sea"

[http_service]
  internal_port = 8080
  force_https = true
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 0

fly deploy runs and fly m clone --region ord --select works.

But then even with both machines running, a request could take 30 seconds.

time curl -4 -v  https://gurubani.fly.dev/health
*   Trying 66.241.125.20:443...
* Connected to gurubani.fly.dev (66.241.125.20) port 443 (#0)
* found 159 certificates in /run/current-system/profile/etc/ssl/certs/ca-certificates.crt
* found 486 certificates in /home/tjheeta/.guix-profile/etc/ssl/certs
* GnuTLS ciphers: NORMAL:-ARCFOUR-128:-CTYPE-ALL:+CTYPE-X509:-VERS-SSL3.0
* ALPN: offers h2
* ALPN: offers http/1.1
* SSL connection using TLS1.3 / ECDHE_RSA_AES_256_GCM_SHA384
*   server certificate verification OK
*   server certificate status verification SKIPPED
*   common name: *.fly.dev (matched)
*   server certificate expiration date OK
*   server certificate activation date OK
*   certificate public key: EC/ECDSA
*   certificate version: #3
*   subject: CN=*.fly.dev
*   start date: Fri, 09 Jun 2023 23:43:51 GMT
*   expire date: Thu, 07 Sep 2023 23:43:50 GMT
*   issuer: C=US,O=Let's Encrypt,CN=R3
* ALPN: server accepted h2
* Using HTTP2, server supports multiplexing
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* h2h3 [:method: GET]
* h2h3 [:path: /health]
* h2h3 [:scheme: https]
* h2h3 [:authority: gurubani.fly.dev]
* h2h3 [user-agent: curl/7.84.0]
* h2h3 [accept: */*]
* Using Stream ID: 1 (easy handle 0xec82e0)
> GET /health HTTP/2
> Host: gurubani.fly.dev
> user-agent: curl/7.84.0
> accept: */*
> 
* Connection state changed (MAX_CONCURRENT_STREAMS == 32)!
< HTTP/2 200 
< date: Mon, 19 Jun 2023 00:00:28 GMT
< content-length: 2
< content-type: text/plain; charset=utf-8
< server: Fly/a0b91024 (2023-06-13)
< via: 2 fly.io
< fly-request-id: 01H38F7KZ31BHQQ4YH8EWHVVNW-lax
< 
* Connection #0 to host gurubani.fly.dev left intact
Ok
real    0m9.134s
user    0m0.052s
sys     0m0.005s

Running watch 'time curl -4 https://gurubani.fly.dev/health' and looking at the logs:

2023-06-19T00:02:27Z app[6e82d4d9c7d738] sea [info]12:02AM DBG Returning Service Health
2023-06-19T00:02:30Z app[6e82d4d9c7d738] sea [info]12:02AM DBG Returning Service Health
2023-06-19T00:02:32Z app[6e82d4d9c7d738] sea [info]12:02AM DBG Returning Service Health
2023-06-19T00:02:34Z app[6e82d4d9c7d738] sea [info]12:02AM DBG Returning Service Health
2023-06-19T00:02:36Z app[6e82d4d9c7d738] sea [info]12:02AM DBG Returning Service Health
2023-06-19T00:02:38Z app[6e82d4d9c7d738] sea [info]12:02AM DBG Returning Service Health
2023-06-19T00:02:40Z app[6e82d4d9c7d738] sea [info]12:02AM DBG Returning Service Health
2023-06-19T00:02:42Z app[6e82d4d9c7d738] sea [info]12:02AM DBG Returning Service Health
2023-06-19T00:02:56Z app[287440dc047608] ord [info]12:02AM DBG Returning Service Health
2023-06-19T00:02:42Z app[6e82d4d9c7d738] sea [info]12:02AM DBG Returning Service Health
2023-06-19T00:02:56Z app[287440dc047608] ord [info]12:02AM DBG Returning Service Health
2023-06-19T00:03:03Z app[287440dc047608] ord [info]12:03AM DBG Returning Service Health
2023-06-19T00:03:05Z app[287440dc047608] ord [info]12:03AM DBG Returning Service Health

Why did it switch to ord? Why did the second request to ord take 7 seconds afterwards?

And what is this?

2023-06-19T00:04:05Z app[287440dc047608] ord [info]12:04AM DBG Returning Service Health
error.message="could not find a good candidate within 90 attempts at load balancing. last error: no known healthy instances found for route tcp/443. (hint: is your app shutdown? is there an ongoing deployment with a volume or using the 'immediate' strategy? has your app's instances all reached their hard limit?)" 2023-06-19T00:05:32Z proxy lax [error]request.method="GET" request.url="https://www.gurubani.org/health" request.id="01H38FEJKV9PHAFRGRXEAMZ135-lax" response.status=503 
2023-06-19T00:05:54Z app[287440dc047608] ord [info]12:05AM DBG Returning Service Health

And this is a 40 second pause to return an OK?

2023-06-19T00:11:06Z app[287440dc047608] ord [info]12:11AM DBG Returning Service Health
2023-06-19T00:11:46Z app[287440dc047608] ord [info]12:11AM DBG Returning Service Health

Both machines are running:

fly m ls
2 machines have been retrieved from app gurubani.
View them in the UI here (https://fly.io/apps/gurubani/machines/)

gurubani
ID              NAME                    STATE   REGION  IMAGE                                              IP ADDRESS                      VOLUME  CREATED                 LAST UPDATED               APP PLATFORM    PROCESS GROUP   SIZE                
6e82d4d9c7d738  bitter-paper-8833       started sea     gurubani:deployment-01H38EWS2PT21SQKYND57DPMHH     fdaa:0:7ce1:a7b:105:7fdf:d028:2         2023-06-18T23:29:28Z    2023-06-19T00:07:24Z       v2              app             shared-cpu-1x:256MB
287440dc047608  broken-shape-9191       started ord     gurubani:deployment-01H38EWS2PT21SQKYND57DPMHH     fdaa:0:7ce1:a7b:f4:55fa:6905:2          2023-06-18T23:55:27Z    2023-06-19T00:01:53Z       v2              app             shared-cpu-1x:256MB

And when this happens, no test endpoint responds, so a static site is down? We’re not even relying on autostart/autostop at this point; routing to live instances is simply inconsistent with the basic fly.toml generated by fly launch.

Fine, let’s follow the static website instructions
fly.toml for a static site, from GitHub

app = "gurubani2"
primary_region = "sjc"
kill_signal = "SIGINT"
kill_timeout = "5s"

[[services]]
  protocol = "tcp"
  internal_port = 8080
  processes = ["app"]

  [[services.ports]]
    port = 80
    handlers = ["http"]
    force_https = true

  [[services.ports]]
    port = 443
    handlers = ["tls", "http"]
  [services.concurrency]
    type = "connections"
    hard_limit = 25
    soft_limit = 20

  [[services.tcp_checks]]
    interval = "15s"
    timeout = "2s"
    grace_period = "1s"
    restart_limit = 0

I then cloned a second machine in ord and destroyed the one in sjc.

And everything seems to work fine.

But how about autostart and autostop now?

app = "gurubani2"
primary_region = "sjc"
kill_signal = "SIGINT"
kill_timeout = "5s"

[[services]]
  auto_start_machines = true
  auto_stop_machines = true
  min_machines_running = 0
  protocol = "tcp"
  internal_port = 8080
  processes = ["app"]

  [[services.ports]]
    port = 80
    handlers = ["http"]
    force_https = true

  [[services.ports]]
    port = 443
    handlers = ["tls", "http"]
  [services.concurrency]
    type = "connections"
    hard_limit = 25
    soft_limit = 20

  [[services.tcp_checks]]
    interval = "15s"
    timeout = "2s"
    grace_period = "1s"
    restart_limit = 0

Both scaling to zero and scaling up on a request seem to work properly.

Conclusion

The http_service setup and its checks are not working as expected.

Ok. There’s a lot going on here.

First off, that fly.toml doesn’t look correct to me. I don’t think a [[services.http_checks]] can exist on its own without a parent [[services]].
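
For reference, a check nested under the generated [http_service] section might look like the sketch below. It assumes a flyctl version whose fly.toml supports [[http_service.checks]], which may not match the version used in this thread. The check values are the ones from the original attempt, rewritten as duration strings.

```toml
app = "gurubani"
primary_region = "sea"

[http_service]
  internal_port = 8080
  force_https = true
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 0

  # Nested under http_service so the check has a parent service.
  # Assumption: this flyctl version supports [[http_service.checks]].
  [[http_service.checks]]
    grace_period = "5s"
    interval = "10s"
    method = "GET"
    timeout = "2s"
    path = "/health"
```

After deploying, fly checks ls -a gurubani should list the check if it was picked up.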

Now, I took an extra look at your service and health checks, and they are not in sync.
The checks aren’t being run against the /health endpoint as expected, which explains why the proxy is unable to route to the service it’s supposed to be checking.
I can also see some checks to port 8043 that are timing out. Did you set that check?

I haven’t tried to reproduce it yet, but I believe this is just an issue of misconfiguration of fly.toml.

Can you modify your fly.toml to use [[services]] instead of [http_service] and redeploy your app?
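
A minimal sketch of that change, reusing the [[services]] layout from the static-site fly.toml earlier in the thread and adding an HTTP check against /health. The check values mirror the ones in the original attempt; treat them as assumptions rather than recommendations.

```toml
app = "gurubani"
primary_region = "sea"

[[services]]
  protocol = "tcp"
  internal_port = 8080
  processes = ["app"]

  [[services.ports]]
    port = 80
    handlers = ["http"]
    force_https = true

  [[services.ports]]
    port = 443
    handlers = ["tls", "http"]

  # The check sits inside the [[services]] entry and probes internal_port,
  # so no separate port setting is given here.
  [[services.http_checks]]
    interval = "10s"
    timeout = "2s"
    grace_period = "5s"
    method = "get"
    path = "/health"
    protocol = "http"
```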

I’m sure fly.toml is not correct. But it’s literally created by running fly launch and then adding the configuration for HTTP health checks from the “Fly Launch configuration (fly.toml)” page in the Fly Docs. If it isn’t working, and neither the docs nor fly launch gives a sane configuration, how is anyone supposed to figure it out?

Not sure what you’re talking about. There are no checks. Even a check against index would be fine.

$ fly checks ls -a gurubani
Health Checks for gurubani
  NAME | STATUS | MACHINE | LAST UPDATED | OUTPUT  
-------*--------*---------*--------------*---------


$ fly m ls -a gurubani -j
....
# NOTHING RELATED TO CHECKS

Super-easy to reproduce. Just use fly launch with a static site Dockerfile and then use watch with curl. Just had an issue right now.

$ watch 'time curl -4  https://gurubani.fly.dev'
2023-06-19T15:07:57Z app[287440dc047608] ord [info]3:07PM DBG Returning Service Health
2023-06-19T15:07:57Z app[287440dc047608] ord [info]3:07PM DBG Returning Service Health
2023-06-19T15:07:59Z app[287440dc047608] ord [info]3:07PM DBG Returning Service Health
2023-06-19T15:08:01Z app[287440dc047608] ord [info]3:08PM DBG Returning Service Health
2023-06-19T15:08:03Z app[287440dc047608] ord [info]3:08PM DBG Returning Service Health
2023-06-19T15:08:05Z app[287440dc047608] ord [info]3:08PM DBG Returning Service Health
2023-06-19T15:08:07Z app[287440dc047608] ord [info]3:08PM DBG Returning Service Health
2023-06-19T15:08:10Z app[287440dc047608] ord [info]3:08PM DBG Returning Service Health
2023-06-19T15:08:12Z app[287440dc047608] ord [info]3:08PM DBG Returning Service Health
2023-06-19T15:08:14Z app[287440dc047608] ord [info]3:08PM DBG Returning Service Health
2023-06-19T15:08:16Z app[287440dc047608] ord [info]3:08PM DBG Returning Service Health
2023-06-19T15:08:18Z app[287440dc047608] ord [info]3:08PM DBG Returning Service Health
2023-06-19T15:08:20Z app[287440dc047608] ord [info]3:08PM DBG Returning Service Health
2023-06-19T15:08:23Z app[287440dc047608] ord [info]3:08PM DBG Returning Service Health
2023-06-19T15:08:25Z app[287440dc047608] ord [info]3:08PM DBG Returning Service Health
2023-06-19T15:08:27Z app[287440dc047608] ord [info]3:08PM DBG Returning Service Health
2023-06-19T15:08:29Z app[287440dc047608] ord [info]3:08PM DBG Returning Service Health
2023-06-19T15:08:31Z app[287440dc047608] ord [info]3:08PM DBG Returning Service Health
### TOOK 14 seconds to return
2023-06-19T15:08:45Z app[287440dc047608] ord [info]3:08PM DBG Returning Service Health
2023-06-19T15:08:47Z app[287440dc047608] ord [info]3:08PM DBG Returning Service Health
2023-06-19T15:08:49Z app[287440dc047608] ord [info]3:08PM DBG Returning Service Health
### TOOK 11 seconds to return
2023-06-19T15:09:00Z app[287440dc047608] ord [info]3:09PM DBG Returning Service Health
2023-06-19T15:09:02Z app[287440dc047608] ord [info]3:09PM DBG Returning Service Health

That’s what I did in step 3 of “Fine, let’s follow the static website instructions” because the http_service configuration given by fly launch was not working reliably. gurubani2 uses services, gurubani uses http_service.

I also emailed support and didn’t get a satisfactory response. There is some QA missing somewhere as this is a static site.
