I spent probably 5-6 hours trying to deploy a static site, even though I've already had two running for the past two years. This was a from-scratch install with fly launch, and it went rather horrifically.
Initial attempt with fly launch
Dockerfile
# gostatic serves files from /srv/http; -enable-health exposes a /health endpoint
FROM pierrezemb/gostatic
COPY fly_docker1/ /srv/http/
CMD ["-port", "8080", "-enable-health", "-enable-logging"]
fly.toml
Running fly launch generates:
app = "gurubani"
primary_region = "sea"
[http_service]
internal_port = 8080
force_https = true
auto_stop_machines = true
auto_start_machines = true
min_machines_running = 0
But we probably should have some HTTP checks, so the final file looks like:
app = "gurubani"
primary_region = "sea"
[http_service]
internal_port = 8080
force_https = true
auto_stop_machines = true
auto_start_machines = true
min_machines_running = 0
[[services.http_checks]]
port = 8080
interval = 10000
grace_period = "5s"
method = "get"
path = "/health"
protocol = "http"
restart_limit = 0
timeout = 2000
tls_skip_verify = false
Result
The machine never becomes healthy, even though I can manually hit the /health endpoint and see it respond in the logs.
2023-06-18T23:51:41Z app[6e82d4d9c7d738] sea [info]11:51PM DBG Returning Service Health
Updating existing machines in 'gurubani' with rolling strategy
[1/1] Waiting for 6e82d4d9c7d738 [app] to become healthy: 0/1
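In hindsight, one suspect stands out: [http_service] is the apps-v2 block, while [[services.http_checks]] (with millisecond integers for interval and timeout) is the old [[services]]-style check table, so it may be ignored or mis-scoped entirely. The v2 docs declare HTTP checks as [[http_service.checks]] with duration strings; a sketch of what that would look like (untested here):
[[http_service.checks]]
interval = "10s"
timeout = "2s"
grace_period = "5s"
method = "GET"
path = "/health"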
Attempt 2
OK, fine, maybe something is wrong with the health check. Let's take it out and keep the originally generated file.
fly.toml
app = "gurubani"
primary_region = "sea"
[http_service]
internal_port = 8080
force_https = true
auto_stop_machines = true
auto_start_machines = true
min_machines_running = 0
fly deploy runs, and fly m clone --region ord --select works.
But then even with both machines running, a request could take 30 seconds.
time curl -4 -v https://gurubani.fly.dev/health
* Trying 66.241.125.20:443...
* Connected to gurubani.fly.dev (66.241.125.20) port 443 (#0)
* found 159 certificates in /run/current-system/profile/etc/ssl/certs/ca-certificates.crt
* found 486 certificates in /home/tjheeta/.guix-profile/etc/ssl/certs
* GnuTLS ciphers: NORMAL:-ARCFOUR-128:-CTYPE-ALL:+CTYPE-X509:-VERS-SSL3.0
* ALPN: offers h2
* ALPN: offers http/1.1
* SSL connection using TLS1.3 / ECDHE_RSA_AES_256_GCM_SHA384
* server certificate verification OK
* server certificate status verification SKIPPED
* common name: *.fly.dev (matched)
* server certificate expiration date OK
* server certificate activation date OK
* certificate public key: EC/ECDSA
* certificate version: #3
* subject: CN=*.fly.dev
* start date: Fri, 09 Jun 2023 23:43:51 GMT
* expire date: Thu, 07 Sep 2023 23:43:50 GMT
* issuer: C=US,O=Let's Encrypt,CN=R3
* ALPN: server accepted h2
* Using HTTP2, server supports multiplexing
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* h2h3 [:method: GET]
* h2h3 [:path: /health]
* h2h3 [:scheme: https]
* h2h3 [:authority: gurubani.fly.dev]
* h2h3 [user-agent: curl/7.84.0]
* h2h3 [accept: */*]
* Using Stream ID: 1 (easy handle 0xec82e0)
> GET /health HTTP/2
> Host: gurubani.fly.dev
> user-agent: curl/7.84.0
> accept: */*
>
* Connection state changed (MAX_CONCURRENT_STREAMS == 32)!
< HTTP/2 200
< date: Mon, 19 Jun 2023 00:00:28 GMT
< content-length: 2
< content-type: text/plain; charset=utf-8
< server: Fly/a0b91024 (2023-06-13)
< via: 2 fly.io
< fly-request-id: 01H38F7KZ31BHQQ4YH8EWHVVNW-lax
<
* Connection #0 to host gurubani.fly.dev left intact
Ok
real 0m9.134s
user 0m0.052s
sys 0m0.005s
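To sample the latency without all of the TLS chatter, curl's -w timing variables are enough; a minimal sketch:
for i in $(seq 1 20); do
  curl -4 -s -o /dev/null -w '%{time_total}s\n' https://gurubani.fly.dev/health
done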
Running watch 'time curl -4 https://gurubani.fly.dev/health' and looking at the logs:
2023-06-19T00:02:27Z app[6e82d4d9c7d738] sea [info]12:02AM DBG Returning Service Health
2023-06-19T00:02:30Z app[6e82d4d9c7d738] sea [info]12:02AM DBG Returning Service Health
2023-06-19T00:02:32Z app[6e82d4d9c7d738] sea [info]12:02AM DBG Returning Service Health
2023-06-19T00:02:34Z app[6e82d4d9c7d738] sea [info]12:02AM DBG Returning Service Health
2023-06-19T00:02:36Z app[6e82d4d9c7d738] sea [info]12:02AM DBG Returning Service Health
2023-06-19T00:02:38Z app[6e82d4d9c7d738] sea [info]12:02AM DBG Returning Service Health
2023-06-19T00:02:40Z app[6e82d4d9c7d738] sea [info]12:02AM DBG Returning Service Health
2023-06-19T00:02:42Z app[6e82d4d9c7d738] sea [info]12:02AM DBG Returning Service Health
2023-06-19T00:02:56Z app[287440dc047608] ord [info]12:02AM DBG Returning Service Health
2023-06-19T00:03:03Z app[287440dc047608] ord [info]12:03AM DBG Returning Service Health
2023-06-19T00:03:05Z app[287440dc047608] ord [info]12:03AM DBG Returning Service Health
Why did it switch to ord? Why did the second request to ord take 7 seconds afterwards?
And what is this?
2023-06-19T00:04:05Z app[287440dc047608] ord [info]12:04AM DBG Returning Service Health
2023-06-19T00:05:32Z proxy lax [error] request.method="GET" request.url="https://www.gurubani.org/health" request.id="01H38FEJKV9PHAFRGRXEAMZ135-lax" response.status=503 error.message="could not find a good candidate within 90 attempts at load balancing. last error: no known healthy instances found for route tcp/443. (hint: is your app shutdown? is there an ongoing deployment with a volume or using the 'immediate' strategy? has your app's instances all reached their hard limit?)"
2023-06-19T00:05:54Z app[287440dc047608] ord [info]12:05AM DBG Returning Service Health
And this is a 40 second pause to return an OK?
2023-06-19T00:11:06Z app[287440dc047608] ord [info]12:11AM DBG Returning Service Health
2023-06-19T00:11:46Z app[287440dc047608] ord [info]12:11AM DBG Returning Service Health
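With no checks configured at this point, about the only way to see what the proxy thinks is to poll machine state alongside the logs; a rough sketch (app name from above):
while true; do
  date
  fly status -a gurubani
  sleep 5
done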
Both machines are running:
fly m ls
2 machines have been retrieved from app gurubani.
View them in the UI here (https://fly.io/apps/gurubani/machines/)
gurubani
ID NAME STATE REGION IMAGE IP ADDRESS VOLUME CREATED LAST UPDATED APP PLATFORM PROCESS GROUP SIZE
6e82d4d9c7d738 bitter-paper-8833 started sea gurubani:deployment-01H38EWS2PT21SQKYND57DPMHH fdaa:0:7ce1:a7b:105:7fdf:d028:2 2023-06-18T23:29:28Z 2023-06-19T00:07:24Z v2 app shared-cpu-1x:256MB
287440dc047608 broken-shape-9191 started ord gurubani:deployment-01H38EWS2PT21SQKYND57DPMHH fdaa:0:7ce1:a7b:f4:55fa:6905:2 2023-06-18T23:55:27Z 2023-06-19T00:01:53Z v2 app shared-cpu-1x:256MB
And when this happens, no endpoint responds at all, so the static site is effectively down? We're not even using autostart/autostop at this point; routing to live instances is simply inconsistent with the basic fly.toml that fly launch generates.
Fine, let’s follow the static website instructions
fly.toml for a static site, from GitHub
app = "gurubani2"
primary_region = "sjc"
kill_signal = "SIGINT"
kill_timeout = "5s"
[[services]]
protocol = "tcp"
internal_port = 8080
processes = ["app"]
[[services.ports]]
port = 80
handlers = ["http"]
force_https = true
[[services.ports]]
port = 443
handlers = ["tls", "http"]
[services.concurrency]
type = "connections"
hard_limit = 25
soft_limit = 20
[[services.tcp_checks]]
interval = "15s"
timeout = "2s"
grace_period = "1s"
restart_limit = 0
Note the check here is a plain TCP check, which only verifies the port accepts connections. I then cloned a second machine in ord, destroyed the one in sjc, and everything seems to work fine.
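Roughly, the commands were (the machine ID is a placeholder for whatever fly m ls reports for the sjc machine):
fly m clone --region ord --select
fly m ls
fly m destroy <sjc-machine-id> --force  # --force only if it's still running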
But how about autostart and autostop now?
app = "gurubani2"
primary_region = "sjc"
kill_signal = "SIGINT"
kill_timeout = "5s"
[[services]]
auto_start_machines = true
auto_stop_machines = true
min_machines_running = 0
protocol = "tcp"
internal_port = 8080
processes = ["app"]
[[services.ports]]
port = 80
handlers = ["http"]
force_https = true
[[services.ports]]
port = 443
handlers = ["tls", "http"]
[services.concurrency]
type = "connections"
hard_limit = 25
soft_limit = 20
[[services.tcp_checks]]
interval = "15s"
timeout = "2s"
grace_period = "1s"
restart_limit = 0
Both scaling to zero and scaling up on a request seem to work properly.
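Verifying that is straightforward: leave the app idle, watch the machines stop, then time the first request that wakes one up (expect it to be slower than a warm request while a machine boots):
watch -n 10 'fly m ls'
curl -s -o /dev/null -w '%{http_code} %{time_total}s\n' https://gurubani2.fly.dev/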
Conclusion
The [http_service] setup and its checks are not working as expected; the older [[services]] format, including autostop/autostart, works fine.