We’re seeing an outage of over 2 minutes every time we deploy our Phoenix LiveView application. While some instances are passing health checks but not all desired instances have been placed, a browser can request a LiveView page and the static HTML is served successfully, but the websocket connection fails. After some time, usually over 2 minutes, a user loading a page will briefly see “Internal Server Error”, after which everything works and websocket connections succeed. This seems to coincide with the point at which the health checks for all instances are passing.
This screen recording illustrates what I’m describing. Below, I’ve also pasted the deployment logs from the period during which it was recorded.
FYI we’re at scale 3, and we’re on phoenix 1.6.2, phoenix_live_view 0.17.6, Elixir 1.13, and OTP 24.2.
Deployment logs:
[...]
==> Monitoring deployment
3 desired, 1 placed, 0 healthy, 0 unhealthy
[health checks: 1 total 3 desired, 1 placed, 0 healthy, 0 unhealthy
[health checks: 1 total 3 desired, 1 placed, 1 healthy, 0 unhealthy
[health checks: 1 total 3 desired, 2 placed, 1 healthy, 0 unhealthy
[health checks: 1 total 3 desired, 2 placed, 1 healthy, 0 unhealthy
[health checks: 1 total 3 desired, 2 placed, 1 healthy, 0 unhealthy
[health checks: 1 total 3 desired, 2 placed, 1 healthy, 0 unhealthy
[health checks: 2 total 3 desired, 2 placed, 1 healthy, 0 unhealthy
[health checks: 2 total 3 desired, 2 placed, 1 healthy, 0 unhealthy
[health checks: 2 total 3 desired, 2 placed, 2 healthy, 0 unhealthy
[health checks: 2 total 3 desired, 3 placed, 2 healthy, 0 unhealthy
[health checks: 2 total 3 desired, 3 placed, 2 healthy, 0 unhealthy
[health checks: 2 total 3 desired, 3 placed, 2 healthy, 0 unhealthy
[health checks: 2 total 3 desired, 3 placed, 2 healthy, 0 unhealthy
[health checks: 3 total 3 desired, 3 placed, 2 healthy, 0 unhealthy
[health checks: 3 total 3 desired, 3 placed, 3 healthy, 0 unhealthy
[health checks: 3 total 3 desired, 3 placed, 3 healthy, 0 unhealthy
[health checks: 3 total, 3 passing]
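To put a rough number on the window in which only some instances are healthy, here is a minimal Python sketch that scans status fragments like the ones in the log above. The function names (`parse_statuses`, `partial_window`) are made up for illustration, and the sample lines are copied from the log. It counts status updates, not seconds, since the log carries no timestamps.

```python
import re

# Matches status fragments such as
# "3 desired, 2 placed, 1 healthy, 0 unhealthy".
STATUS_RE = re.compile(
    r"(\d+) desired, (\d+) placed, (\d+) healthy, (\d+) unhealthy"
)


def parse_statuses(log_text):
    """Return one dict per status fragment found in the log text."""
    keys = ("desired", "placed", "healthy", "unhealthy")
    return [
        dict(zip(keys, map(int, match.groups())))
        for match in STATUS_RE.finditer(log_text)
    ]


def partial_window(statuses):
    """Count updates where some, but not all, desired instances are healthy."""
    return sum(1 for s in statuses if 0 < s["healthy"] < s["desired"])


# A few status lines taken from the deployment log above.
SAMPLE = """\
3 desired, 1 placed, 0 healthy, 0 unhealthy
3 desired, 1 placed, 1 healthy, 0 unhealthy
3 desired, 2 placed, 2 healthy, 0 unhealthy
3 desired, 3 placed, 3 healthy, 0 unhealthy
"""

if __name__ == "__main__":
    statuses = parse_statuses(SAMPLE)
    print(partial_window(statuses), "of", len(statuses),
          "updates were partially healthy")
```

Run against the full log above, this counts 13 of the 17 status updates as partially healthy, which lines up with the period in which the websocket connections fail for us.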
fly.production.toml:
app = "production"
kill_signal = "SIGTERM"
kill_timeout = 5
processes = []

[deploy]
  release_command = "/app/bin/migrate"

[experimental]
  allowed_public_ports = []
  auto_rollback = true

[[services]]
  http_checks = []
  internal_port = 4000
  processes = ["app"]
  protocol = "tcp"
  script_checks = []

  [services.concurrency]
    hard_limit = 25
    soft_limit = 20
    type = "connections"

  [[services.ports]]
    handlers = ["http"]
    port = 80

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443

  [[services.tcp_checks]]
    grace_period = "30s" # allow some time for startup
    interval = "15s"
    restart_limit = 6
    timeout = "2s"
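Note that the only check configured here is a TCP check (`http_checks` is empty). As far as I understand, a TCP check passes as soon as the port accepts a connection, which says nothing about whether the app can serve a request yet. I don't know whether that is related to the symptom, but the difference can be sketched in Python as follows. `tcp_check` and `http_check` are hypothetical helpers written for this post, not Fly's actual implementation, and the 2-second default mirrors the `timeout` above.

```python
import http.client
import socket


def tcp_check(host, port, timeout=2.0):
    """Pass if a TCP connection can be opened; nothing is sent or read."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_check(host, port, path="/", timeout=2.0):
    """Pass only if the server answers an HTTP GET with a 2xx status."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        return 200 <= conn.getresponse().status < 300
    except (OSError, http.client.HTTPException):
        return False
    finally:
        conn.close()


if __name__ == "__main__":
    # A listener that accepts connections at the OS level but never speaks
    # HTTP, standing in for an instance whose port is open before it can
    # actually serve requests.
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    port = listener.getsockname()[1]
    print("tcp_check:", tcp_check("127.0.0.1", port))
    print("http_check:", http_check("127.0.0.1", port, timeout=0.5))
    listener.close()
```

The demo listener accepts the connection but never replies, so the TCP probe passes while the HTTP probe times out and fails.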