LiveView app unavailable for 2+ minutes after deployment

We’re seeing an outage of over 2 minutes every time we deploy our Phoenix LiveView application. During the window in which some instances are passing health checks but not all desired instances have been placed, a browser can request a LiveView page and the static HTML is served successfully, but the websocket connection fails. After some time, usually over 2 minutes, there’s a brief period in which loading a page returns “Internal Server Error”, after which everything works and websocket connections succeed. This seems to correspond to the point at which the health checks for all instances are passing.

This screen recording illustrates what I’m describing. I’ve also pasted below the deployment logs from the period during which this was recorded.

FYI we’re at scale 3, and we’re on phoenix 1.6.2, phoenix_live_view 0.17.6, Elixir 1.13, and OTP 24.2.

Deployment logs:

[...]
==> Monitoring deployment

 3 desired, 1 placed, 0 healthy, 0 unhealthy [health checks: 1 total]
 3 desired, 1 placed, 0 healthy, 0 unhealthy [health checks: 1 total]
 3 desired, 1 placed, 1 healthy, 0 unhealthy [health checks: 1 total]
 3 desired, 2 placed, 1 healthy, 0 unhealthy [health checks: 1 total]
 3 desired, 2 placed, 1 healthy, 0 unhealthy [health checks: 1 total]
 3 desired, 2 placed, 1 healthy, 0 unhealthy [health checks: 1 total]
 3 desired, 2 placed, 1 healthy, 0 unhealthy [health checks: 2 total]
 3 desired, 2 placed, 1 healthy, 0 unhealthy [health checks: 2 total]
 3 desired, 2 placed, 1 healthy, 0 unhealthy [health checks: 2 total]
 3 desired, 2 placed, 2 healthy, 0 unhealthy [health checks: 2 total]
 3 desired, 3 placed, 2 healthy, 0 unhealthy [health checks: 2 total]
 3 desired, 3 placed, 2 healthy, 0 unhealthy [health checks: 2 total]
 3 desired, 3 placed, 2 healthy, 0 unhealthy [health checks: 2 total]
 3 desired, 3 placed, 2 healthy, 0 unhealthy [health checks: 3 total]
 3 desired, 3 placed, 2 healthy, 0 unhealthy [health checks: 3 total]
 3 desired, 3 placed, 3 healthy, 0 unhealthy [health checks: 3 total]
 3 desired, 3 placed, 3 healthy, 0 unhealthy [health checks: 3 total, 3 passing]

fly.production.toml:

app = "production"

kill_signal = "SIGTERM"
kill_timeout = 5
processes = []

[deploy]
  release_command = "/app/bin/migrate"

[experimental]
  allowed_public_ports = []
  auto_rollback = true

[[services]]
  http_checks = []
  internal_port = 4000
  processes = ["app"]
  protocol = "tcp"
  script_checks = []

  [services.concurrency]
    hard_limit = 25
    soft_limit = 20
    type = "connections"

  [[services.ports]]
    handlers = ["http"]
    port = 80

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443

  [[services.tcp_checks]]
    grace_period = "30s"  # allow some time for startup
    interval = "15s"
    restart_limit = 6
    timeout = "2s"

What region is this running in?

Instances
ID      	PROCESS	VERSION	REGION	DESIRED	STATUS 	HEALTH CHECKS     	RESTARTS	CREATED   
fe295281	app    	458    	ewr(B)	run    	running	1 total, 1 passing	0       	1h40m ago	
fcba76b6	app    	458    	vin(B)	run    	running	1 total, 1 passing	0       	1h41m ago	
951310f7	app    	458    	iad   	run    	running	1 total, 1 passing	0       	1h42m ago

Try running fly regions backup iad and see if that helps?

Hmmm just tried this and deployed, and got:

Could not create schema migrations table. This error usually happens due to the following:
* The database does not exist

We haven’t changed our configuration at all.

More context:

image: registry.fly.io/production:deployment-1642966337
image size: 66 MB
==> Creating release
--> release v461 created

--> You can detach the terminal anytime without stopping the deployment
==> Release command detected: /app/bin/migrate

--> This release will not be available until the release command succeeds.
	 Starting instance
	 Configuring virtual machine
	 Pulling container image
	 Unpacking image
	 Preparing kernel init
	 Configuring firecracker
	 Starting virtual machine
	 Starting init (commit: 0c50bff)...
	 Preparing to run: `/app/bin/migrate` as nobody
	 2022/01/23 19:34:27 listening on [fdaa:0:309a:a7b:ab8:ec2a:3bf1:2]:22 (DNS: [fdaa::3]:53)
	 Reaped child process with pid: 563 and signal: SIGUSR1, core dumped? false
	 19:34:31.338 [error] Could not create schema migrations table. This error usually happens due to the following:
	   * The database does not exist
	 "mix ecto.create". Alternatively you may configure Ecto to use
	     config :enaia, Enaia.Repo,
	 See DBConnection.start_link/2 for more information
	     (ecto_sql 3.7.1) lib/ecto/migrator.ex:678: Ecto.Migrator.verbose_schema_migration/3
	     (ecto_sql 3.7.1) lib/ecto/migrator.ex:419: Ecto.Migrator.run/4
	     (ecto_sql 3.7.1) lib/ecto/migrator.ex:146: Ecto.Migrator.with_repo/3
	 Main child exited normally with code: 1
	 Reaped child process with pid: 565 and signal: SIGUSR1, core dumped? false
	 Starting clean up.
Error Release command failed, deployment aborted

…then I tried again, with no changes to the configuration, and it deployed, but with essentially the same behavior described previously.

Bumping this in case anyone from Fly has other ideas. I don’t believe this is an issue on our end and a 2+ minute outage makes continuous deployment impossible for us.

Hi @enaia

Is it possible that your existing containers are killed prematurely?
It looks like your current kill_timeout is set to 5 seconds, so your server might still be processing final requests when it’s killed forcibly at the timeout. If so, increasing the timeout to 1 or 2 minutes might help.
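For example, something along these lines in fly.production.toml (120 seconds is just an illustrative value; pick whatever comfortably covers your longest in-flight work):

kill_signal = "SIGTERM"
# 120 is only an example; give the old VM enough time to drain
# in-flight requests before it is force-killed
kill_timeout = 120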

Thanks for responding. I’d be happy to try a different setting for kill_timeout, but it’s hard to see how this would be the cause of the behavior we’re seeing. The issue isn’t existing connections being killed or jobs being terminated; it’s that websocket connections from new clients, or reconnections from existing clients, fail during a 2+ minute period while the new version is being rolled out, making the application essentially unusable. I explained this in more detail in the first post in this thread and included a link to a screen recording showing the behavior. We’re a very standard LiveView app.

…Update: tried it. The behavior may have been a bit different in that the “outage” seemed to start later, but the deployment process also lasted longer, so the effect was still a couple of minutes during which new connections / reconnections failed.

Do 100% of new connections / reconnections fail within those couple of minutes?

Also, how long does it typically take for your server to start and become ready to accept websocket connections? Are the instances reporting a “healthy” state before they are ready, since you are using TCP checks instead of HTTP checks?
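For comparison, an HTTP check would look something like this in fly.toml, replacing the empty http_checks = [] line in your [[services]] block (the path and timings here are placeholders; point it at a route that only responds once the app can really serve traffic):

  [[services.http_checks]]
    grace_period = "30s"
    interval = 10000      # milliseconds; example value
    method = "get"
    path = "/"            # placeholder; use a route that implies readiness
    protocol = "http"
    timeout = 2000        # milliseconds; example value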

I’m also wondering if you’re possibly running into this issue:

Regarding the slow propagation issue, my understanding based on this is that scaling to 3 instances is the current workaround. We’ve done this already, as mentioned in my initial post.

Do 100% of new connections / reconnections fail within those couple of minutes?

That seemed to be the case as of a couple of days ago, as you can see in the screen recording linked from my initial post. Each time I reloaded the page, the static page was served successfully and the websocket connection failed.

Also, how long does it typically take for your server to start and become ready to accept websocket connections? Are the instances reporting a “healthy” state before they are ready, since you are using TCP checks instead of HTTP checks?

It’s a LiveView application, and static pages are being served successfully. In case you’re not familiar with the LiveView lifecycle, from the docs:

LiveView is first rendered statically as part of regular HTTP requests, which provides quick times for “First Meaningful Paint”, in addition to helping search and indexing engines. Then a persistent connection is established between client and server.

As far as I know, once the application is ready to serve static pages, it’s also ready to accept websocket connections. This is the behavior I’ve seen locally and when I’ve deployed on other infrastructure.
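For reference, the relevant part of a standard LiveView endpoint looks roughly like this (module names, app name, and salts here are placeholders, not our actual code). The same endpoint that serves the static render also accepts the /live websocket upgrade, so there’s no separate warm-up step for websockets:

defmodule MyAppWeb.Endpoint do
  use Phoenix.Endpoint, otp_app: :my_app

  # Placeholder session settings, just for illustration.
  @session_options [
    store: :cookie,
    key: "_my_app_key",
    signing_salt: "aBcDeFgH"
  ]

  # LiveView’s persistent connection comes in over this socket...
  socket "/live", Phoenix.LiveView.Socket,
    websocket: [connect_info: [session: @session_options]]

  # ...while the initial static render goes through the normal plug pipeline.
  plug Plug.Static, at: "/", from: :my_app, gzip: false
  plug Plug.Session, @session_options
  plug MyAppWeb.Router
end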

Any chance Chris McCord could take a look at this?

Can you get log messages from the app itself while this is happening? The failed websocket responses probably mean the connection isn’t getting to the Phoenix process, but I’m curious.

I can replicate this somewhat reliably with a LiveView app. We may be doing a poor job of finding an available VM for a request with an upgrade header. We’re investigating the problem here, but I don’t have an answer yet.

I believe I’m seeing a very similar behaviour with my much simpler, non-LiveView application. It seems that even after passing health checks, the internal Fly TCP load balancing takes about a couple of minutes to update its routes. When that happens, and the old VMs are shutting down, I can’t even connect to the public IP.

This is with a bluegreen deployment. Rolling hides it, or maybe makes it go away; it’s hard to tell.

@enaia I believe we’ve found the core problem here. When an HTTP request includes an Upgrade header, we don’t do as good a job of finding an available VM or retrying when a connection fails. So the websocket errors are just attempts to connect to a now-gone VM.

We may be able to fix this pretty quickly; I appreciate you debugging this for us! This is a slightly different problem from the service propagation issue, and solving it will make apps behave better. Fixing service propagation probably would have hidden the problem so we never would have known! :slight_smile:


@cedivad oh yes, this makes sense for bluegreen. The propagation delay is masked with a rolling deploy because old VMs stay alive long enough for the new VM to get registered everywhere.

Bluegreen deploys stop all the existing VMs at once, and that happens faster than the new VMs get registered. You may be able to work around this with a shutdown_delay and some code to catch SIGINT and delay shutdown of the app. It’s gross as heck and you shouldn’t need to do that, but it might work.
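A totally untested sketch of the app-side part: rather than a custom signal handler, one way to get the delay is an Application.prep_stop/1 callback that just sleeps before the supervision tree comes down (MyApp/MyAppWeb are placeholders and the 20-second sleep is a made-up number; you’d also want kill_timeout to be longer than the sleep so the VM isn’t force-killed mid-delay):

defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      MyAppWeb.Endpoint
      # ...the rest of your supervision tree
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end

  # Runs during an orderly shutdown (e.g. SIGTERM -> init:stop/0) before the
  # supervision tree is taken down, so the endpoint keeps accepting
  # connections while the new VMs get registered.
  @impl true
  def prep_stop(state) do
    # 20 seconds is an arbitrary example; tune it to your propagation delay.
    Process.sleep(20_000)
    state
  end
end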


@enaia we deployed a fix that seems to solve this problem. When I deploy and watch LiveView now, I see one websocket disconnect, and then it reconnects to another VM.

Our service propagation is still slow, so I occasionally see it reconnect multiple times if it ends up on a soon-to-be-destroyed VM. But each connection succeeds almost instantly.


Excellent news, thanks for taking care of this! I’ll be able to try this out later today.

Hey @kurt, this may be going a bit off-topic.
Please reply if appropriate:

  1. For single-region orgs, or orgs where most of the action is in a single region, would it help to scope/prioritize the service propagation to that region alone? Or do you have to solve it at a global scale (because that’s the true/right solution)?

  2. Is this where you’d use the VM API to find out when a VM is being shut down/destroyed, so you learn a VM’s state earlier and avoid routing extra requests to a soon-to-be-destroyed VM?

@kurt it seems to be better behaved, but I don’t think I’d call this solved. See this video covering part of the deployment. I didn’t catch the beginning of the initial failure / reconnection cycle shown, but it went on for a total of about 30 seconds. That wouldn’t be a great experience.

Separately, on another attempt it failed to load the JS file with a 404.

Do you think this is now all due to the slow service propagation issue? My understanding was that running with scale 3 is a solid workaround for now, and we are running with 3 instances.

Unfortunately, we had to roll back the change before you tried it, so you probably didn’t get the websocket fix. It’s deployed again now, though!

The 404 from the assets is probably due to the deploy. When asset fingerprints change, it’s somewhat common to get HTML with one fingerprint, and then have the request land on a VM with a different image. It’s not caused by the slow service propagation, but it is more obvious when it’s slow to remove old nodes and discover new ones.


Thanks @kurt, I just deployed and I’m no longer seeing the websocket problem!
