Single VM deploy causes temporary unavailability

eric-karambit-ai · January 21, 2022, 8:19pm

I have an Elixir app that I am developing and have a single app instance deployment using the default canary strategy.

I’m observing the following order of operations after running fly deploy:

Health checks pass for new instance
Old instance is shut down

2022-01-21T19:55:06.720 app[5b9740cd] ewr [info]Sending signal SIGTERM to main child process w/ PID 510
2022-01-21T19:55:06.720 app[5b9740cd] ewr [info]19:55:06.720 [notice] SIGTERM received - shutting down

Connections continue routing to the old instance for a time

2022-01-21T19:55:07.403 proxy[5b9740cd] ewr [error]error.code=2000 error.message="problem connecting to app instance"
2022-01-21T19:55:08.046 proxy[5b9740cd] ewr [error]error.code=2000 error.message="problem connecting to app instance"
...
2022-01-21T19:55:12.962 proxy[5b9740cd] ewr [error]error.code=2001 error.message="App connection timed out"
2022-01-21T19:55:15.192 proxy[5b9740cd] ewr [error]error.code=2001 error.message="App connection timed out"
...
2022-01-21T19:55:15.678 proxy[5b9740cd] ewr [error]error.code=2 error.message="Internal problem"
2022-01-21T19:55:15.781 proxy[5b9740cd] ewr [error]error.code=2 error.message="Internal problem"
2022-01-21T19:55:15.960 proxy[5b9740cd] ewr [error]error.code=2 error.message="Internal problem"

Visits to the site during this short window all time out
Deployment eventually completes and requests all go to the new instance

Reading around the forum, it seems like this might be related to @kurt’s comment in the following issue about slow service propagation.

Does this seem like the same issue? Is there anything I can do to debug further?

Thanks!

kurt · January 21, 2022, 8:53pm

Yes that’s almost definitely the same issue. We are so close to having this fixed, so the answer might be “just bear with us”. But the workaround answer is that running 3 nodes will likely mask the problem.
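
As a rough sketch of that workaround, assuming flyctl is installed and logged in and the commands are run from the app's directory (where fly.toml lives), scaling to three instances could look like this:

```shell
# Assumption: flyctl is installed, authenticated, and run from the app's directory.
# Scale the app to three instances so a deploy never leaves zero healthy VMs.
fly scale count 3

# Check that all three instances are up and passing health checks.
fly status
```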

eric-karambit-ai · January 21, 2022, 8:55pm

Cool thank you @kurt. I’ll give that a shot. Is there anywhere I can check back in when you guys have fixed that?

kurt · January 21, 2022, 8:55pm

I will try to make a point to reply here. We will make a HUGE deal of it in the forums when we get it solved. So check back here occasionally?

eric-karambit-ai · January 21, 2022, 8:56pm

Will do, thanks!

kurt · February 23, 2022, 2:44am

Give it a try now, it should be much better.

eric-karambit-ai · February 23, 2022, 3:27am

Was running the following to check the deployment.
Running my application with a scale count of 3.
Also running my application behind Cloudflare.

while true; do curl -s -o /dev/null -w "%{http_code}\n" https://karambit.ai; sleep 1; done

Previously, I would get tons of 525 errors because Cloudflare couldn't reach the backend during the service discovery issue, and it would take some time to get healthy again.

This time around, I got consistent 200s throughout my canary deploy, as I would expect.

Looks fantastic, thanks for the update @kurt and kudos to the team for the fix
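
For anyone who wants a summary instead of eyeballing the output, the curl loop above can be extended to tally status codes over a fixed window. This is only a sketch: the `poll` and `tally_codes` helper names are made up for illustration, and the 5-second curl timeout and 60-probe window are arbitrary choices.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for watching a deploy; names and defaults are illustrative.

# tally_codes: read one HTTP status code per line on stdin,
# print "<code> <count>" for each distinct code, sorted by code.
tally_codes() {
  sort | uniq -c | awk '{ print $2, $1 }'
}

# poll URL N: probe URL once per second, N times, printing one status code per line.
# curl prints 000 when it cannot connect or times out (-m 5 caps each probe at 5s).
poll() {
  local url="$1" n="$2" i
  for ((i = 0; i < n; i++)); do
    curl -s -o /dev/null -m 5 -w "%{http_code}\n" "$url"
    sleep 1
  done
}

# Example (run in one terminal while `fly deploy` runs in another):
#   poll https://karambit.ai 60 | tally_codes
```

A clean deploy should print a single line for 200; any other code (525 from Cloudflare, or 000 for a failed connection) shows up as its own line with a count.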

guims767 · February 23, 2022, 6:35pm

Massive thanks, I’ve been having these deployment downtime issues even with 2 instances and bluegreen strategy to mitigate it a bit. It was a bit of a pain.

Today I didn’t have any downtime while deploying. I also reverted our app’s deployment strategy to canary, as it works equally well.
