Single VM deploy causes temporary unavailability

I have an Elixir app that I am developing and have a single app instance deployment using the default canary strategy.

I’m observing the following order of operations after running fly deploy:

  1. Health checks pass for new instance
  2. Old instance is shut down
2022-01-21T19:55:06.720 app[5b9740cd] ewr [info]Sending signal SIGTERM to main child process w/ PID 510
2022-01-21T19:55:06.720 app[5b9740cd] ewr [info]19:55:06.720 [notice] SIGTERM received - shutting down
  1. Connections continue routing to the old instance for a time
error.message="problem connecting to app instance" 2022-01-21T19:55:07.403 proxy[5b9740cd] ewr [error]error.code=2000 
error.message="problem connecting to app instance" 2022-01-21T19:55:08.046 proxy[5b9740cd] ewr [error]error.code=2000 
...
error.message="App connection timed out" 2022-01-21T19:55:12.962 proxy[5b9740cd] ewr [error]error.code=2001 
error.message="App connection timed out" 2022-01-21T19:55:15.192 proxy[5b9740cd] ewr [error]error.code=2001 
...
error.message="Internal problem" 2022-01-21T19:55:15.678 proxy[5b9740cd] ewr [error]error.code=2 
error.message="Internal problem" 2022-01-21T19:55:15.781 proxy[5b9740cd] ewr [error]error.code=2 
error.message="Internal problem" 2022-01-21T19:55:15.960 proxy[5b9740cd] ewr [error]error.code=2 
  1. Visits to the site in this short window all time out with issues

  2. Deployment eventually completes and requests all go to the new instance

Reading around the forum, it seems like this might be related to @kurt’s comment in the following issue about slow service propagation.

Does this seem like the same issue? Is there anything I can do to debug further?

Thanks!

Yes that’s almost definitely the same issue. We are so close to having this fixed, so the answer might be “just bear with us”. But the workaround answer is that running 3 nodes will likely mask the problem.

1 Like

Cool thank you @kurt. I’ll give that a shot. Is there anywhere I can check back in when you guys have fixed that?

I will try to make a point to reply here. We will make a HUGE deal of it in the forums when we get it solved. So check back here occasionally?

3 Likes

Will do, thanks!

Give it a try now, it should be much better. :smiley:

3 Likes

Was running the following to check the deployment.
Running my application with a scale count of 3.
Also running my application behind Cloudflare.

while true; do curl -s -o /dev/null -w "%{http_code}\n" https://karambit.ai; sleep 1; done

Previously, I would get tons of 525 errors as Cloudflare would be unable to reach the backend due to the service discovery issue and take some time to get healthy.

This time around, I got consistent 200s as I did my canary deploy as I would expect.

Looks fantastic, thanks for the update @kurt and kudos to the team for the fix

Massive thanks, I’ve been having these deployment downtime issues even with 2 instances and bluegreen strategy to mitigate it a bit. It was a bit of a pain.

Today I didn’t have any downtime while deploying :star_struck: I also reverted our app’s deployment strategy to canary as it works equally as well.