Slow deploys causing outages?

Hi,

New to Fly.io here, so I may be doing something entirely wrong, but I’m trying to run a barebones PHP example:

fly.toml

app = "php-swoole-test"

kill_signal = "SIGINT"
kill_timeout = 5

[[services]]
  internal_port = 9501
  protocol = "tcp"

  [services.concurrency]
    hard_limit = 30
    soft_limit = 25

  [[services.ports]]
    handlers = ["http"]
    port = 80

  [[services.tcp_checks]]
    grace_period = "1s"
    interval = "15s"
    timeout = "1s"

Dockerfile:

FROM phpswoole/swoole:php8.0-alpine
COPY app.php .
EXPOSE 9501
ENTRYPOINT ["php", "app.php"]

app.php

<?php

$server = new Swoole\HTTP\Server('0.0.0.0', 9501, SWOOLE_BASE);

$server->on("request", function (Swoole\Http\Request $request, Swoole\Http\Response $response) {
    $response->header("Content-Type", "text/plain");
    $response->end("Hello");
});

$server->start();

The simple application above works fine once it’s deployed and has had some time to settle. However, running flyctl deploy results in some strange behavior:

  • Once the health checks on the new version pass, it still waits about 30 seconds before sending SIGINT to the old VM.
  • During the deploy I can’t reach the endpoint (just via http://ip-address), and it often takes 30-120s afterward before I can successfully hit the service again. I’m seeing the same problem in both the sea and ord regions (scale=1, single-VM deployments). The commands I’m using to watch this are below.
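
For reference, here’s roughly how I’ve been watching this; the curl loop is just an illustrative probe, and <app-ipv4> is a placeholder for the app’s public address:

# kick off the deploy, then watch allocation status from a second terminal
flyctl deploy
flyctl status

# probe the public endpoint once a second to see when it becomes reachable again
while true; do
  curl -s -o /dev/null -w "%{http_code}\n" http://<app-ipv4>/
  sleep 1
done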

I’m noticing a similar thing, using Node.js.

Does Fly take a minute before the app is reachable after a deploy?

If this is common, I’d love to know so I can make sure to only push code at non-peak times.

What are the scaling settings in these cases? fly scale show and fly autoscale show should give you that info.
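
For example (run these next to your fly.toml, or name the app explicitly with -a):

flyctl scale show -a php-swoole-test
flyctl autoscale show -a php-swoole-test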

There are some cases where the min and max can prevent a clean rollover: if min=max=1, for instance, I’d expect to see some downtime, because that tells Fly to make sure only one server is running at a time.
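
If autoscaling is what’s pinning you at a single VM, raising the minimum should help. A rough sketch; the exact autoscale syntax may vary between flyctl versions, so check flyctl autoscale --help:

# keep at least two VMs running so one can serve traffic during the rollover
flyctl autoscale set min=2 max=4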

App service information is slow to propagate to all of our edge proxies. We’re working on this, but it’s a big hairy architectural problem.

When you’re running an app with a single VM, this can cause slow requests or 502s during a deploy. The first VM will go away, our edge proxies won’t all know that, and they won’t have seen the new VM yet. If you’re running two VMs, you probably won’t have this issue, but only because it takes longer for both old VMs to go away, which gives the edge proxies time to “see” the new ones.
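
A concrete way to get to two VMs, using the sea and ord regions mentioned above (adjust the regions and count to whatever fits your app):

flyctl regions set sea ord
flyctl scale count 2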

@kurt okay, thanks, I’ll try it out with more than one VM and various deployment strategies too. If there’s eventually a deployment strategy we can select that waits for the edge proxies to pick up the new VMs before tearing down the old ones, that would be awesome too.
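
For anyone following along, this is the kind of thing I mean by trying different strategies; the names come from flyctl deploy --strategy, and which ones apply may depend on your flyctl version:

flyctl deploy --strategy canary
flyctl deploy --strategy bluegreen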

We’re absolutely going to fix that, for what it’s worth. It’s just a very hard problem, and it’s taking us some time.