Slow deploys causing outages?

Hi,

New to Fly.io here, so I may be doing something entirely wrong, but I’m trying to run a barebones PHP example:

fly.toml

app = "php-swoole-test"

kill_signal = "SIGINT"
kill_timeout = 5

[[services]]
  internal_port = 9501
  protocol = "tcp"

  [services.concurrency]
    hard_limit = 30
    soft_limit = 25

  [[services.ports]]
    handlers = ["http"]
    port = 80

  [[services.tcp_checks]]
    grace_period = "1s"
    interval = "15s"
    timeout = "1s"

Dockerfile:

FROM phpswoole/swoole:php8.0-alpine
COPY app.php .
EXPOSE 9501
ENTRYPOINT ["php", "app.php"]

app.php

<?php

$server = new Swoole\HTTP\Server('0.0.0.0', 9501, SWOOLE_BASE);

$server->on("request", function (Swoole\Http\Request $request, Swoole\Http\Response $response) {
    $response->header("Content-Type", "text/plain");
    $response->end("Hello");
});

$server->start();

The simple application above works fine once it’s deployed and has had some time to settle. However, running flyctl deploy results in some strange behavior:

  • Once the health checks on the new version pass, it still waits about 30 seconds before sending SIGINT to the old VM.
  • During the deploy I can’t reach the endpoint (just via http://ip-address), and it often takes 30-120s afterward before I can successfully hit the service again. I’m seeing the same problem in both the sea and ord regions (scale=1, single-VM deployments). The commands I’m using to watch this are below.
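
For reference, here’s roughly how I’ve been watching this; the curl loop is just an illustrative probe, and <app-ipv4> is a placeholder for the app’s public address:

# kick off the deploy, then watch allocation status from a second terminal
flyctl deploy
flyctl status

# probe the public endpoint once a second to see when it becomes reachable again
while true; do
  curl -s -o /dev/null -w "%{http_code}\n" http://<app-ipv4>/
  sleep 1
done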

I’m noticing a similar thing, using Node.js.

Does Fly take a minute before the app is reachable after a deploy?

If this is common, I’d love to know so I can make sure to only push code at non-peak times.

What are the scaling settings in these cases? fly scale show and fly autoscale show should give you that info.
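
For example (run these next to your fly.toml, or name the app explicitly with -a):

flyctl scale show -a php-swoole-test
flyctl autoscale show -a php-swoole-test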

There are some cases where the min and max can prevent a clean rollover: if min=max=1, for instance, I’d expect to see some downtime, because that tells Fly to make sure only one server is running at a time.
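
If autoscaling is what’s pinning you at a single VM, raising the minimum should help. A rough sketch; the exact autoscale syntax may vary between flyctl versions, so check flyctl autoscale --help:

# keep at least two VMs running so one can serve traffic during the rollover
flyctl autoscale set min=2 max=4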

App service information is slow to propagate to all of our edge proxies. We’re working on this, but it’s a big hairy architectural problem.

When you’re running an app with a single VM, this can cause slow requests or 502s during a deploy. The first VM will go away, our edge proxies won’t all know that, and they won’t have seen the new VM yet. If you’re running two VMs, you probably won’t have this issue, but only because it takes longer for both old VMs to go away, which gives the edge proxies time to “see” the new ones.
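
A concrete way to get to two VMs, using the sea and ord regions mentioned above (adjust the regions and count to whatever fits your app):

flyctl regions set sea ord
flyctl scale count 2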

@kurt okay, thanks, I’ll try it out with more than one VM and various deployment strategies too. If there’s eventually a deployment strategy we can select that waits for the edge proxies to pick up the new VMs before tearing down the old ones, that would be awesome too.
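
For anyone following along, this is the kind of thing I mean by trying different strategies; the names come from flyctl deploy --strategy, and which ones apply may depend on your flyctl version:

flyctl deploy --strategy canary
flyctl deploy --strategy bluegreen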

We’re absolutely going to fix that, for what it’s worth. It’s just a very hard problem, and it’s taking us some time.