Auto stopping machines won't start after bluegreen deployment, even though deployment was successful

hilja · May 22, 2025, 6:46am

Can bluegreen deployments be used with machines that have auto_stop_machines enabled?

I’ve got a bunch of big machines that run heavy tasks relatively rarely, so I have the following config:

[[services]]
  internal_port = 8080
  processes = ["app"]
  protocol = "tcp"
  auto_stop_machines = "stop"
  auto_start_machines = true
  min_machines_running = 0

I just tried to switch to bluegreen deployments and configured some healthchecks (notice no interval specified):

[[services.http_checks]]
    grace_period = "3s"
    path = "/healthcheck"
    timeout = "3s"
    tls_skip_verify = false

According to the logs from my CI, the deployment goes swimmingly and the health check passes:

Creating green machines
  Created machine 568325d5cd1958 [app]
  Created machine e7840416aeed98 [app]
  Created machine e286ed95c754e8 [app]

Waiting for all green machines to start
  Machine 568325d5cd1958 [app] - started
  Machine e286ed95c754e8 [app] - started
  Machine e7840416aeed98 [app] - started

Waiting for all green machines to be healthy
  Machine 568325d5cd1958 [app] - 1/1 passing
  Machine e286ed95c754e8 [app] - 1/1 passing
  Machine e7840416aeed98 [app] - 1/1 passing

Marking green machines as ready
  Machine e7840416aeed98 [app] now ready
  Machine 568325d5cd1958 [app] now ready
  Machine e286ed95c754e8 [app] now ready

Checkpointing deployment, this may take a few seconds...

Waiting before cordoning all blue machines
  Machine 4d899736f45498 [app] cordoned
  Machine 56832654f11198 [app] cordoned
  Machine 3d8d4529f23228 [app] cordoned

Waiting before stopping all blue machines

Stopping all blue machines

Waiting for all blue machines to stop
  Machine 3d8d4529f23228 [app] - stopped
  Machine 4d899736f45498 [app] - stopped
  Machine 56832654f11198 [app] - stopped

Destroying all blue machines
  Machine 3d8d4529f23228 [app] destroyed
  Machine 4d899736f45498 [app] destroyed
  Machine 56832654f11198 [app] destroyed

Deployment Complete

But the dashboard shows the machines are not started. Is it a subsequent check that fails, even though I have not specified an interval?

servicecheck-00-http-8080 	warning 	the machine hasn't started

I tried setting the interval to 0, but apparently the minimum is 2s. I also set it to a year (525600m), but the machines still remain stopped with the same warning.

After a release I don’t see the healthcheck request in the logs, but if I manually start the machine I do see it. When the machine shuts down again, it shows the “the machine hasn’t started” warning and refuses to auto start.

Can I do one check at the beginning and that’s it?

pavel · May 22, 2025, 10:04am

Hey @hilja

Are you using fly-replay to a specific instance or fly-force-instance-id request header, by any chance?

I’m looking through the proxy logs for your app, and the proxy complains that it can’t find a specific instance by ID, as the instance is already destroyed.

hilja · May 22, 2025, 11:00am

You’re right, I am hard-coding the machine ID in fly-force-instance-id because of a unique use case.

I didn’t realize immediately that bluegreen boots up new machines with new IDs. I’m now treating the IDs as ephemeral data, and the servers do restart when I hit them with a request.
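For anyone hitting the same thing, here is a minimal sketch of treating machine IDs as ephemeral: look up the current machines right before sending a request instead of hard-coding an ID. It assumes the Fly Machines API lists machines at `https://api.machines.dev/v1/apps/<app>/machines` and returns a JSON array of objects with `id` and `state` fields. The helper names (`list_machines`, `pick_machine_id`, `routing_headers`) are made up for illustration and are not part of any Fly library.

```python
import json
import urllib.request

# Assumption: base URL of the Fly Machines API.
MACHINES_API = "https://api.machines.dev/v1"


def list_machines(app_name, api_token):
    """Fetch the app's current machines (assumed: a JSON array with "id" and "state")."""
    req = urllib.request.Request(
        f"{MACHINES_API}/apps/{app_name}/machines",
        headers={"Authorization": f"Bearer {api_token}"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def pick_machine_id(machines):
    """Pick an ID from the current list, preferring started machines.

    Returns None when the list is empty, so the caller can fall back to
    sending the request without fly-force-instance-id.
    """
    if not machines:
        return None
    started = [m for m in machines if m.get("state") == "started"]
    return (started or machines)[0]["id"]


def routing_headers(machine_id):
    """Build the request header that pins a request to one machine."""
    return {"fly-force-instance-id": machine_id}
```

Calling `pick_machine_id(list_machines(app, token))` after each deploy (or before each request) means a bluegreen rollout that replaces every machine no longer leaves the client pointing at a destroyed ID.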

But the dashboard still shows a servicecheck-00-http-8080 warning on each machine. I guess it’s just a warning that can be ignored. I have the interval still set to a year. I would gladly disable the checks on machines that have a short lifespan, but it’s needed for the bluegreen.

I’m also unsure why it gets hit with 5 health checks in the span of a few seconds.

hilja · May 22, 2025, 11:33am

Reading from the bottom, I assume the first item is the automatic shutdown signal.

How come it then tries to restart the system (“reboot: Restarting system”)? After that the health check fails, which causes the warning in the dashboard, I assume.

2025-05-22 12:48:05.625	Health check on port 8080 has failed. Your app is not responding properly.
2025-05-22 12:48:05.625	Health check on port 8080 has failed. Your app is not responding properly. Services exposed on ports [80, 443] will have intermittent failures until the health check passes.
2025-05-22 12:48:04.315	[308.528332] reboot: Restarting system
2025-05-22 12:48:04.314	WARN could not unmount /rootfs: EINVAL: Invalid argument
2025-05-22 12:48:04.314	INFO Starting clean up.
2025-05-22 12:48:04.297	INFO Main child exited normally with code: 0
2025-05-22 12:48:03.454	INFO Sending signal SIGINT to main child process w/ PID 656
2025-05-22 12:48:03.449	App lgh has excess capacity, autostopping machine 3d8d4292c20328. 0 out of 1 machines left running (region=ams, process group=app)

pavel · May 22, 2025, 12:54pm

It doesn’t try to restart it. The message is always “reboot: Restarting system”, even if the machine is being stopped. Whether or not to start the machine again is decided by the orchestrator based on the exit code; the machine itself doesn’t really know. In your case it’s not getting started again because the exit code is 0.

Health checks fail because the machine is stopped. We still run health checks for stopped machines, AFAIR, though this may change in the future. It doesn’t really matter for autostart, as healthcheck status is ignored for stopped machines.
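As a rough mental model of the two rules above, here is a toy sketch. It is not Fly's actual orchestrator or proxy logic; the function names and the exact conditions are assumptions that only restate what is described in this post.

```python
def restarts_after_exit(exit_code: int) -> bool:
    """Toy rule: the orchestrator brings the machine back only after a
    non-zero exit. A clean autostop exits with code 0, so nothing restarts it."""
    return exit_code != 0


def can_autostart(machine_state: str, check_status: str) -> bool:
    """Toy rule: a stopped machine can be started by an incoming request.

    check_status is deliberately unused here, because health check status
    is ignored for stopped machines.
    """
    return machine_state == "stopped"
```

So a machine that was autostopped (exit code 0) stays down until a request arrives, and the dashboard warning on a stopped machine does not block that start.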

hilja · May 22, 2025, 1:08pm

Got it, thanks. Good to know it’s working as intended. I’ll ignore the warnings.
