App suspended, can't re-start or re-deploy it

An app that’s been running for months was suspended at ~11:01 PM ET last night.

Restarting the app errors:

$ fly apps restart uptime-dazzit-com
Restarting machine 1781965b946689
Error: could not stop machine 1781965b946689: failed to restart VM 1781965b946689: internal: ...

Restarting the machine errors:

$ fly machine restart 1781965b946689
Restarting machine 1781965b946689
Error: failed to restart machine 1781965b946689: could not stop machine 1781965b946689: failed to restart VM 1781965b946689: internal: internal server error

Re-deploying the app errors:

$ fly deploy
==> Verifying app config
Validating /Users/.../uptime-dazzit-com/fly.toml
Platform: machines
✓ Configuration is valid
--> Verified app config
==> Building image
Searching for image 'louislam/uptime-kuma:1' remotely...
image found: img_y7nxpkxm8lnv8w25

Watch your deployment at https://fly.io/apps/uptime-dazzit-com/monitoring

Updating existing machines in 'uptime-dazzit-com' with rolling strategy
  [1/1] Waiting for 1781965b946689 [app] to have state: started
Error: timeout reached waiting for machine to started failed to wait for VM 1781965b946689 in started state: Get "https://api.machines.dev/v1/apps/uptime-dazzit-com/machines/1781965b946689/wait?instance_id=01HAA23BHFET68Z9S1CMCNY95G&state=started&timeout=60": net/http: request canceled
You can increase the timeout with the --wait-timeout flag

Monitoring shows, among other things:

2023-09-14T14:39:50.724 proxy[1781965b946689] yyz [error] machine is in a non-startable state: created
2023-09-14T14:41:08.182 proxy[1781965b946689] lga [error] could not find a good candidate within 90 attempts at load balancing

Mmmm… help?

A little more info.

That app initially went down at ~22:54 EDT, came up at ~22:58 EDT, and went down again at ~23:01 EDT.

Another app that I have in the same region went down at ~23:04 EDT, came up at ~23:49 EDT, and stayed up.

By “down” I mean: didn’t respond to requests.

So whatever happened to the one app seems to be the result of something that happened in the region.

Ping?

hi @michaell

Could you try deleting that Machine with fly m destroy <machine id> --force and then run fly deploy again? This shouldn’t normally happen, but the Machine may be stuck in a weird state. Let us know if that doesn’t work.

Unfortunately, apps might have downtime if they only have one Machine and its host reboots or has issues. You can run two Machines and use auto start and stop to make sure they only run when needed.
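A minimal sketch of the two-Machine suggestion, assuming the app name from this thread and that `fly scale count` is how the second Machine is added. It prints the command by default and only runs it when `DRY_RUN=0` is set. Auto start and stop is configured in fly.toml (the `auto_start_machines` and `auto_stop_machines` settings) rather than on the command line, so it is not shown here. Volumes are not shared between Machines, so an app that keeps its data on a single volume would need more than this.

```shell
#!/bin/sh
# Sketch only: APP is the app name from this thread; change it for your own app.
APP="uptime-dazzit-com"

# Print each command instead of running it, unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Ask for two Machines so one can keep serving if the other one's host has problems.
scale_to_two() {
  run fly scale count 2 --app "$APP"
}

scale_to_two
```

Run it once as-is to see what would be executed, then again with `DRY_RUN=0` once the output looks right.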

OK, I’ll look into that. But first I need to make sure that destroying a machine doesn’t destroy the attached volume. :)

(And the attached volume is why there’s only one machine.)

Well, forcing the destroy worked. But the new machine didn’t use the existing volume, it created a new one. Now I’m trying to figure out how to attach an old volume to a new machine. Not obvious!

OK, I cloned the new machine, attaching the old volume in the process, and then destroyed the machine that was cloned.
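The recovery described above might look roughly like the sketch below. The exact commands were not posted in the thread, so this is a reconstruction under assumptions: the Machine ID, volume ID, and mount path are hypothetical placeholders, and `--attach-volume` is the `fly machine clone` flag as documented in `fly machine clone --help`. The script prints the commands by default and only runs them when `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch only: every ID and path below is a hypothetical placeholder.
APP="uptime-dazzit-com"
NEW_MACHINE_ID="<new-machine-id>"   # Machine created by the fresh deploy
OLD_VOLUME_ID="<old-volume-id>"     # volume that still holds the existing data
MOUNT_PATH="/data"                  # assumed mount path inside the Machine

# Print each command instead of running it, unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

recover_volume() {
  # 1. List volumes to find the ID of the old, now unattached volume.
  run fly volumes list --app "$APP"
  # 2. Clone the new Machine, attaching the old volume to the clone.
  run fly machine clone "$NEW_MACHINE_ID" \
    --attach-volume "$OLD_VOLUME_ID:$MOUNT_PATH" --app "$APP"
  # 3. Destroy the Machine that was cloned; it still uses the new, empty volume.
  run fly machine destroy "$NEW_MACHINE_ID" --force --app "$APP"
}

recover_volume
```

Step 3 is destructive, which is why the dry run is the default. The new, empty volume is left behind and can be removed separately once the app is confirmed healthy.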

I’m back up.

Thank you!

Glad to hear it!
The new Machine should have picked up the “old” existing volume, as long as the volume had the name specified in [mounts] source. It’s possible the volume didn’t have time to sync up between being detached and the deploy. In any case, cloning was the next best thing. Happy that it worked!
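One way to check the name match mentioned above before redeploying is to read the `source` value out of fly.toml and compare it with the names shown by `fly volumes list`. The helper below is a hypothetical sketch, not part of flyctl: `mounts_source` is a made-up function name, `uptime_data` is a made-up volume name, and the parsing is deliberately simple (it assumes a double-quoted value with no trailing comment).

```shell
#!/bin/sh
# Print the `source` value from the [mounts] (or [[mounts]]) section of a fly.toml.
mounts_source() {
  awk '
    # Track whether the current section header is the mounts section.
    /^[ \t]*\[/ { in_mounts = ($0 ~ /^[ \t]*\[+mounts\]+/) }
    in_mounts && /^[ \t]*source[ \t]*=/ {
      sub(/^[^=]*=[ \t]*/, "")   # drop everything up to and including the equals sign
      gsub(/"/, "")              # drop the surrounding double quotes
      print
      exit
    }
  ' "$1"
}

# Example with a throwaway config file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
app = "uptime-dazzit-com"

[mounts]
  source = "uptime_data"
  destination = "/data"
EOF
echo "fly.toml expects a volume named: $(mounts_source "$tmp")"
rm -f "$tmp"
```

If the printed name does not appear in the output of `fly volumes list`, a deploy has no existing volume with that name to attach.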
