Hi @Yaeger, I know you’ve also been in touch through email support, but I wanted to follow up here on the issues you reported recovering your app after a single host in the ams region became unavailable.
First, it looks like you were running an app with one Machine and an attached Volume, which can only be recovered by restoring a daily snapshot to a new Volume and re-deploying (sketched below). This is why we strongly recommend against a single Volume for any app that stores important data or needs to be highly available (there are warnings all over our CLI and dashboard). A single Machine with an attached Volume is only okay for a couple of very limited use cases:

- your app is in development and you’re not yet worried about downtime, or
- your app can handle downtime and has a custom backup procedure.
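If it helps, here’s a rough sketch of that recovery flow with flyctl. The volume ID, snapshot ID, volume name, and app name below are all placeholders; check `fly volumes snapshots list --help` and `fly volumes create --help` for the exact flags on your flyctl version:

```
# List the daily snapshots for the Volume that was on the unavailable host
fly volumes snapshots list vol_xxxxxxxxxxxx

# Create a new Volume in the same region from the most recent snapshot
fly volumes create data --snapshot-id vs_xxxxxxxxxxxx --region ams --app my-app

# Re-deploy so a new Machine starts up with the restored Volume attached
fly deploy --app my-app
```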
When a host is unavailable, you should expect 408 timeout errors for any Machine/Volume API operations. However, to help clean up these resources after recovery, you can force-destroy a Machine on an unavailable host with `fly machine destroy --force` - see Minimizing Impact of Dead Hosts: New Features and Recovery Techniques for more info about this new option.
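For example (the Machine ID and app name here are placeholders):

```
# A normal destroy would time out with a 408 because the host is unreachable;
# --force removes the Machine record anyway so you can clean up and move on.
fly machine destroy 17811953c92e18 --force --app my-app
```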
The `fly scale count` error you encountered was a bug that caused `scale count` to fail on apps with unavailable Machines, and the opaque error message (“Oops, something went wrong!”) was not at all helpful for figuring out what was causing the failure. We’ve now published fixes for both of these issues - #3923 (fixing the regression in `fly scale count`) was published in v0.2.127 and #3850 (showing stack traces for flyctl errors) will be included in the next release.
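Once you’re on v0.2.127 or later, `fly scale count` should work again even while a Machine is unreachable. Something like this (the app name and count are just examples):

```
# Confirm your flyctl includes the #3923 fix
fly version

# Scale back up; this previously failed with the opaque error above
fly scale count 2 --app my-app
```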