I’m trying to run an app in my personal Fly instance with a volume, using the fly.toml below, to test resiliency before we try to use it for our organization (the app is also on my account; you can see it hasn’t been scaled up yet).
Putting a volume on the app seems to reliably prevent it from allocating (I get ‘unhealthy allocations’), with no other changes to the config as far as I can tell. I’ve tried this app in both EWR and IAD (trying a volume in each region; it’s just a small, simple 3GB volume).
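For context, the mounts section being tested was of roughly this shape (a sketch, not the exact paste from the app; the source name ‘run’ comes from later in the thread, and the destination path is an assumption):

```toml
# Hypothetical reconstruction of the mounts section under test.
# The destination path here is an assumption, not from the thread.
[[mounts]]
  source = "run"
  destination = "/data"
```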
I’ve also managed to see LVM errors in the fly deploy output (a ‘file already exists’ from LVM leaked through) in switching from EWR to IAD and then back – but I’m pretty sure these started /after/ it failed to allocate, not before?
The thing that’s causing the most trouble is that the logs all look like this when the volume is attached – no output after ‘Starting virtual machine’ – if I simply remove the volume and try again, I get actual logs and a running machine, albeit with no volume:
After a roundtrip through fly config save the fly.toml looks like the paste below (I’ve had no luck getting rid of services.concurrency).
Since I’ve been able to get the app to start and stay up simply by commenting out the [[mounts]] section, the various remedies in this forum around unhealthy allocations don’t seem to make all that much sense.
Hmm. Certainly sounds like it’s related to the volume.
The only thing that stands out to me is that encrypted = false since that’s not a documented value. I know volumes are created encrypted-at-rest so if your volume is encrypted but your fly.toml says it’s not, perhaps that could cause the vm to fail to start? Or perhaps your volume is in fact not encrypted and so it’s nothing to do with that!
The encrypted = false stood out to me too and I hadn’t yet had a chance to try out changing that. I also wondered if it was trying to mount an encrypted volume unencrypted and locking up.
I actually created it without that value, originally writing:
But then when I pushed it up and then round-tripped it by running fly config save to make sure that I was getting back the ‘true configuration’ that was in use by the app, the encrypted = false appeared by itself.
I actually noticed the same on a vanilla fly postgres app on which I ran fly config save the other day — even though the Postgres app always says ‘loading encrypted volume’ (or something) in logs when starting up, the saved config had that encrypted = false value.
I think the issue existed before I pushed a version with that explicitly in there, but let me try modifying it and see if that fixes anything.
While waiting for someone from Fly, you could definitively double-check the encryption status by running fly volumes show THEIDHERE (passing the ID of your volume); that should show whether it is actually encrypted. Not sure why Fly adds false to your toml if it is, but again, that’s not a documented value, so I don’t know how Fly handles it internally.
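A quick sketch of that check (the volume ID below is just a placeholder, not a real ID):

```
$ fly volumes list                      # find the volume ID attached to the app
$ fly volumes show vol_xxxxxxxxxxxx    # inspect that volume's details
```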
Also, is the volume being mounted in the same region you are deploying the app to? I assume so but doesn’t hurt to check.
That’s really my only thought I’m afraid as what you have looks correct to me (as an outside observer).
So I decided to go back to the drawing board, destroying the entire app and then creating a new one, deploying it using this fly.toml under a new name, even:
I deployed it one time with the mounts commented out, then created a volume, created the mounts section and then deployed it – and I received this error:
Failed Instances
Failure #1
Instance
ID PROCESS VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
a704e28a 2 ewr run failed 0 24s ago
Recent Events
TIMESTAMP TYPE MESSAGE
2022-08-16T20:11:21Z Received Task received by client
2022-08-16T20:11:38Z Task Setup Building Task Directory
2022-08-16T20:11:49Z Driver Failure rpc error: code = Unknown desc = unable to start microvm: link /opt/nomad/alloc/a704e28a-035e-41b7-c4cc-a02d0c5dd16c/default/run /opt/nomad/alloc/a704e28a-035e-41b7-c4cc-a02d0c5dd16c/default/firecracker/a704e28a/root/run: file exists
2022-08-16T20:11:49Z Not Restarting Error was unrecoverable
2022-08-16T20:11:49Z Alloc Unhealthy Unhealthy because of failed task
Hopefully this will help the folks at Fly figure this out! I’m not sure it’s really in the realm of things we can fix from the outside?
Just to add to the pile of debugging options tried:
- creating an entirely new organization with a new app name and trying this out also doesn’t work.
- changing the Dockerfile to FROM alpine:latest / CMD ["sleep", "100000"] also hangs on ‘Starting’.
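That minimal image, written out as a Dockerfile (just the two lines mentioned above; it does nothing but sleep, to rule out the app itself):

```dockerfile
FROM alpine:latest
CMD ["sleep", "100000"]
```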
I managed to get things working with this fly.toml below.
I’d love any sort of explanation of what might have caused this issue from the Fly side, so that we can avoid tripping over this issue in the future – if it failed loudly it would have been alright, but the ‘silent’ failure by hanging made this really hard to debug in practice.
The two things that are different here are that the services section now exists, and that the volume was renamed from ‘run’ to ‘datomic’.
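The renamed mounts section, sketched out (the source name ‘datomic’ is from the fix; the destination path is an assumption):

```toml
# Sketch of the working mounts section after the rename.
# The destination path is an assumption, not from the thread.
[[mounts]]
  source = "datomic"
  destination = "/data"
```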
Thanks for getting back to us with your fix! I think I might have an idea about what caused this for you. Volumes are mounted by name, so it’s possible that calling the source ‘run’ conflicted with the existing /run directory! I was able to reproduce this on a basic test app, and renaming the volume also resolved the issue.
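A minimal sketch of that failure mode (this is not Fly’s actual mounting code, just an illustration of link(2) semantics): creating a link at a path where an entry already exists fails with ‘file exists’ (EEXIST), which is exactly what the driver surfaced in the ‘unable to start microvm’ error above.

```python
import os
import tempfile

# Illustration only: link(2) refuses to create its new path if
# something already exists there, raising EEXIST -- analogous to
# linking a volume dir named 'run' into a rootfs that already
# contains a /run directory.
workdir = tempfile.mkdtemp()

volume_src = os.path.join(workdir, "volume-run")
open(volume_src, "w").close()        # stand-in for the volume's mount source

rootfs_run = os.path.join(workdir, "run")
os.mkdir(rootfs_run)                 # the pre-existing /run in the image

try:
    os.link(volume_src, rootfs_run)  # same name collision as the deploy error
except FileExistsError:
    print("link failed: file exists")
```

Renaming the source sidesteps the collision entirely, since the link target no longer shares a name with a directory that ships in every Linux rootfs.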