What would cause blank logs after 'Starting virtual machine' (only with a volume attached)?

I’m trying to run an app in my personal Fly instance with a volume using the below fly.toml, testing out resiliency as we try to use it for our organization (also on my account, you can see it not scaled up yet).

Putting a volume on the app seems to reliably prevent it from allocating (I get ‘unhealthy allocations’), with no other changes to the config as far as I can tell. I’ve tried this app in both EWR and IAD (trying a volume on each, it’s just a small, simple 3GB volume).

I’ve also managed to see LVM errors in the fly deploy output (a ‘file already exists’ from LVM leaked through) in switching from EWR to IAD and then back – but I’m pretty sure these started /after/ it failed to allocate, not before?

The thing that’s causing the most trouble is that the logs all look like this when the volume is attached – no output after ‘Starting virtual machine’ – if I simply remove the volume and try again, I get actual logs and a running machine, albeit with no volume:

After a roundtrip through fly config save the fly.toml looks like the paste below (I’ve had no luck getting rid of services.concurrency).

Since I’ve been able to get the app to start and stay up simply by commenting out the [[mounts]] section, the various remedies in this forum around unhealthy allocations don’t seem to make all that much sense.

Any thoughts at all on what would cause this?

With the mounts section commented out:

fly.toml:

app = "quiet-datomic"
kill_signal = "SIGINT"
kill_timeout = 5
processes = []

[env]

[experimental]
  allowed_public_ports = []
  auto_rollback = false
  private_network = true

[[mounts]]
  destination = "/run/datomic"
  encrypted = false
  source = "run"

[[services]]
  http_checks = []
  internal_port = 4334
  ports = []
  processes = ["app"]
  protocol = "tcp"
  script_checks = []
  tcp_checks = []
  [services.concurrency]
    hard_limit = 25
    soft_limit = 20
    type = "connections"

Hmm :thinking: Certainly sounds like it is related to the volume.

The only thing that stands out to me is that encrypted = false since that’s not a documented value. I know volumes are created encrypted-at-rest so if your volume is encrypted but your fly.toml says it’s not, perhaps that could cause the vm to fail to start? Or perhaps your volume is in fact not encrypted and so it’s nothing to do with that!

The encrypted = false stood out to me too and I hadn’t yet had a chance to try out changing that. I also wondered if it was trying to mount an encrypted volume unencrypted and locking up.

I actually created it without that value, originally writing:

[[mounts]]
  source = "run"
  destination = "/run/datomic"

But then when I pushed it up and then round-tripped it by running fly config save to make sure that I was getting back the ‘true configuration’ that was in use by the app, the encrypted = false appeared by itself.

I actually noticed the same on a vanilla fly postgres app on which I ran fly config save the other day — even though the Postgres app always says ‘loading encrypted volume’ (or something) in logs when starting up, the saved config had that encrypted = false value.

I think the issue existed before I pushed a version with that explicitly in there, but let me try modifying it and see if that fixes anything.

Nope, no dice I’m afraid @greg. I changed it to encrypted = true and that made no difference.

Any thoughts at all on how else I could proceed here?

Ah :frowning:

While waiting for someone from Fly, you could definitively double-check the encryption status by using fly volumes show THEIDHERE (passing the ID of your volume). And that should show if it is actually encrypted. Not sure why Fly adds false to your toml if it is, but again, that’s not a documented value so I don’t know how its innards work as regards that.

Also, is the volume being mounted in the same region you are deploying the app to? I assume so but doesn’t hurt to check.

That’s really my only thought I’m afraid as what you have looks correct to me (as an outside observer).

So I decided to go back to the drawing board, destroying the entire app and then creating a new one, deploying it using this fly.toml under a new name, even:

app = "quiet-db"
kill_signal = "SIGINT"
kill_timeout = 5
processes = []

[env]

[experimental]
  allowed_public_ports = []
  auto_rollback = false
  private_network = true

[mounts]
  destination = "/run/datomic"
  source = "run"

[[services]]

I deployed it one time with the mounts commented out, then created a volume, created the mounts section and then deployed it – and I received this error:

Failed Instances

Failure #1

Instance
ID      	PROCESS	VERSION	REGION	DESIRED	STATUS	HEALTH CHECKS	RESTARTS	CREATED 
a704e28a	      	2      	ewr   	run    	failed	            	0       	24s ago	

Recent Events
TIMESTAMP           	TYPE           	MESSAGE                                                                                                                                                                                                                                   
2022-08-16T20:11:21Z	Received       	Task received by client                                                                                                                                                                                                                  	
2022-08-16T20:11:38Z	Task Setup     	Building Task Directory                                                                                                                                                                                                                  	
2022-08-16T20:11:49Z	Driver Failure 	rpc error: code = Unknown desc = unable to start microvm: link /opt/nomad/alloc/a704e28a-035e-41b7-c4cc-a02d0c5dd16c/default/run /opt/nomad/alloc/a704e28a-035e-41b7-c4cc-a02d0c5dd16c/default/firecracker/a704e28a/root/run: file exists	
2022-08-16T20:11:49Z	Not Restarting 	Error was unrecoverable                                                                                                                                                                                                                  	
2022-08-16T20:11:49Z	Alloc Unhealthy	Unhealthy because of failed task             

Hopefully this will help the folks are Fly figure this out! I’m not sure it’s really in the realm of things that we can fix from the outside?

Just to add to the pile of debugging options tried:

  • creating an entirely new organization with a new app name and trying this out also doesn’t work.
  • changing the Dockerfile to FROM alpine:latest CMD ["sleep", "100000"] also hangs on Starting

I managed to get things working with this fly.toml below.

I’d love any sort of explanation of what might have caused this issue from the Fly side, so that we can avoid tripping over this issue in the future – if it failed loudly it would have been alright, but the ‘silent’ failure by hanging made this really hard to debug in practice.

The two things that are different here are the services section… existing and the volume being renamed from ‘run’ to ‘datomic’.

app = "quiet-datomic"
kill_signal = "SIGINT"
kill_timeout = 5
processes = []

[env]

[experimental]
  allowed_public_ports = []
  auto_rollback = false

[mounts]
  destination = "/run/datomic"
  source = "datomic"

[[services]]
  http_checks = []
  internal_port = 4334
  processes = ["app"]
  protocol = "tcp"
  script_checks = []
  [services.concurrency]
    hard_limit = 25
    soft_limit = 20
    type = "connections"

  [[services.tcp_checks]]
    grace_period = "1s"
    interval = "15s"
    restart_limit = 0
    timeout = "2s"

Thanks for getting back to us with your fix! I think I might have an idea about what caused this for you. Volumes are mounted by name-- it’s possible that calling the source ‘run’ conflicted with the existing /run directory! I was able to reproduce this on a basic test app, and renaming it also resolved the issue.

2 Likes

Hahaha suspecting that ‘run’ might be reserved was the final prompt, so I’m glad to know that that was all it was.

I hope this check can make it into config validation or similar, but that’s easy enough to avoid for now!

1 Like