Fly Machine becoming unresponsive and then stopping without explanation

I use machines to perform some on-demand workloads. They are awesome!

Today I ran into this though:

  1. start a machine with a specific command and it runs for some time
  2. keep polling the machine status via the API (see the sketch below)
  3. after a while (about 1h) the machine status API call begins to time out
  4. after the status call has timed out about 10 times (10 minutes), the machine just dies with

2023-02-02T17:22:39.019 app[1781574cd25989] fra [warn] Virtual machine exited abruptly
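
For reference, the polling loop is roughly the following (a minimal sketch, not an exact copy of my code; the api.machines.dev host, the FLY_API_TOKEN env var and the 60-second interval are assumptions about my setup):

import os
import time
import requests

# Poll the Machines API for the machine state once a minute.
# Use http://_api.internal:4280 instead if you reach the API over WireGuard.
API_HOST = "https://api.machines.dev"
APP_NAME = "yarabot-machines"
MACHINE_ID = "1781574cd25989"
TOKEN = os.environ["FLY_API_TOKEN"]

def machine_state() -> str:
    resp = requests.get(
        f"{API_HOST}/v1/apps/{APP_NAME}/machines/{MACHINE_ID}",
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["state"]

while True:
    try:
        print("machine state:", machine_state())
    except requests.Timeout:
        # this is the failure mode described above: the status call itself hangs
        print("machine status call timed out")
    time.sleep(60)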

After the first machine status API timeout, metrics for this machine stop appearing in Grafana.

Here are two machine IDs where this happened so far:
1781574cd25989
1781577b979378
app name: yarabot-machines

Please help!

Hi @honzasterba, thanks for reporting this and I see the API errors in our system. I’ll see what’s going wrong and hopefully get a fix out soon.


Ok, it turns out the underlying host became heavily loaded, which resulted in a lot of I/O and network slowness. We've been working to apply some better limits on our hosts, which should prevent this issue from happening again but could also result in being unable to schedule a machine of the size you're requesting (depending on the region).

Thanks!
As long as I get some reasonable error when resources are not available, so that I can try again later, I will be fine.
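
To be concrete, something like the following on my side would be perfectly fine (a sketch of the retry I have in mind; the backoff interval and the error handling are placeholders, since I don't know yet what the fixed capacity check will return):

import time
import requests

def create_machine_with_retry(session: requests.Session, url: str, body: dict,
                              attempts: int = 5, backoff_s: int = 300) -> dict:
    """Create a machine; if the host has no capacity, wait and try again."""
    for _ in range(attempts):
        resp = session.post(url, json=body)
        if resp.ok:
            return resp.json()
        print(f"create failed ({resp.status_code}): {resp.text}; retrying later")
        time.sleep(backoff_s)
    raise RuntimeError("could not schedule the machine after several attempts")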

I am not seeing the timeouts any more, but the machine is still unable to finish the work and exits prematurely.

Last failed run: 4d89044b6d6948
Current attempt: 3d8dd14c702289

if you’re running this via fly machine run, can you try running with --kernel-arg="LOG_FILTER=debug"?


I am starting the machine via the API. Is there a way to set that via the API?

Ah, ok, the kernel_args field is in the guest config and is an array of strings, so the following should work:

{
  ...
  "guest": {
    ...
    "kernel_args": [ "LOG_FILTER=debug" ]
  }
  ...
}
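
For context, a full create request with that field set looks roughly like this (a sketch only; the host, image and guest sizing below are placeholders rather than your actual config):

import os
import requests

API_HOST = "https://api.machines.dev"   # or http://_api.internal:4280 over WireGuard
APP_NAME = "yarabot-machines"
TOKEN = os.environ["FLY_API_TOKEN"]

body = {
    "config": {
        "image": "registry.fly.io/yarabot-machines:latest",  # placeholder image
        "guest": {
            "cpu_kind": "performance",   # placeholder sizing
            "cpus": 16,
            "memory_mb": 131072,         # 128GB
            "kernel_args": ["LOG_FILTER=debug"],
        },
    },
}

resp = requests.post(
    f"{API_HOST}/v1/apps/{APP_NAME}/machines",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=body,
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["id"], resp.json()["state"])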

Thanks! The last run doesn't seem to be crashing yet, fingers crossed! I will keep this for when it happens again. It's probably not a good idea to keep this option on for all jobs, right?

I wouldn’t keep it for all jobs because it results in excessive logging and is primarily meant for helping us debug issues.

We’re pretty sure we’ve narrowed the issue down to the machine getting scheduled on very busy hosts and is getting sigkill’d due to memory constraints.


I am getting timeouts and machines dying abruptly again.

2023-02-05T15:59:11.690 app[3d8dd14c79e089] fra [warn] Virtual machine exited abruptly

This is a bug in our capacity limits. It should be returning an error saying the machine can't start with the volume, not exiting after the fact.

128GB won’t reliably work well with volumes. We’ll get the error fixed up, but I’m pretty sure what you’re doing is not optimal on Fly.

I ran the machine with the debug flags as instructed. I was a little worried it was getting killed because it ran out of memory, but I have had 5 runs crash at different times during the job, so it's most definitely not the machine running out of memory; it is getting killed anyway for some other reason.

I can provide longer log output if necessary, but it's probably of no value.

2023-02-05T19:24:53Z app[9080177b645698] fra [info]flushed 3471 bytes
2023-02-05T19:25:03Z app[9080177b645698] fra [info]parsed 4 headers
2023-02-05T19:25:03Z app[9080177b645698] fra [info]incoming body is empty
2023-02-05T19:25:03Z app[9080177b645698] fra [info]sysinfo: Ok(SysInfo { memory: Memory { mem_total: 135087685632, mem_free: 62299095040, mem_available: Some(61487677440), buffers: 15601664, cached: 92086272, swap_cached: 0, active: 33931264, inactive: 72519397376, swap_total: 0, swap_free: 0, dirty: 0, writeback: 0, slab: 43261952, shmem: Some(13799424), vmalloc_total: 35184372087808, vmalloc_used: 8011776, vmalloc_chunk: 0 }, load_average: [1.0, 1.0, 1.0], cpus: {8: Cpu { user: 5678, nice: 0, system: 256, idle: 306842, iowait: Some(0), irq: Some(0), softirq: Some(0), steal: Some(162), guest: Some(0), guest_nice: Some(0) }, 9: Cpu { user: 4111, nice: 0, system: 199, idle: 308518, iowait: Some(0), irq: Some(0), softirq: Some(0), steal: Some(123), guest: Some(0), guest_nice: Some(0) }, 15: Cpu { user: 599, nice: 0, system: 72, idle: 312264, iowait: Some(0), irq: Some(0), softirq: Some(0), steal: Some(20), guest: Some(0), guest_nice: Some(0) }, 7: Cpu { user: 5540, nice: 0, system: 349, idle: 306921, iowait: Some(0), irq: Some(0), softirq: Some(0), steal: Some(145), guest: Some(0), guest_nice: Some(0) }, 4: Cpu { user: 32991, nice: 0, system: 1194, idle: 276840, iowait: Some(0), irq: Some(0), softirq: Some(0), steal: Some(1000), guest: Some(0), guest_nice: Some(0) }, 6: Cpu { user: 30267, nice: 0, system: 814, idle: 281195, iowait: Some(0), irq: Some(0), softirq: Some(0), steal: Some(676), guest: Some(0), guest_nice: Some(0) }, 11: Cpu { user: 12816, nice: 0, system: 538, idle: 299301, iowait: Some(0), irq: Some(0), softirq: Some(0), steal: Some(294), guest: Some(0), guest_nice: Some(0) }, 3: Cpu { user: 30026, nice: 0, system: 1279, idle: 280881, iowait: Some(0), irq: Some(0), softirq: Some(0), steal: Some(728), guest: Some(0), guest_nice: Some(0) }, 12: Cpu { user: 1528, nice: 0, system: 78, idle: 311303, iowait: Some(0), irq: Some(0), softirq: Some(0), steal: Some(48), guest: Some(0), guest_nice: Some(0) }, 2: Cpu { user: 47164, nice: 0, system: 1951, idle: 262734, iowait: Some(0), irq: Some(0), softirq: Some(0), steal: Some(1096), guest: Some(0), guest_nice: Some(0) }, 13: Cpu { user: 656, nice: 0, system: 51, idle: 312225, iowait: Some(0), irq: Some(0), softirq: Some(0), steal: Some(23), guest: Some(0), guest_nice: Some(0) }, 0: Cpu { user: 11937, nice: 0, system: 629, idle: 299795, iowait: Some(81), irq: Some(0), softirq: Some(28), steal: Some(443), guest: Some(0), guest_nice: Some(0) }, 1: Cpu { user: 71953, nice: 0, system: 2700, idle: 236502, iowait: Some(0), irq: Some(0), softirq: Some(55), steal: Some(1662), guest: Some(0), guest_nice: Some(0) }, 5: Cpu { user: 30850, nice: 0, system: 1111, idle: 280264, iowait: Some(0), irq: Some(0), softirq: Some(0), steal: Some(713), guest: Some(0), guest_nice: Some(0) }, 14: Cpu { user: 6962, nice: 0, system: 368, idle: 305449, iowait: Some(0), irq: Some(0), softirq: Some(0), steal: Some(176), guest: Some(0), guest_nice: Some(0) }, 10: Cpu { user: 17, nice: 0, system: 50, idle: 312880, iowait: Some(0), irq: Some(0), softirq: Some(0), steal: Some(7), guest: Some(0), guest_nice: Some(0) }}, processes: 139, disks: [DiskStat { name: "vda", reads_completed: 8495, reads_merged: 0, sectors_read: 155258, time_reading: 2548, writes_completed: 3527, writes_merged: 0, sectors_written: 46240, time_writing: 644, io_in_progress: 0, time_io: 5268, time_io_weighted: 3193 }], filesystems: [FileSystemStat { mount: "/", blocks: 2047854, block_size: 4096, blocks_free: 1517143, blocks_avail: 1408190 }], net: [NetworkDevice { name: "dummy0", recv_bytes: 0, 
recv_packets: 0, recv_errs: 0, recv_drop: 0, recv_fifo: 0, recv_frame: 0, recv_compressed: 0, recv_multicast: 0, sent_bytes: 0, sent_packets: 0, sent_errs: 0, sent_drop: 0, sent_fifo: 0, sent_colls: 0, sent_carrier: 0, sent_compressed: 0 }, NetworkDevice { name: "eth0", recv_bytes: 233005555, recv_packets: 48509, recv_errs: 0, recv_drop: 0, recv_fifo: 0, recv_frame: 0, recv_compressed: 0, recv_multicast: 0, sent_bytes: 1656684, sent_packets: 24882, sent_errs: 0, sent_drop: 0, sent_fifo: 0, sent_colls: 0, sent_carrier: 0, sent_compressed: 0 }], filefd: FileFd { allocated: 32, maximum: 13188943 } })
2023-02-05T19:25:03Z app[9080177b645698] fra [info]flushed 3472 bytes
2023-02-05T19:26:17Z runner[9080177b645698] fra [info]machine restart policy set to 'no', not restarting

Oops, I misread your app config. It's still the same bug: the hosts your machines are landing on don't have enough free memory for the 128GB you're requesting. We should be erroring when this happens, but it's letting the machine launch instead.

Well, I was hoping Fly Machines would solve my problem: provision a fairly large machine on demand, run a task, and be billed only for the time used.

From what you are saying I gather you have a pretty serious issue. If I were to provision a 128GB Postgres machine, am I at risk of it suddenly dying once it reached some memory threshold because the underlying host does not actually have that much RAM?

Yes we’re working on the bug. Postgres gets provisioned a little differently, but it’s theoretically possible to hit this. You’re actually the only person who’s run into this particular bug, and it’s a combination of region + Machine size + create pattern that seems to be causing it.

Is it a valid work-around to use a different region? Is there any other work-around?

I wouldn’t suggest using Fly for short lived 128GB RAM machines. You’ll get errors when we get this bug fixed, but you won’t be able to get them reliably in many places.

I pretty much do not care where the machine is executed, so if there is a reliable region I can use, I will be happy. (I have tried forcing den and it seems to work.)
I have not found another FaaS provider that allows 128GB machines on demand, so that's why I am using Fly.
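
In case it helps anyone else, here is roughly how I pin the region today (a sketch; the image and guest sizing are placeholders, and I'm assuming region stays a top-level field of the create body):

# "region" sits at the top level of the machine create body, next to "config".
body = {
    "region": "den",  # force den instead of fra
    "config": {
        "image": "registry.fly.io/yarabot-machines:latest",  # placeholder
        "guest": {"cpu_kind": "performance", "cpus": 16, "memory_mb": 131072},
    },
}
# POST this to /v1/apps/yarabot-machines/machines as in the earlier sketches.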