Virtio-balloon overcommit causing severe memory thrashing on 8-CPU / 8GB Firecracker VMs

We’ve been experiencing severe performance degradation on several of our Fly.io Firecracker VM instances (8 vCPU / 8 GB RAM). After extensive investigation, we’ve identified the root cause: the virtio-balloon is inflating to ~80% of guest RAM, leaving only 1.4–2 GB usable out of 8 GB.

Summary

  • VM spec: 8 vCPU (AMD EPYC), 8 GB RAM, no swap

  • Symptom: Load average spikes to 20–30+, all application requests degraded to 30–100s response times

  • Root cause: virtio-balloon inflates to ~6.5 GB, leaving insufficient memory for page cache. This triggers a kswapd thrashing loop with ~2 TB of disk re-reads in just 2 hours.

Evidence

Balloon inflation (from /proc/vmstat)

Affected instance:

balloon_inflate:  1,708,800 pages (6,683 MB given to host)
balloon_deflate:     37,632 pages (  147 MB returned)
NET inflated:     1,671,168 pages = 6,528 MB stolen from guest

Healthy instance for comparison:

balloon_inflate:  1,529,600 pages
balloon_deflate:        512 pages
NET inflated:     1,529,088 pages = 5,973 MB stolen from guest

Both instances show 75–82% of VM RAM taken by the balloon. The “healthy” one only survives because it has slightly more headroom (1,968 MB vs 1,413 MB usable).
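The net figures above follow directly from the raw counters: the balloon driver accounts in 4 KiB pages, so net MB = (inflate − deflate) × 4 / 1024. A minimal sketch of that arithmetic (the `net_balloon_mb` helper is ours, not part of any tool; the counter values are the ones from the instances above — on a live guest you would read `/proc/vmstat` instead of the sample string):

```python
# Convert virtio-balloon counters from /proc/vmstat into MB.
# The balloon accounts in 4 KiB pages.
PAGE_KB = 4

def net_balloon_mb(vmstat_text: str) -> int:
    """MB currently held by the balloon (inflate minus deflate)."""
    counters = {}
    for line in vmstat_text.splitlines():
        key, _, value = line.partition(" ")
        if key.startswith("balloon_"):
            counters[key] = int(value)
    pages = counters.get("balloon_inflate", 0) - counters.get("balloon_deflate", 0)
    return pages * PAGE_KB // 1024

# Counters from the affected instance above; on a real VM use
# open("/proc/vmstat").read() instead of this sample.
affected = "balloon_inflate 1708800\nballoon_deflate 37632\n"
print(net_balloon_mb(affected))  # 6528 (MB stolen from the guest)
```

Running the same calculation on the healthy instance's counters (1,529,600 / 512) gives 5,973 MB, matching the figure above.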

Memory accounting

| Metric | Affected VM | Healthy VM |
| --- | --- | --- |
| MemTotal | 7,941 MB | 7,941 MB |
| Balloon stolen | 6,528 MB | 5,973 MB |
| Actually usable | 1,413 MB | 1,968 MB |
| Application RSS | ~325 MB | ~337 MB |
| Left for page cache | ~1,000 MB | ~1,500 MB |

Thrashing indicators

Affected vs healthy instance:

| Metric | Healthy (7h uptime) | Affected (2h uptime) | Factor |
| --- | --- | --- | --- |
| pgmajfault | 14,727 | 8,742,890 | 594x |
| allocstall | 81 | 2,885,556 | 35,600x |
| pgscan_kswapd | 3,990,861 | 604,998,966 | 152x |
| Total disk reads | ~12 GB | ~1,982 GB | 165x |
| workingset_refault_file | 2,864,367 | 335,447,011 | 117x |
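Note that the raw counters above accumulated over different uptimes (7 h healthy vs 2 h affected), so the raw factors actually understate the difference; normalizing to per-hour rates makes it starker. A small sketch (the `rate_factor` helper is ours, purely illustrative):

```python
# The healthy VM had 7 h of uptime vs 2 h for the affected VM, so compare
# per-hour rates rather than raw counts.

def rate_factor(healthy_count: int, healthy_hours: float,
                affected_count: int, affected_hours: float) -> float:
    """How many times faster the affected VM accumulates a counter."""
    return (affected_count / affected_hours) / (healthy_count / healthy_hours)

# pgmajfault: ~2,100x per hour once uptime is accounted for (vs 594x raw)
print(f"pgmajfault: {rate_factor(14_727, 7, 8_742_890, 2):,.0f}x")
# allocstall: ~125,000x per hour (vs 35,600x raw)
print(f"allocstall: {rate_factor(81, 7, 2_885_556, 2):,.0f}x")
```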

The kernel dmesg also shows a kswapd BUG/crash in shrink_slab / balance_pgdat — the kernel hit a fault while trying to reclaim memory under extreme pressure.

Impact on application performance

During the thrashing period, WebSocket API requests degraded severely:

API call A:  101,454 ms  (normally <1s)
API call B:   55,214 ms  (normally <5s)
API call C:    1,036 ms  (normally ~50ms)

Why the working set doesn’t fit

Our application’s on-disk footprint is ~1.2 GB (Node.js with native modules including ML model weights at ~677 MB). Combined with the Node.js runtime, system libraries (/usr/lib at 523 MB), and operational data, the working set is ~2 GB. With only 1.4 GB usable after balloon inflation, the page cache is constantly evicted and re-read from the overlay filesystem layers.

The overlay root filesystem is backed by 3 read-only virtio block devices (vdd, vde, vdf at 8 GB each), and the repeated re-reads of these layers account for the ~2 TB of I/O in just 2 hours (a sustained ~280 MB/s of reads).

Balloon configuration issue

The virtio-balloon device (virtio0, device_id=0x0005) has the following negotiated features:

Features: 0x6000000080000000
Bits set: 31 (VIRTIO_F_ACCESS_PLATFORM), 61, 62

Critically NOT negotiated:

  • DEFLATE_ON_OOM (bit 2): The balloon does not auto-shrink when the guest is under memory pressure

  • STATS_VQ (bit 1): The host cannot query guest memory stats to make informed decisions

  • FREE_PAGE_HINT (bit 3): Free-page hinting is not active, so the host receives no hints for dynamic adjustment

This means once the balloon inflates, the guest has no mechanism to reclaim memory even when it is thrashing to death.
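Decoding the mask confirms this. A quick sketch, using the balloon feature bit numbers from the virtio spec / Linux `virtio_balloon.h` (the `set_bits` helper and variable names are ours):

```python
# Decode the negotiated 64-bit virtio feature mask from the report above and
# check the balloon's device-specific feature bits (per virtio_balloon.h).
BALLOON_FEATURES = {
    0: "MUST_TELL_HOST",
    1: "STATS_VQ",
    2: "DEFLATE_ON_OOM",
    3: "FREE_PAGE_HINT",
    4: "PAGE_POISON",
    5: "REPORTING",
}

def set_bits(mask: int) -> list[int]:
    """All bit positions set in a 64-bit feature mask."""
    return [bit for bit in range(64) if mask >> bit & 1]

features = 0x6000000080000000  # negotiated mask from virtio0 above

print(set_bits(features))  # [31, 61, 62]
missing = [name for bit, name in BALLOON_FEATURES.items()
           if not features >> bit & 1]
print(missing)  # every balloon-specific feature is off, incl. DEFLATE_ON_OOM
```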

Questions / Requests

  1. Is this level of balloon inflation (75–82%) intentional? On most cloud platforms, VMs receive close to their advertised RAM. Leaving only 1.4–2 GB usable out of 8 GB seems extremely aggressive.

  2. Can DEFLATE_ON_OOM be enabled? This would allow the balloon to automatically shrink when the guest is under memory pressure, preventing the thrashing death spiral.

  3. Can STATS_VQ be enabled? This would let the hypervisor make smarter balloon sizing decisions based on actual guest memory utilization.

  4. Alternatively, could VMs be allocated more physical memory to account for balloon overhead? For example, allocating 16 GB so that ~8 GB remains usable after balloon.

How to reproduce

On any Firecracker VM with the balloon driver:

  1. cat /proc/vmstat | grep balloon — check NET inflated pages

  2. free -m — observe actual available memory vs MemTotal

  3. Run any workload with a ~2 GB working set and observe load average spike as page cache is exhausted
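Steps 1 and 2 can be combined into a single check of how much of MemTotal actually remains usable. A sketch with our own hypothetical helper (`usable_after_balloon`); the sample strings stand in for `/proc/vmstat` and `/proc/meminfo` on a live guest:

```python
# How much of MemTotal remains once net balloon inflation is subtracted.
def usable_after_balloon(vmstat: str, meminfo: str) -> tuple[int, int]:
    """Return (MemTotal MB, MB left after net balloon inflation)."""
    stats = dict(line.split()[:2] for line in vmstat.splitlines() if line)
    pages = int(stats.get("balloon_inflate", 0)) - int(stats.get("balloon_deflate", 0))
    balloon_mb = pages * 4 // 1024          # balloon counts 4 KiB pages
    for line in meminfo.splitlines():
        if line.startswith("MemTotal:"):
            total_mb = int(line.split()[1]) // 1024   # meminfo is in kB
    return total_mb, total_mb - balloon_mb

# Values from the affected VM above; on a real guest read the two /proc files.
vmstat = "balloon_inflate 1708800\nballoon_deflate 37632\n"
meminfo = "MemTotal: 8131584 kB\n"  # 7,941 MB
print(usable_after_balloon(vmstat, meminfo))  # (7941, 1413)
```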

Happy to provide additional diagnostics. Thanks for looking into this.


We have been testing this feature over the past few weeks on a subset of hosts and observing customer impact.

Taking this feedback to the team as we fine-tune how this is applied. We will also document the ballooning parameters we use once finalized, so customers can understand and plan around them.

Generally speaking, the intention is for ballooning to cause only limited performance degradation on affected customer hosts.

Thanks for looking into this and for the transparency around the testing rollout.

For context, we’re running our workloads on Sprite VMs. A few follow-up questions:

  1. Is there a way to opt out of memory ballooning entirely? For latency-sensitive workloads (like ours, with ML model weights that need to stay resident in page cache), even “limited” balloon inflation can cause a cascading thrashing spiral. Is ballooning applied uniformly across all VM products, or are there options (e.g. switching to Machines or upgrading our account tier) that would give us non-overcommitted hosts?

  2. Can we pin specific VMs to non-overcommitted hardware? Some cloud providers offer “dedicated hosts” or “metal” instances that guarantee the full advertised RAM. Is there an equivalent on Fly — either today or planned — where we could pay a premium for guaranteed memory?

  3. In the interim, is there a workaround? For example:

    • Would provisioning a 16 GB Sprite VM and only using ~8 GB give us enough headroom to survive balloon inflation? Or would the balloon simply scale proportionally and still claim 75–82%?

    • Would manually creating a swap file on a persistent volume help absorb the pressure when the balloon inflates, preventing the kswapd thrashing spiral and OOM kills?

We’re happy to upgrade our account or switch products if that’s what it takes to get non-overcommitted resources.
Our main concern is predictability — we need to know that 8 GB means at least ~7 GB usable, not 1.4 GB.

Sprites are very different from Machines here, yes. @kurt can chime in with more details (and I’m sure a lot of this is yet to be decided) but with Sprites you only pay for the resources you’re using, and ballooning is a way for “unused” resources to be reclaimed - the current algorithm likely assumes the user doesn’t care as much about page cache as you do.

for Machines we’re still experimenting as Matt said, but we’ll probably only end up using ballooning when a host is under significant memory pressure, or other similarly rare situations.

we do offer dedicated hosts if that’s a path you want to go down - though I will say it only really makes sense cost-wise if you’re expecting to use at least 1TB RAM within a single region. you definitely don’t need dedicated hosts to get the resources you expect on Fly!
