Simple Phoenix app with shared-cpu-1x is failing with OOM

I was making a simple HTML change to my Phoenix (1.7.11) app and it is now failing due to OOM. I tried reverting my changes and the same thing still happens. My Phoenix app is very simple and has 0 users. I am on the Hobby plan, so I'm using shared-cpu-1x.

Below is the error, and even the total VM memory used (1789260kB) is below the 256MB limit.
So basically I am clueless.

ams [info] [ 4.536562] Out of memory: Killed process 318 (beam.smp) total-vm:1789260kB, anon-rss:83092kB, file-rss:0kB, shmem-rss:71476kB, UID:65534 pgtables:472kB oom_score_adj:0
ams [info] INFO Main child exited with signal (with signal 'SIGKILL', core dumped? false)
ams [info] INFO Process appears to have been OOM killed!

It is not.

1789260 KiB = 1747.32 MiB > 256 MiB (these are kibibytes/mebibytes, not kilobytes/megabytes).
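
To spell the conversion out: 1789260 ÷ 1024 ≈ 1747.3 MiB, while a 256 MiB machine only has 256 × 1024 = 262144 KiB, so the reported total-vm is roughly 6.8× the machine's memory.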

You definitely need to add more memory - OOMs don’t lie :slight_smile:

  • Daniel

My Phoenix app started going into an OOM loop at deployment just now too. Tbf, mine veers dangerously close to the 256MB limit anyway (but has been deploying fine), so any new code added was my first suspect, especially as it still works fine under 512MB.

However, I rolled the release back to the last known healthy image that was working fine earlier (deployed 2.5 days ago), and even that one is getting OOM killed now.

Speaking of which, I noticed my Next.js app suddenly has about +60MB of usage in the last couple of days, and nothing changed that I can think of. Did something on Fly's end cause this jump in memory usage, or is it just a coincidence for the three of us?

I have the exact same problem with a simple Phoenix app. It has been running fine for a year now, but started getting OOM killed when restarting today.

arn [info] [ 4.898602] Out of memory: Killed process 318 (beam.smp) total-vm:1767996kB, anon-rss:79500kB, file-rss:0kB, shmem-rss:77600kB, UID:65534 pgtables:468kB oom_score_adj:0
arn [info] INFO Main child exited with signal (with signal 'SIGKILL', core dumped? false)
arn [info] INFO Process appears to have been OOM killed!

The total VM memory used is similar to OP's. It works fine when scaled to 512MB, but when I scale back down to 256MB I get the same problem.
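
For reference, I'm just using the stock flyctl scaling commands from the app directory (so the app name comes from fly.toml):

    fly scale memory 512    # boots and runs fine
    fly scale memory 256    # gets OOM killed again on boot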

@roadmr Thank you for the reply.
I misinterpreted it, since my usual memory usage was in the range of 180MiB.
I am still struggling to figure out what changed since yesterday. I tried reverting to the old version and had no luck, and the traffic has not changed at all; it is always very low, almost 0.
Is 512MiB of memory required to run a simple, basic Phoenix app now?

We have started suggesting a minimum of 1GB of RAM for full-stack apps. It’s actually kinda hard to run stuff in under 256MB.

For side projects it’s far better to let machines auto-stop / scale to zero for money-saving purposes.
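
If it helps, scale-to-zero is configured in the [http_service] section of fly.toml; roughly something like this (internal_port is just an example here — use whatever port your app listens on, and check the current docs for the exact option values):

    [http_service]
      internal_port = 8080
      auto_stop_machines = true
      auto_start_machines = true
      min_machines_running = 0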

We’re seeing normal numbers of OOMs across all customers.

If you’re curious, February 28th of this year was our busiest OOM day ever.

Is this across ALL memory permutations, or just the 256MB configuration? Did you notice any difference in 256MB => 512MB changes in the last few days? I’ve bumped my app up one tier to avoid the weird memory crawl; I wonder if that has any effect on your observation of normal OOM errors…

@kurt I believe there is something funky going on with Fly’s infra.

I have a TypeScript Temporal worker running on 256MB; it has flatlined at about 174MB for a while now. Then, literally a few minutes ago, I scaled the worker to 0 and rescaled it back to 1 (no code or config changes), and the baseline memory shot up to about 200MB.

I’m a little confused.

What’s also odd is that the previous stats showed 217MB total; now the total is 213MB, so something on Fly’s side is consuming 4 more MB.

For me, adding swap_size_mb to fly.toml fixed the OOM issue.
But I can still see a jump in memory usage from 180MB to 200MB without any code or traffic change.
Also, I am seeing the OOM with my Postgres now. :frowning:

swap_size_mb = 512
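
That's a top-level key in fly.toml, not nested under any section, so roughly like this (the app name is just a placeholder):

    app = "my-phoenix-app"
    swap_size_mb = 512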

With no change in application code, we’re also seeing new deploys fail. I’ve tested locally, constraining memory to 128MB for the app process (node) and separately constraining it via Docker; I have to limit memory to 64MB in order to trigger an OOM. There definitely appears to be something the matter with the 256MB VMs; our first deploy failed yesterday. I’ve also destroyed an existing machine and deployed to a new one with the same results. According to Fly’s Grafana, a Fly instance (machine/VM) for this app (when running) never uses more than 145MB.
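
For anyone who wants to repeat the local test, this is roughly the shape of it (the image name is a placeholder for our app image):

    # hard 128MB cap, with no extra swap beyond the limit
    docker run --rm --memory=128m --memory-swap=128m my-app-image

    # only at 64MB does the process actually get OOM killed
    docker run --rm --memory=64m --memory-swap=64m my-app-image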

Updated data point: to be clear, you can replicate a deploy exactly (by checking out and deploying a prior release that previously deployed stably) and see that the machine no longer boots.

For now, the app is deployed and running on 256MB, but I had to add a tiny swap partition. That’s swap_size_mb = 128 as a top-level option in fly.toml.

I wouldn’t recommend using swap_size_mb as a workaround, since suspend doesn’t support it. We’ll have to wait and see whether this change was intentional or not. If it was, I guess they are kicking the 256MB bums (like myself) off the bus bench.

Where is the incompatibility you mention documented? I couldn’t find anything that mentions it.

It’s not documented in the official docs, but it’s mentioned in a post: New feature in preview: suspend/resume for Machines

Any updates from Fly on this?

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.