Out of Memory restarts

I recently added some file-based caching to my app: add caching to github stuff · kentcdodds/remix-kentcdodds@0d530ae · GitHub

Since deploying that I’ve strangely gotten a 404 on this post: https://kent.dev/blog/2010s-decade-in-review

But almost just as frequently I also get a 502 on that post. I’ve checked the logs and they look like this:

2021-06-02T12:19:57.667749102Z app[7ae2dc40] dfw [info] [   68.111839] Out of memory: Killed process 539 (node) total-vm:984772kB, anon-rss:182740kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:3824kB oom_score_adj:0
2021-06-02T12:19:57.718559793Z app[7ae2dc40] dfw [info] npm notice
2021-06-02T12:19:57.719533961Z app[7ae2dc40] dfw [info] npm notice New minor version of npm available! 7.13.0 -> 7.15.1
2021-06-02T12:19:57.720591483Z app[7ae2dc40] dfw [info] npm notice Changelog: <https://github.com/npm/cli/releases/tag/v7.15.1>
2021-06-02T12:19:57.721417127Z app[7ae2dc40] dfw [info] npm notice Run `npm install -g npm@7.15.1` to update!
2021-06-02T12:19:57.721662855Z app[7ae2dc40] dfw [info] npm notice
2021-06-02T12:19:57.730374703Z app[7ae2dc40] dfw [info] Main child exited normally with code: 1
2021-06-02T12:19:57.731183757Z app[7ae2dc40] dfw [info] Starting clean up.
2021-06-02T12:20:30.406625362Z runner[d0701bc9] dfw [info] Starting instance
2021-06-02T12:20:30.454473453Z runner[d0701bc9] dfw [info] Configuring virtual machine
2021-06-02T12:20:30.456023960Z runner[d0701bc9] dfw [info] Pulling container image
2021-06-02T12:20:30.798229633Z runner[d0701bc9] dfw [info] Unpacking image
2021-06-02T12:20:30.813118778Z runner[d0701bc9] dfw [info] Preparing kernel init
2021-06-02T12:20:31.332986722Z runner[d0701bc9] dfw [info] Configuring firecracker
2021-06-02T12:20:31.398656948Z runner[d0701bc9] dfw [info] Starting virtual machine
2021-06-02T12:20:31.605245187Z app[d0701bc9] dfw [info] Starting init (commit: cc4f071)...
2021-06-02T12:20:31.640565789Z app[d0701bc9] dfw [info] Running: `/cnb/process/web` as heroku
2021-06-02T12:20:31.659702141Z app[d0701bc9] dfw [info] 2021/06/02 12:20:31 listening on [fdaa:0:23df:a7b:7b:d070:1bc9:2]:22 (DNS: [fdaa::3]:53)

I’ve taken memory snapshots of my running app locally (simulating the production version the best I can) and I haven’t been able to get the memory above 50 MB and it doesn’t increase over time either. It stays pretty steady between 30 and 50.

I’m at a loss of what to do at this point. Suggestions are welcome :slight_smile:

1 Like

Looking at the charts, it does look like this instance (7ae2dc40) is operating with ~8MB of free memory at all times. So it would be pretty sensitive to any increase in memory usage. You can see memory charts in our dashboard if you navigate to your app’s metrics page.

I’d suggest flyctl ssh console-ing into your app and looking at top (or htop) and maybe watching dmesg to see what’s happening. It might kick you out as soon as it OOMs though so you might not be able to catch it.

In any case, this would show you what’s using the most memory inside the instance and might help you troubleshoot.

I wonder if

cross-env NODE_ENV=production remix-serve build

keeps too much memory around? I don’t really know how remix works, but it sounds like it might be building your app at runtime? If you can pre-build it in your docker image, that would be ideal. Were you testing locally with this exact command?

This is running on a fairly small instance too (1x shared CPU and 256MB of memory), you might need more memory to run this specific app.

Thanks for the info! I have no idea why it’s consuming so much memory on Fly. remix-serve is a simple express server. The build part is just the directory that the express server should serve. We’re not building at runtime.

When testing locally I ran this:

NODE_ENV=production node --inspect-brk --require ./node_modules/dotenv/config ./node_modules/.bin/remix-serve build

That way I could take the memory snapshot and also simulate the same sort of thing that’s happening in production (with the right env vars set etc.).

This might have something to do with node’s “max old space” and other related V8 memory settings.

I’m reading up on running node.js in a memory-constrained environment and I think you could try the following:

cross-env NODE_ENV=production NODE_OPTIONS=--max-old-space-size=200 remix-serve build

200MB is about ~4/5th of the 256MB allowed in these instances.

edit: might be easier to set it in [env] in your fly.toml.

You could also scale your VM’s memory:

flyctl scale memory 512

edit: I’ve been told 1024 might be the minimum

Thanks for the tips! I’m going to try and identify the memory issue. It really doesn’t make sense that it would take up that much memory in the first place. But at least it makes sense that I’m running out :sweat_smile:

Just a quick note, I’ve just verified that my app has been this way on Fly for as long as I’ve been deploying on fly. I’m concerned about why the memory usage is so high on Fly but not locally, but at least I know that I didn’t break something recently :sweat_smile:

I’m also seeing high memory usage on my instance, ~160MB with a simple express server.

Node OOMs are especially tricky to chase down. This might be deeper than you want to go, but it’s worth exporting Node runtime metrics to our metrics service: GitHub - siimon/prom-client: Prometheus client for node.js

That client includes numbers for active handles and GC and event loop lag, all of which are helpful when Node apps stumble. If you get that exporting we can help you with a Grafana dashboard.

~160MB for the whole “system” is probably ok. Some used memory is disk cache, which is only kind of “in use”.

I hadn’t considered that. How much memory overhead is there in fly running generally? Would that explain the ~180MBs of extra memory used (the ~40MBs I measured locally vs the ~220MBs on Fly)?

There’s very little from us, easily <5MB. Probably closer to 1MB at any given time. But since you’re running in a VM, you get all the OS related memory allocations to think about too. Here’s how one of your VMs was using memory recently:

That yellow section (RAM used) is most likely what node is doing. The Cach + Buffer (blue) is the operating system’s cache, which gets evicted when your process needs it.

It seems like your node process grabbed 240mb RAM at one point, and then settled, then bounced back up a little.

We need to make it easier to setup Grafana so you can see all this. It’s a pain in the ass the first time you start trying to use it, after that it’s incredibly helpful.

Did any of you ever figure out a solution to this?

Below is a screenshot of the memory usage my app has on Heroku, which never even begins to approach the 512MB quota. Here on Fly, deploying with 256MB, 512MB and 1GB has all failed due to OOM.

image

1 Like

By comparing outputs from cmds like free, meminfo, top / htop (have to install it) between heroku and fly, you may get your answer?

Report back in case you do. It will be interesting to see that data.

As for analyzing heap within nodejs itself, consider setting up clinicjs: https://clinicjs.org/heapprofiler/

In addition to what @ignoramous said, post your OOM message here? It should be in an email.

Our platform is different enough from Heroku that it’s hard to compare graphs. Your processes run in isolated VMs here. This means there are more memory metrics that might trigger an OOM.

Ah, I found the OOM message (you can see it with fly logs).

Try it with 2GB memory? It appears to be spiking to >1GB at boot time, it’s possible it will settle after that.

Thank you for the pointers, I’ll try this when I get a chance!

With 2 GB it starts, but it stays near max memory usage over several minutes. If you look at the Heroku screenshot, there’s a small dip right before 12 PM, which corresponds to the daily reboot, and memory usage is below 100 MB from start, and stays that way for hours. This last part matches the experience when I run the application locally, with ~100 MB memory being used.

1 Like

Oh well that’s really strange.

If you can get it running for a decent amount of time, try this:

fly ssh console

Then run top and hit shift +m. This should sort processes by memory usage.

You can also look here to get a breakdown of what memory is actually in use. It’s possible some specific metric is varying over time, but the page cache is keeping it “full”: Sign In · Fly

My best guess on what’s happening here is that Heroku sets some environment variable that controls Node apps memory usage, and we’re not bringing it over. But that’s a wild guess with no evidence. :slight_smile:

I put a $10 credit on your account so you have time to debug this.

If anyone wants some closure on this… I figured out my problem and wrote about it: Fixing a Memory Leak in a Production Node.js App

1 Like