Out of Memory restarts

I recently added some file-based caching to my app: add caching to github stuff (kentcdodds/remix-kentcdodds@0d530ae on GitHub)

Since deploying that I’ve strangely gotten a 404 on this post: https://kent.dev/blog/2010s-decade-in-review

Almost as frequently, I also get a 502 on that post. I’ve checked the logs, and they look like this:

2021-06-02T12:19:57.667749102Z app[7ae2dc40] dfw [info] [   68.111839] Out of memory: Killed process 539 (node) total-vm:984772kB, anon-rss:182740kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:3824kB oom_score_adj:0
2021-06-02T12:19:57.718559793Z app[7ae2dc40] dfw [info] npm notice
2021-06-02T12:19:57.719533961Z app[7ae2dc40] dfw [info] npm notice New minor version of npm available! 7.13.0 -> 7.15.1
2021-06-02T12:19:57.720591483Z app[7ae2dc40] dfw [info] npm notice Changelog: <https://github.com/npm/cli/releases/tag/v7.15.1>
2021-06-02T12:19:57.721417127Z app[7ae2dc40] dfw [info] npm notice Run `npm install -g npm@7.15.1` to update!
2021-06-02T12:19:57.721662855Z app[7ae2dc40] dfw [info] npm notice
2021-06-02T12:19:57.730374703Z app[7ae2dc40] dfw [info] Main child exited normally with code: 1
2021-06-02T12:19:57.731183757Z app[7ae2dc40] dfw [info] Starting clean up.
2021-06-02T12:20:30.406625362Z runner[d0701bc9] dfw [info] Starting instance
2021-06-02T12:20:30.454473453Z runner[d0701bc9] dfw [info] Configuring virtual machine
2021-06-02T12:20:30.456023960Z runner[d0701bc9] dfw [info] Pulling container image
2021-06-02T12:20:30.798229633Z runner[d0701bc9] dfw [info] Unpacking image
2021-06-02T12:20:30.813118778Z runner[d0701bc9] dfw [info] Preparing kernel init
2021-06-02T12:20:31.332986722Z runner[d0701bc9] dfw [info] Configuring firecracker
2021-06-02T12:20:31.398656948Z runner[d0701bc9] dfw [info] Starting virtual machine
2021-06-02T12:20:31.605245187Z app[d0701bc9] dfw [info] Starting init (commit: cc4f071)...
2021-06-02T12:20:31.640565789Z app[d0701bc9] dfw [info] Running: `/cnb/process/web` as heroku
2021-06-02T12:20:31.659702141Z app[d0701bc9] dfw [info] 2021/06/02 12:20:31 listening on [fdaa:0:23df:a7b:7b:d070:1bc9:2]:22 (DNS: [fdaa::3]:53)

I’ve taken memory snapshots of my running app locally (simulating the production version as best I can) and I haven’t been able to get the memory above 50MB. It doesn’t increase over time either; it stays pretty steady between 30 and 50MB.
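
For a quick way to watch those numbers over time without DevTools, something like this works (a rough sketch; the interval and MB rounding are arbitrary):

// rough sketch: periodically log the memory numbers Node itself reports
setInterval(() => {
  const { rss, heapUsed, heapTotal, external } = process.memoryUsage()
  const toMB = (n) => `${Math.round(n / 1024 / 1024)}MB`
  console.log(`rss=${toMB(rss)} heapUsed=${toMB(heapUsed)} heapTotal=${toMB(heapTotal)} external=${toMB(external)}`)
}, 10000)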

I’m at a loss as to what to do at this point. Suggestions are welcome :slight_smile:

Looking at the charts, it does look like this instance (7ae2dc40) is operating with ~8MB of free memory at all times. So it would be pretty sensitive to any increase in memory usage. You can see memory charts in our dashboard if you navigate to your app’s metrics page.

I’d suggest flyctl ssh console-ing into your app, looking at top (or htop), and maybe watching dmesg to see what’s happening. It might kick you out as soon as it OOMs, though, so you might not be able to catch it.

In any case, this would show you what’s using the most memory inside the instance and might help you troubleshoot.
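
Roughly this, assuming top and dmesg are present in your image (htop usually isn’t installed by default):

flyctl ssh console
# then, inside the VM:
top    # per-process memory, so you can see what node’s RSS looks like
dmesg  # kernel log; the OOM killer writes its report here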

I wonder if

cross-env NODE_ENV=production remix-serve build

keeps too much memory around? I don’t really know how Remix works, but it sounds like it might be building your app at runtime? If you can pre-build it in your Docker image, that would be ideal. Were you testing locally with this exact command?

This is running on a fairly small instance too (1x shared CPU and 256MB of memory); you might need more memory to run this specific app.

Thanks for the info! I have no idea why it’s consuming so much memory on Fly. remix-serve is a simple express server. The build part is just the directory that the express server should serve. We’re not building at runtime.

When testing locally I ran this:

NODE_ENV=production node --inspect-brk --require ./node_modules/dotenv/config ./node_modules/.bin/remix-serve build

That way I could take the memory snapshot and also simulate the same sort of thing that’s happening in production (with the right env vars set etc.).

This might have something to do with Node’s “max old space” and other related V8 memory settings.

I’m reading up on running Node.js in a memory-constrained environment and I think you could try the following:

cross-env NODE_ENV=production NODE_OPTIONS=--max-old-space-size=200 remix-serve build

200MB is roughly 4/5 of the 256MB available on these instances.

edit: might be easier to set it in [env] in your fly.toml.
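
Something like this in fly.toml (a sketch; adjust the number to taste):

[env]
  NODE_OPTIONS = "--max-old-space-size=200"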

You could also scale your VM’s memory:

flyctl scale memory 512

edit: I’ve been told 1024 might be the minimum

Thanks for the tips! I’m going to try and identify the memory issue. It really doesn’t make sense that it would take up that much memory in the first place. But at least it makes sense that I’m running out :sweat_smile:

Quick note: I’ve verified that my app has been this way for as long as I’ve been deploying on Fly. I’m concerned about why the memory usage is so high on Fly but not locally, but at least I know that I didn’t break something recently :sweat_smile:

I’m also seeing high memory usage on my instance, ~160MB with a simple express server.

Node OOMs are especially tricky to chase down. This might be deeper than you want to go, but it’s worth exporting Node runtime metrics to our metrics service with siimon/prom-client (a Prometheus client for Node.js): https://github.com/siimon/prom-client

That client includes numbers for active handles, GC, and event loop lag, all of which are helpful when Node apps stumble. If you get those exporting, we can help you with a Grafana dashboard.
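
As a rough sketch, assuming an Express app you control (the /metrics path is a convention, not a requirement):

const express = require('express')
const client = require('prom-client')

// collect the default Node runtime metrics (heap usage, event loop lag, active handles, GC, ...)
client.collectDefaultMetrics()

const app = express()

// expose them on an endpoint a Prometheus scraper can hit
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType)
  res.send(await client.register.metrics())
})

app.listen(process.env.PORT || 3000)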

~160MB for the whole “system” is probably ok. Some used memory is disk cache, which is only kind of “in use”.

I hadn’t considered that. How much memory overhead does running on Fly generally add? Would that explain the ~180MB of extra memory used (the ~40MB I measured locally vs the ~220MB on Fly)?

There’s very little from us: easily <5MB, probably closer to 1MB at any given time. But since you’re running in a VM, you get all the OS-related memory allocations to think about too. Here’s how one of your VMs was using memory recently:

That yellow section (RAM used) is most likely what Node is doing. The Cache + Buffers section (blue) is the operating system’s cache, which gets evicted when your process needs the memory.

It seems like your Node process grabbed 240MB of RAM at one point, then settled, then bounced back up a little.
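
You can see the same split from inside the VM too, assuming the usual tools are in your image:

flyctl ssh console
free -m            # the "used" vs "buff/cache" columns roughly map to the yellow vs blue sections
cat /proc/meminfo  # the raw numbers behind those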

We need to make it easier to set up Grafana so you can see all this. It’s a pain in the ass the first time you start trying to use it; after that, it’s incredibly helpful.