.NET 5/6 app killed because of memory usage; is a missing cgroup limit the reason?

TL;DR: it seems the memory limit is not reflected in /sys/fs/cgroup/memory/memory.limit_in_bytes, so the dotnet runtime ends up exceeding the available memory.


I decided to give fly.io a go as a hosting platform for a dockerized .NET 5 web app.
For this purpose I am using the free plan to deploy a simple ASP.NET test app I have.
Unfortunately the app fails on boot; it is killed because it exceeds the memory limit.

I was quite surprised, because I recently ran it on a similar platform that still provides a free plan. Although that platform offers twice the memory available in the free plan here, the app never exceeded ~150 MB there, and definitely not during boot.
I also ran it locally in a Docker container with memory limited to 200M, and it ran just fine.

no cgroup limit:
After some further investigation, I believe the dotnet runtime respects cgroup memory limits (*).
However, it looks like the fly.io runtime doesn’t set these values.
Compare the content of /sys/fs/cgroup/memory/memory.limit_in_bytes in each environment:
For the local docker (limited to 200m) it is 209715200
For the other hosting (limited to 512m) it is 536870912
For fly.io (free plan, 256m) it is 9223372036854771712
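The three values above can be told apart programmatically. The following is a minimal sketch, not the runtime's actual detection code: the class and method names are made up for illustration, and it assumes 4 KiB pages, where cgroup v1 reports "no limit" as long.MaxValue rounded down to a page boundary (exactly the 9223372036854771712 seen on fly.io). The "max" branch covers the cgroup v2 spelling.

```csharp
using System;
using System.IO;

static class CgroupLimit
{
    // long.MaxValue rounded down to a 4 KiB page boundary: 9223372036854771712.
    // Assumption: 4 KiB pages; other page sizes give a slightly different value.
    const long Unlimited = long.MaxValue & ~0xFFFL;

    // Returns the limit in bytes, or null when the text means "unconstrained".
    // Unparseable input is also treated as "no usable limit" in this sketch.
    public static long? Parse(string text)
    {
        text = text.Trim();
        if (text == "max") return null; // cgroup v2 (memory.max) spelling
        if (!long.TryParse(text, out var value)) return null;
        return value >= Unlimited ? (long?)null : value;
    }

    static void Main()
    {
        const string path = "/sys/fs/cgroup/memory/memory.limit_in_bytes";
        if (!File.Exists(path))
        {
            Console.WriteLine("No cgroup v1 memory controller file found.");
            return;
        }
        var limit = Parse(File.ReadAllText(path));
        Console.WriteLine(limit is long bytes
            ? $"cgroup limit: {bytes / (1024 * 1024)} MB"
            : "cgroup limit: none");
    }
}
```

With the values from this post, 209715200 and 536870912 come back as real limits, while the fly.io value comes back as "none".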

My guess is that, because the cgroup limit is not set, dotnet is not aware of the actual limit. It therefore keeps allocating more memory and eventually gets killed.

(*) I didn’t debug it, but this code seems to be relevant: runtime/Interop.cgroups.cs at main · dotnet/runtime · GitHub

2 Likes

As a workaround, you could try setting the DOTNET_GCHeapHardLimit or DOTNET_GCHeapHardLimitPercent environment variables.
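A small sketch of how such a value might be prepared and checked. As far as I know, the runtime reads these variables as hexadecimal values, and runtimes older than .NET 6 may only recognize the older COMPlus_ prefix (COMPlus_GCHeapHardLimit); treat both points as assumptions to verify against the GC configuration docs. The HeapLimit class is hypothetical.

```csharp
using System;

static class HeapLimit
{
    // Formats a byte count as a hex value, the form the GC hard-limit
    // variable is documented to expect (assumption, see lead-in).
    public static string ToEnvValue(long bytes) => $"0x{bytes:X}";

    static void Main()
    {
        // 200 MiB, matching the local Docker limit used earlier in the thread.
        Console.WriteLine($"DOTNET_GCHeapHardLimit={ToEnvValue(200L * 1024 * 1024)}");

        // What the GC itself believes it may use in the current process.
        var info = GC.GetGCMemoryInfo();
        Console.WriteLine(
            $"GC total available: {info.TotalAvailableMemoryBytes / (1024 * 1024)} MB");
    }
}
```

Running the program once without the variable and once with it set shows whether the GC picked up the limit.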

1 Like

In theory it should do the trick. Unfortunately there is something else going on, which I don’t (yet) understand. It seems the app is OOM-killed when the virtual memory of the process exceeds the quota. So even when the available memory is limited via the mentioned environment variables, the process still gets killed.

1 Like

Interesting find about the cgroup limits being incorrectly set. Might there be something else that’s incorrectly set as well?

It might explain some of the OOMs folks have been stumbling upon specifically on Fly, like this one: Out of Memory restarts - #12 by poacher2k

VMs on the Fly.io platform aren’t running as Docker containers isolated by cgroups; they run as Firecracker micro-VMs that have their own entire guest Linux kernel at their disposal. This means that available memory can be determined directly through /proc/meminfo or the free utility.
It sounds like the application framework you’re using is expecting to run within a cgroups memory constraint, which isn’t the case here. (That 9223372036854771712 value is a default / unconstrained value, not the total amount of memory on the host system.)

Maybe there is some workaround to configure your application to ignore cgroups and use all the memory available as reported by the Linux kernel?
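For reference, reading the kernel's own view from /proc/meminfo could look like the sketch below. The MemInfo class is hypothetical; only the field names (MemTotal, MemAvailable) and the "Name: value kB" line layout come from the kernel.

```csharp
using System;
using System.IO;

static class MemInfo
{
    // Returns the value of a /proc/meminfo field in kB, or null if it is absent.
    public static long? ReadKb(string meminfoText, string field)
    {
        foreach (var line in meminfoText.Split('\n'))
        {
            if (!line.StartsWith(field + ":", StringComparison.Ordinal)) continue;
            var parts = line.Substring(field.Length + 1)
                            .Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
            if (parts.Length > 0 && long.TryParse(parts[0], out var kb)) return kb;
        }
        return null;
    }

    static void Main()
    {
        const string path = "/proc/meminfo";
        if (!File.Exists(path))
        {
            Console.WriteLine("/proc/meminfo not found (not running on Linux?)");
            return;
        }
        var text = File.ReadAllText(path);
        long? total = ReadKb(text, "MemTotal");
        long? available = ReadKb(text, "MemAvailable");
        Console.WriteLine($"MemTotal: {total / 1024} MB, MemAvailable: {available / 1024} MB");
    }
}
```

Inside a Firecracker VM these numbers describe the whole guest, so they are the effective limit for the process.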

2 Likes

Thanks for the clarification. This is a standard .NET 5 app (ASP.NET), and these run in millions of copies on Linux-based stacks. It’s not like I’m using some barely known or rarely used framework :) I will try to follow the Firecracker lead; maybe there is some known issue when running dotnet on Firecracker.

I might be wrong on the cgroup lead. I did some more debugging and logging and found out that the .NET app is in fact aware of the ~200 MB limit. At least the code I use for the memory checks is; maybe the runtime still falls back to cgroups in some obscure case.

1 Like

A follow-up: I set up a simple .NET 5 project that only allocates memory in a loop, adding 10 MB each time and reporting the following:

  • Memory load
  • Total available memory
  • Process (self) private memory size

You can see the Main routine at the end of the post, and the entire source in the Gist linked below.

Now, the strange part. When run locally in a Docker container with limited memory, the application eventually throws System.OutOfMemoryException, which is the expected behavior.
However, when run here on fly.io, it never throws that exception; instead it eventually gets killed.

Local docker output (tail):

...
memintense_1  | Allocating next 10Mb, i = 13 (total: 130Mb)
memintense_1  | ENV: 94 MB/147 MB. PrivateMemorySize 206 MB
memintense_1  | Allocating next 10Mb, i = 14 (total: 140Mb)
memintense_1  | Okay, got OutOfMemoryException!
memintense_1  | Out of memory.
memtestplain_memintense_1 exited with code 139

Compare this to the fly.io log:

2022-08-30T16:10:52.111 app[79e9d943] fra [info] ENV: 79 MB/221 MB. PrivateMemorySize 242 MB
2022-08-30T16:10:52.111 app[79e9d943] fra [info] Allocating next 10Mb, i = 19 (total: 190Mb)
2022-08-30T16:10:52.502 app[79e9d943] fra [info] ENV: 79 MB/221 MB. PrivateMemorySize 252 MB
2022-08-30T16:10:52.509 app[79e9d943] fra [info] Allocating next 10Mb, i = 20 (total: 200Mb)
2022-08-30T16:10:52.521 app[79e9d943] fra [info] [ 1.066235] Out of memory: Killed process 515 (dotnet) total-vm:3104720kB, anon-rss:202216kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:644kB oom_score_adj:0
2022-08-30T16:10:52.575 app[79e9d943] fra [info] Starting clean up.
2022-08-30T16:10:52.581 app[79e9d943] fra [info] Process appears to have been OOM killed!

Now, in the actual app I initially tried to run, there are some heavy libraries involved, and one of them uses a lot of memory when booting, which likely causes the issue. However, I’m not convinced it is the actual culprit; I believe the lack of an OutOfMemoryException when running here at Fly indicates that something stranger is going on.


the app code:

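using System;

class Program
{
    // MemoryUtils lives in the Gist mentioned above; it is not shown in this post.
    static void Main(string[] args)
    {
        GC.Collect(); // force a collection so the GC memory stats are available

        var i = 0;
        while (i++ < 256) // at most 256 iterations of 10 MB each
        {
            MemoryUtils.LogMemoryInfo(); // logs the "ENV: ... PrivateMemorySize ..." line
            Console.WriteLine($"Allocating next 10Mb, i = {i} (total: {i*10}Mb)");
            MemoryUtils.AllocateMemory(10); // allocate another 10 MB
        }
        // Presumably touches the buffers so they stay reachable until the end
        Console.WriteLine(MemoryUtils.UseBuffers());
    }
}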
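The MemoryUtils helpers are only available in the Gist, so here is a hypothetical sketch of what they might look like; the real implementation may differ. It assumes the "ENV: x MB/y MB" log line maps to the GC's MemoryLoadBytes and TotalAvailableMemoryBytes, which is a guess based on the log format. Each buffer is written once per 4 KiB page so that the kernel has to back it with physical memory rather than just reserve address space.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

static class MemoryUtilsSketch
{
    const long MB = 1024 * 1024;

    // Keeping references prevents the GC from reclaiming the buffers.
    static readonly List<byte[]> Buffers = new List<byte[]>();

    // Allocates `megabytes` MB and touches every 4 KiB page of the buffer.
    public static void AllocateMemory(int megabytes)
    {
        var buffer = new byte[megabytes * MB];
        for (long i = 0; i < buffer.LongLength; i += 4096) buffer[i] = 1;
        Buffers.Add(buffer);
    }

    // Total size of all buffers allocated so far.
    public static long AllocatedBytes()
    {
        long total = 0;
        foreach (var b in Buffers) total += b.LongLength;
        return total;
    }

    // Builds a line in the same shape as the logs in this thread.
    public static string FormatMemoryInfo()
    {
        var gc = GC.GetGCMemoryInfo();
        using var self = Process.GetCurrentProcess();
        return $"ENV: {gc.MemoryLoadBytes / MB} MB/{gc.TotalAvailableMemoryBytes / MB} MB. " +
               $"PrivateMemorySize {self.PrivateMemorySize64 / MB} MB";
    }

    static void Main()
    {
        for (var i = 1; i <= 3; i++)
        {
            Console.WriteLine(FormatMemoryInfo());
            AllocateMemory(10);
        }
        Console.WriteLine($"Allocated {AllocatedBytes() / MB} MB in total");
    }
}
```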
1 Like

I am pretty sure you are on to something. I wonder if this behaviour could be reproduced with Go or some statically compiled language targeting Linux. For .NET, Linux may very well be a second-class citizen?

Are you running the image with Docker on Linux? If not, Docker may be running it in a VM. On Fly, the container image is transmogrified and run as-is in a Firecracker-managed VM (running Linux).

I feel I have reached a kind of dead end in the .NET investigation. I have no idea what could be easily checked without spending too much time debugging the behavior remotely, and I have to remind myself this was supposed to be a quick proof of concept: running my app/stack on a promising hosting platform.

Therefore I am planning to reproduce the behavior in Python (it’s the only platform I have experience with that is also mentioned in the Fly.io tutorials), and possibly also in C++, just to see where it leads me. It should be relatively quick and might shed some light on the case.

As for .NET on Linux not being a first-class citizen: I don’t think that’s true anymore. It was true back in the .NET Framework (think 1.x-4.x) times, when running things on Linux was possible to some extent with Mono, but it was quite cumbersome.
Since .NET Core and now .NET (5/6), I’d say Linux is definitely a first-class citizen. Even multiple services offered in Azure run on Linux.

Regarding Docker: I’m running the Linux-based container in Docker on Windows 10, so I believe it runs in a VM anyway.

1 Like

Aside from setting DOTNET_GCHeapHardLimit, can you also compare the values of /proc/sys/vm/overcommit_memory inside Docker and on fly.io?

1 Like

Checked it last night: it is 0 in both Docker and Fly.
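For context, the three documented values of vm.overcommit_memory can be decoded as in the sketch below (the Overcommit class is hypothetical). With mode 0 the kernel generally lets large reservations succeed and only fails later when the pages are actually touched, which is when the OOM killer steps in; that would fit a process being killed instead of seeing a failed allocation.

```csharp
using System;
using System.IO;

static class Overcommit
{
    // Maps the raw content of /proc/sys/vm/overcommit_memory to a description.
    public static string Describe(string raw) => raw.Trim() switch
    {
        "0" => "heuristic overcommit (kernel default)",
        "1" => "always overcommit",
        "2" => "strict accounting, no overcommit",
        _ => "unknown",
    };

    static void Main()
    {
        const string path = "/proc/sys/vm/overcommit_memory";
        if (!File.Exists(path))
        {
            Console.WriteLine("overcommit_memory not found (not running on Linux?)");
            return;
        }
        Console.WriteLine(Describe(File.ReadAllText(path)));
    }
}
```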

1 Like

This is probably crashing because our VMs don’t include swap. The out of memory error is showing the process using more RAM than is available.

The weird part is that the .NET runtime throws an OutOfMemoryException (as expected) when the app is run on the OP’s local Docker installation, whereas on Fly the app instead ends up being killed by the kernel. That is, the .NET runtime acts as if it were oblivious to the amount of RAM available inside a Fly VM (which is strange).

Yeah, I believe the process is using more RAM than the .NET runtime indicates, and is probably going into swap on Docker.

The OOM error shows a lot more RAM in use than the .NET tracking does:

Killed process 515 (dotnet) ... anon-rss:202216kB

vs

Allocating next 10Mb, i = 20 (total: 200Mb)
1 Like

Nah, that is just the iteration * 10 result, i.e. how much I’m allocating in the loop. The private memory size is reported in the line above, and it is indeed higher than expected:

2022-08-30T16:10:52.502 app[79e9d943] fra [info] ENV: 79 MB/221 MB. PrivateMemorySize 252 MB

Anyway, this is somewhat aligned with the memory usage reported by the OOM killer.
I can see now in the logs the env memory usage is not refreshed frequently enough; will try to improve it and post updated logs.
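To compare the numbers without eyeballing them, the kernel's kill message can be parsed. This is a sketch with a hypothetical OomLine helper; the regular expression only targets the message shape seen in the logs of this thread. Note that total-vm is reserved virtual address space, while anon-rss is the resident anonymous memory; for the first kill above, anon-rss is 202216 kB, roughly 197 MB.

```csharp
using System;
using System.Text.RegularExpressions;

static class OomLine
{
    static readonly Regex Pattern = new Regex(
        @"Killed process (\d+) \((.+?)\) total-vm:(\d+)kB, anon-rss:(\d+)kB");

    // Extracts pid, process name, total-vm and anon-rss (both in kB),
    // or returns null when the line is not an OOM kill message.
    public static (int Pid, string Name, long TotalVmKb, long AnonRssKb)? Parse(string line)
    {
        var m = Pattern.Match(line);
        if (!m.Success) return null;
        return (int.Parse(m.Groups[1].Value),
                m.Groups[2].Value,
                long.Parse(m.Groups[3].Value),
                long.Parse(m.Groups[4].Value));
    }

    static void Main()
    {
        const string line =
            "Out of memory: Killed process 515 (dotnet) total-vm:3104720kB, " +
            "anon-rss:202216kB, file-rss:0kB, shmem-rss:0kB";
        var parsed = Parse(line);
        if (parsed is null) { Console.WriteLine("not an OOM kill line"); return; }
        var info = parsed.Value;
        Console.WriteLine(
            $"{info.Name} (pid {info.Pid}): resident {info.AnonRssKb / 1024} MB, " +
            $"virtual {info.TotalVmKb / 1024} MB");
    }
}
```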

Docker and swap is an interesting point; I missed it in my docker-compose. However, I tried again with swap limited to 0 (both mem_limit and memswap_limit set to the same value of 192m) and I still get the System.OutOfMemoryException. Let’s agree on calling it a graceful fail; it is still better than getting killed by the OOM killer. I also adjusted the app to report more current values for the environment memory (previously they were not updated in the loop, only at garbage collections, which happened at unpredictable times).
I also added logging of the /proc/swaps content.

This is my docker-compose:

services:
  memintense:
    build: .
    mem_limit: 192m
    memswap_limit: 192m

And the output from local Docker; again, it dies gracefully.

memintense_1  | Allocating next 10Mb, i = 14 (total: 140Mb)
memintense_1  | ENV: 145 MB/144 MB. PrivateMemorySize 207 MB
memintense_1  | SWAP: Filename                          Type            Size            Used            Priority
memintense_1  | /swap/file                              file            4194304         0               -2
memintense_1  |
memintense_1  | Allocating next 10Mb, i = 15 (total: 150Mb)
memintense_1  | Okay, got OutOfMemoryException!
memintense_1  | Out of memory.
memtestplain_memintense_1 exited with code 139

And the fly.io deployment:

2022-09-01T20:55:03.378 app[88457826] fra [info] Allocating next 10Mb, i = 19 (total: 190Mb)
2022-09-01T20:55:03.608 app[88457826] fra [info] ENV: 216 MB/221 MB. PrivateMemorySize 239 MB
2022-09-01T20:55:03.626 app[88457826] fra [info] SWAP: Filename Type Size Used Priority
2022-09-01T20:55:03.632 app[88457826] fra [info] Allocating next 10Mb, i = 20 (total: 200Mb)
2022-09-01T20:55:03.644 app[88457826] fra [info] [ 1.012858] Out of memory: Killed process 516 (dotnet) total-vm:3026240kB, anon-rss:202216kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:624kB oom_score_adj:0

I understand it’s getting killed because it allocates more memory than allowed; what I don’t understand is why that actually happens, i.e. why the dotnet runtime does not detect what is coming. I believe this is the root cause of the ASP.NET app dying.

The important bit, not only for me but for any future customer planning to deploy a .NET app here, is to figure out how it will behave under actual load. Will the load kill the app, or will the .NET runtime somehow respect the limits?
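One possible explanation, offered as a hypothesis rather than a confirmed diagnosis: if I read the .NET GC configuration docs correctly, a process running in a container with a memory limit gets a default heap hard limit of the greater of 20 MB and 75% of that limit. The sketch below (hypothetical helper name) just does that arithmetic. For mem_limit: 192m it yields 144 MB, which matches the total shown in the local Docker output, so the GC would raise OutOfMemoryException well before the container limit is reached. Without a cgroup limit no such default applies, and the kernel becomes the first one to object.

```csharp
using System;

static class DefaultHardLimit
{
    const long MB = 1024 * 1024;

    // Default GC heap hard limit for a container memory limit, as described in
    // the GC configuration docs: max(20 MB, 75% of the limit). Assumption: no
    // explicit GCHeapHardLimit or GCHeapHardLimitPercent setting is present.
    public static long ForContainerLimit(long containerLimitBytes) =>
        Math.Max(20 * MB, containerLimitBytes * 3 / 4);

    static void Main()
    {
        Console.WriteLine($"192m -> {ForContainerLimit(192 * MB) / MB} MB");
        Console.WriteLine($"200m -> {ForContainerLimit(200 * MB) / MB} MB");
    }
}
```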

1 Like

Swap-limit accounting is disabled on Ubuntu/Debian by default, so it looks like memswap_limit had no effect in your test (I see a 4 GB swap file in that log output). It does seem that the difference in the Docker test is that it’s using swap.
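The swap listing logged in the test can be summed up with a few lines of code. A minimal sketch with a hypothetical Swaps helper: it assumes the /proc/swaps layout shown in the logs (one header line, then Filename, Type, Size, Used, Priority with sizes in kB) and does not handle file names containing spaces.

```csharp
using System;
using System.IO;

static class Swaps
{
    // Sums the Size column (kB) of a /proc/swaps listing; the header is skipped.
    public static long TotalKb(string swapsText)
    {
        long total = 0;
        var lines = swapsText.Split('\n', StringSplitOptions.RemoveEmptyEntries);
        for (var i = 1; i < lines.Length; i++)
        {
            var cols = lines[i].Split(new[] { ' ', '\t' },
                                      StringSplitOptions.RemoveEmptyEntries);
            if (cols.Length >= 5 && long.TryParse(cols[2], out var kb)) total += kb;
        }
        return total;
    }

    static void Main()
    {
        const string path = "/proc/swaps";
        if (!File.Exists(path))
        {
            Console.WriteLine("/proc/swaps not found (not running on Linux?)");
            return;
        }
        Console.WriteLine($"Total swap: {TotalKb(File.ReadAllText(path)) / 1024} MB");
    }
}
```

For the local Docker log this gives 4194304 kB (4 GB); for the fly.io log, where only the header line is present, it gives 0.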

2 Likes

Good catch. I will try to use a modified base image to enable the memswap limit, as per the docs you linked.

1 Like

How about a swap = size_as_percentage_of_available_ram in fly.toml as an experimental feature?

Just a small tip: OutOfMemoryException in .NET is not 100% guaranteed to be catchable. That beast is somewhat similar to StackOverflowException. Most of the time you can catch it, but sometimes you can’t; it all depends on the memory conditions the app runs under. If the memory exhaustion is catastrophic enough, there may be no room left to execute the exception handler.

Common advice is to keep the OutOfMemoryException handler as short and memory-savvy as possible. For instance, instead of calling the Console.WriteLine method, one should consider Environment.FailFast, which requires minimal memory allocations and thus has a somewhat better chance to “survive” the exhaustion with some observable outcome.
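A minimal sketch of that advice, with hypothetical names: the handler uses a constant message so that nothing has to be formatted or allocated at the point of failure, and it calls Environment.FailFast by default. The optional callback exists only so the behavior can be exercised without terminating the process.

```csharp
using System;

static class OomGuard
{
    // A constant, so the handler does not need to build a string when memory is gone.
    const string Message = "Out of memory, terminating.";

    // Runs `work`; on OutOfMemoryException it calls `onOutOfMemory` if given,
    // otherwise terminates the process immediately via Environment.FailFast.
    public static void Run(Action work, Action<string> onOutOfMemory = null)
    {
        try
        {
            work();
        }
        catch (OutOfMemoryException)
        {
            if (onOutOfMemory != null) onOutOfMemory(Message);
            else Environment.FailFast(Message);
        }
    }

    static void Main()
    {
        // Simulated failure with a custom callback, so this demo exits normally.
        Run(() => throw new OutOfMemoryException(),
            message => Console.Error.WriteLine(message));
    }
}
```

In a real app the callback would be omitted so that FailFast runs; the Console call in Main is only for the demonstration and is exactly the kind of work the advice above suggests avoiding in production handlers.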

2 Likes