I’m running a PHP 8.1 service and it appears there is a memory leak somewhere.
The leak is tiny, but after a few days the memory reaches the maximum available on my VM, the service stops responding entirely, and I have to suspend/resume it by hand.
I ran several tests with Valgrind, narrowed things down to a few pieces of code, and tried to understand what’s happening, but it seems the problem doesn’t come from the service itself at all.
Could it come from all the logs the app is generating on STDOUT?
Is there anything else I could test directly on the VM over SSH?
This shouldn’t take up any memory (or almost none). We copy logs in a “zero copy” fashion via the splice syscall, which moves data between file descriptors inside the kernel without ever copying it through user space, so it’s very efficient.
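If you want to see that in action from inside the VM, you could attach strace to the process that reads your stdout and watch for splice calls. This is only a rough sketch: it assumes strace is installed in your image and that the log-reading process is visible from inside the guest, which may not be the case; the PID placeholder is hypothetical.

> strace -f -e trace=splice -p <pid-of-log-reader>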
In fact, I don’t really understand how to read the memory statistics. Here’s what I have:
PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND
691 678 root R 47320 21% 0 96% /usr/local/bin/php src/index.php
516 1 root S 689m 311% 0 0% /.fly/hallpass
1 0 root S 4312 2% 0 0% /fly/init
722 516 root S 1696 1% 0 0% /bin/sh
796 516 root S 1664 1% 0 0% /bin/sh
797 796 root R 1596 1% 0 0% top
678 676 root S 1596 1% 0 0% {run} /bin/sh ./run
676 515 root S 204 0% 0 0% s6-supervise app
544 515 root S 204 0% 0 0% s6-supervise s6-fdholderd
515 1 root S 204 0% 0 0% s6-svscan -t0 /var/run/s6/services
I obviously noticed how much memory hallpass is using, but that is probably normal and shared between instances.
As far as I understand, the “leak” I see in the Firecracker memory graph comes from the buffer/cache amount in the instance. But… buffer/cache memory is normally managed by the OS and freed as needed, isn’t it? So why does it show up in the graph as used memory?
Mem: 68000K used, 158472K free, 0K shrd, 2704K buff, 32196K cached
CPU: 85% usr 10% sys 0% nic 3% idle 0% io 0% irq 0% sirq
Load average: 0.48 0.13 0.04 3/52 694
PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND
691 678 root R 35492 16% 0 96% /usr/local/bin/php src/index.php
516 1 root S 688m 311% 0 0% /.fly/hallpass
1 0 root S 4300 2% 0 0% /fly/init
692 516 root S 1664 1% 0 0% /bin/sh
693 692 root R 1596 1% 0 0% top
678 676 root S 1596 1% 0 0% {run} /bin/sh ./run
515 1 root S 204 0% 0 0% s6-svscan -t0 /var/run/s6/services
676 515 root S 204 0% 0 0% s6-supervise app
544 515 root S 204 0% 0 0% s6-supervise s6-fdholderd
edit: So this is much the same output as earlier; the significant differences are the used memory (176120K vs 68000K) and the buffer (68936K vs 2704K). But it seems to me that the top output is missing some per-process memory info. I’ll try to find something else.
The memory used by the service is still the same, but we have lost about 60 MB of system memory and the buffer is 6 times higher! I have no idea where this memory goes, but it is indeed counted in the Firecracker memory graph as actually used memory.
I’ll keep investigating to find what I’m missing here.
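One test I can run over SSH to check whether that buffer memory is actually reclaimable: ask the kernel to drop its caches and compare the numbers before and after (a quick sketch; needs root):

> free -m                                      # before
> sync && echo 3 > /proc/sys/vm/drop_caches    # drop page cache, dentries, inodes
> free -m                                      # after

If buff/cache shrinks back down and free grows, it was just reclaimable cache rather than a real leak.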
> free -m
total used free shared buff/cache available
Mem: 221 39 66 0 116 175
Swap: 0 0 0
It seems there are actually 175 MB of available memory, not 66 MB. The buffer/cache is normally used by the kernel to store frequently accessed data; generally speaking, it caches disk data. But here there is no disk, so what could it be caching?
In the fly.io metrics, the used memory is reported as 155 MB, and we can see from free -m that this corresponds to 39 MB used + 116 MB buff/cache. So the metrics count the buffer as used memory? I’ve read here and there that the buffer shouldn’t be a problem and should be released as needed, but that doesn’t seem to be the case with Firecracker?
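For what it’s worth, the kernel itself already makes this distinction: MemAvailable in /proc/meminfo is its estimate of how much memory could be handed out without swapping, and it already accounts for reclaimable cache (it’s the same figure free -m shows as “available”):

> grep -E 'MemFree|MemAvailable|Buffers|^Cached' /proc/meminfo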
Which version of Firecracker runs in your infrastructure? It really seems to match my case (as a reminder, my service outputs a lot of logs).
I misread the CVE description: it talks about STDIN, not STDOUT. Anyway, I’ll try changing the log level from INFO to WARNING to see if it changes anything.
Cache / buffer is used memory, but it’s safe to ignore (for the most part). Are you experiencing OOM crashes or is the service just not responding after a certain amount of time?
If it’s non-responsive, it may not be directly related to memory use.
Yes, definitely try SSH. If memory is the problem, you should see something other than cache/buffer change.
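For example, something along these lines over a few hours (a sketch; adjust the PID lookup to however your process shows up, and note pidof may return several PIDs):

> watch -n 60 free -m                           # track overall used vs buff/cache
> grep -E 'VmRSS|VmSwap' /proc/$(pidof php)/status   # resident memory of the PHP process

If VmRSS of the PHP process keeps climbing while buff/cache stays flat, the leak is in the service; if only buff/cache grows, it’s kernel cache.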
I don’t know exactly how many logs you’re generating or where they’re written, but I’d double-check that PHP isn’t buffering stdout and that no log files are being generated.
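A quick way to check both from a shell, assuming the php CLI picks up the same php.ini as your service (which it should here, since the service itself runs via the CLI binary):

> php -i | grep -E 'output_buffering|error_log|log_errors'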
There shouldn’t be any log file generated by the service, because I didn’t configure any adapter for that. But PHP could well generate something by default based on its php.ini and, indeed, that could explain why the buffer fills up like this.
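To rule that out, I’ll look for any files written recently anywhere on the VM’s filesystem (a rough check, staying on the root filesystem):

> find / -xdev -type f -mmin -60 2>/dev/null | head -n 20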
The logs generated by the service were not responsible for that buffer, and neither were the many other things I tested. I recently reworked the architecture of my service and now the memory is stable. The leak probably came from some underlying layer in PHP or the extensions used to handle FTP connections, but these are just assumptions.