Memory leak

Hello,

I’m running a PHP 8.1 service and it appears there is a memory leak somewhere.

The leak is tiny, but after a few days the memory reaches the maximum available on my VM, the service stops responding entirely, and I have to suspend/resume it by hand.

I ran several tests with valgrind, narrowed things down to a few pieces of code, and tried to understand what’s happening, but it seems the problem does not come from the service itself at all.

Could it come from all the logs the app is generating on STDOUT?

Is there anything else I could test directly on the VM over SSH?

After some digging, I found that the Firecracker Memory graph in the metrics shows the sum of all the memory counters. But with top I have:

Mem: 176120K used, 50352K free, 0K shrd, 68936K buff, 37684K cached

And if I look at my PHP service I have:

PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
691   678 root     R    47320  21%   0  96% /usr/local/bin/php src/index.php

This shouldn’t take up any memory (or almost none). We copy logs using a zero-copy approach (via the splice syscall), which is very efficient.

You can see memory per-instance by using a Grafana dashboard.

I’d suggest SSHing in, like you’re already doing, and keeping an eye on top over time.

Is there anything else using much memory according to top?
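
If it helps, a crude way to keep a record over time is a shell loop that appends a timestamped memory snapshot every few minutes (just a sketch; it assumes free and date are available in your image, and /tmp/mem.log is an arbitrary path):

> while true; do date >> /tmp/mem.log; free >> /tmp/mem.log; sleep 300; done &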

That makes sense.

In fact, I don’t really understand how to read the memory statistics. Here’s what I have:

  PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
  691   678 root     R    47320  21%   0  96% /usr/local/bin/php src/index.php
  516     1 root     S     689m 311%   0   0% /.fly/hallpass
    1     0 root     S     4312   2%   0   0% /fly/init
  722   516 root     S     1696   1%   0   0% /bin/sh
  796   516 root     S     1664   1%   0   0% /bin/sh
  797   796 root     R     1596   1%   0   0% top
  678   676 root     S     1596   1%   0   0% {run} /bin/sh ./run
  676   515 root     S      204   0%   0   0% s6-supervise app
  544   515 root     S      204   0%   0   0% s6-supervise s6-fdholderd
  515     1 root     S      204   0%   0   0% s6-svscan -t0 /var/run/s6/services

I obviously noticed how much memory hallpass was using, but that is probably normal and shared between instances.

As far as I understand, the “leak” I see in the Firecracker memory graph comes from the buffer/cache amount in the instance. But… buffer/cache memory is normally managed by the OS and freed as needed, isn’t it? So why is it shown in the graph?
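
One way I thought of to check whether that memory is really reclaimable (assuming /proc/sys/vm/drop_caches is writable inside the VM, which I haven’t verified) would be to drop the caches by hand and see whether the graph goes back down:

> sync
> echo 3 > /proc/sys/vm/drop_caches
> free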

After a “reboot” of the instance:

Mem: 68000K used, 158472K free, 0K shrd, 2704K buff, 32196K cached
CPU:  85% usr  10% sys   0% nic   3% idle   0% io   0% irq   0% sirq
Load average: 0.48 0.13 0.04 3/52 694
  PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
  691   678 root     R    35492  16%   0  96% /usr/local/bin/php src/index.php
  516     1 root     S     688m 311%   0   0% /.fly/hallpass
    1     0 root     S     4300   2%   0   0% /fly/init
  692   516 root     S     1664   1%   0   0% /bin/sh
  693   692 root     R     1596   1%   0   0% top
  678   676 root     S     1596   1%   0   0% {run} /bin/sh ./run
  515     1 root     S      204   0%   0   0% s6-svscan -t0 /var/run/s6/services
  676   515 root     S      204   0%   0   0% s6-supervise app
  544   515 root     S      204   0%   0   0% s6-supervise s6-fdholderd

edit: This is roughly the same output as earlier; the significant differences are the used memory (176120K vs 68000K) and the buffer (68936K vs 2704K). But it seems to me that this top output is missing some per-process memory info. I’ll try to find something else.

OK, I found how to display some more info:

Mem total:226472 anon:12164 map:17732 free:156660
 slab:12076 buf:3688 cache:32312 dirty:24 write:0
Swap total:0 free:0
  PID   VSZ^VSZRW^  RSS (SHR) DIRTY (SHR) STACK COMMAND
  516  688m 41392  4360     4  1196     0   132 /.fly/hallpass
  691 35496 11728 21644   472 10376     0   132 /usr/local/bin/php src/index.php
    1  4304   608  3040     4   272     0   132 /fly/init
  692  1668   236  1008   852   136     0   132 /bin/sh
  696  1616   184   928   824    68     0   132 top -H
  678  1600   168   904   820    60     0   132 {run} /bin/sh ./run
  544   208   144    64    44    20     0   132 s6-supervise s6-fdholderd
  515   208   144    64    12    20     0   132 s6-svscan -t0 /var/run/s6/services
  676   208   144    60    44    16     0   132 s6-supervise app

I’ll keep an eye on this and come back in a day or two if something interesting comes up.

Good! You should look at the RSS column. Virtual memory isn’t what would cause out-of-memory kills.
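
If you want a quick way to read just that process’s RSS over SSH, /proc exposes it directly (691 being the PHP PID from your output, so adjust as needed):

> grep VmRSS /proc/691/status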

I took a snapshot of the results yesterday, after giving the service some time to warm up:

Mem total:226472 anon:24936 map:18212 free:131056
 slab:13692 buf:12524 cache:34744 dirty:0 write:0
Swap total:0 free:0
  PID   VSZ^VSZRW^  RSS (SHR) DIRTY (SHR) STACK COMMAND
  516  688m 41392  6676     4  3512     0   132 /.fly/hallpass
  691 46720 22952 32576   472 20828     0   132 /usr/local/bin/php src/index.php
    1  4304   608  3040     4   272     0   132 /fly/init
  692  1668   236  1008   852   136     0   132 /bin/sh
  696  1616   184   928   824    68     0   132 top -H
  678  1600   168   904   820    60     0   132 {run} /bin/sh ./run
  544   208   144    64    44    20     0   132 s6-supervise s6-fdholderd
  515   208   144    64    12    20     0   132 s6-svscan -t0 /var/run/s6/services
  676   208   144    60    44    16     0   132 s6-supervise app

The usage is still normal.

Now, here’s what I had this morning:

Mem total:226472 anon:23516 map:18220 free:69248
 slab:23776 buf:65588 cache:34752 dirty:24 write:0
Swap total:0 free:0
  PID   VSZ^VSZRW^  RSS (SHR) DIRTY (SHR) STACK COMMAND
  516  689m 46128  5124     4  1960     0   132 /.fly/hallpass
  691 46720 22952 32660   472 20904     0   132 /usr/local/bin/php src/index.php
    1  4304   608  3040     4   272     0   132 /fly/init
  698  1668   236  1112   880   196     0   132 /bin/sh
  699  1616   184   916   844    72     0   132 top -H
  678  1600   168   904   816    60     0   132 {run} /bin/sh ./run
  544   208   144    64    44    20     0   132 s6-supervise s6-fdholderd
  515   208   144    64    12    20     0   132 s6-svscan -t0 /var/run/s6/services
  676   208   144    60    44    16     0   132 s6-supervise app

The memory used by the service is still the same, but we have lost about 60 MB of system memory and the buffer is 6 times higher! I have no idea where this memory goes, but it is indeed counted as actually used memory in the Firecracker memory graph.

I’ll keep investigating to find what I’m missing here.

After a bit of reading, I ran this:

> free -m
              total        used        free      shared  buff/cache   available
Mem:            221          39          66           0         116         175
Swap:             0           0           0

It seems that there are actually 175 MB of available memory, not 66 MB. The buffer/cache is normally used by the kernel to store frequently accessed data; generally speaking, it caches disk data. But here there is no disk, so what could it be caching?

The fly.io metrics report 155 MB of used memory, and from free -m we can see that this corresponds to 39 MB used + 116 MB buff/cache. So the metrics count the buffer as used memory? I’ve read here and there that the buffer should not be a problem and should be released as needed, but that does not seem to be the case with Firecracker?
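
As a sanity check, the kernel’s own estimate of reclaimable memory is exposed as MemAvailable in /proc/meminfo (which is, as far as I understand, where free -m takes its “available” column from), so I can watch it alongside the graph:

> grep -E 'MemTotal|MemAvailable|Buffers|Cached' /proc/meminfo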

I stumbled upon this CVE: CVE-2020-27174 (English version) | ProHacktive

Which version of Firecracker does your infrastructure run? It really seems to match my case (as a reminder, my service outputs a lot of logs).

I misread the CVE description: it talks about STDIN, not STDOUT. Anyway, I’ll try changing the log level from INFO to WARNING to see if it changes anything.

Cache/buffer is used memory, but it’s safe to ignore (for the most part). Are you experiencing OOM crashes, or is the service just not responding after a certain amount of time?

If it’s non-responsive, it may not be directly related to memory use.

At some point, when the reported memory is full, I cannot even display the logs.

I didn’t try connecting through SSH at that point though, so maybe some further tests are needed.

Just wondering: why don’t you report only the available memory instead of used + buffer?

Yes, definitely try SSH. If memory is the problem, you should see something other than cache/buffer changing.

I don’t know exactly how much logging you’re generating, or where it’s getting written, but I’d double-check that PHP isn’t buffering stdout and that no log files are being generated.
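
One quick way to see where the service’s output actually goes (and whether it has any log files open) is to list the process’s open file descriptors, e.g. using the PHP PID from your earlier top output:

> ls -l /proc/691/fd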

There shouldn’t be any log file generated by the service, because I didn’t configure any adapter for that. But PHP could well write something by default based on its php.ini, and that could indeed explain why the buffer fills up like that.
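
I’ll check what the loaded php.ini actually sets for logging and output buffering, something like:

> php --ini
> php -i | grep -iE 'error_log|log_errors|output_buffering'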

I’m closing this thread.

The logs generated by the service were not responsible for that buffer, and neither were the other things I tested. I recently changed the architecture of my service and now the memory is stable. It probably came from some underlying layer in PHP and the extensions used to handle FTP connections, but that’s just an assumption.

Anyway, thanks for your kind help.