All machines across all apps on NRT reallocated memory to swap without memory pressure

I started to receive complaints from customers that my service was responding much more slowly throughout the day.

I noticed that all my machines in NRT have reallocated memory to swap even though there is no memory pressure. This happened across several different apps as well.

Example machine:

7-day memory usage (chart)

2-day memory usage (chart)

As far as I can tell, this happened to all my NRT machines at around the same time, while my Singapore machines were not affected.

Was there any incident in the NRT data center?

Hi @echoi!

Just to make sure I understand: you’re seeing swap usage correlate with the performance complaints? If you have any other metrics or logs that show the performance drop in detail, that would help us nail this down. The time correlation is a strong signal, but more context is always better.

At first glance, the swap usage seems to grow in periodic bursts of a few seconds (roughly every hour). We run maintenance jobs on the hosts periodically, so it could be related. I went ahead and adjusted the schedule on the host running your machine to make them happen less often, just in case that’s related.

That said, I wouldn’t necessarily expect a slowdown just because pages are swapped in for a brief moment. As long as there is Available memory, a swapped page should be paged back into physical RAM the moment it’s read. Since keeping a copy in swap is ‘free’, high usage isn’t automatically an indicator of memory pressure.

I’m specifically wondering about these two questions:

  1. Why is the swap growing at all during these short bursts, and
  2. Why would your app experience a slowdown during windows where swap usage is stable and there is no memory pressure?
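For completeness, a few relevant knobs can be checked from inside an affected machine. This is a sketch using standard Linux procfs paths, nothing platform-specific; note that /proc/pressure/memory only exists on kernels built with PSI support:

```shell
#!/bin/sh
# vm.swappiness controls how eagerly the kernel swaps anonymous pages
# relative to reclaiming page cache (default 60).
cat /proc/sys/vm/swappiness

# PSI (pressure stall information): "some avg10=0.00" means no task has
# stalled waiting on memory recently, i.e. no real memory pressure.
if [ -r /proc/pressure/memory ]; then
  cat /proc/pressure/memory
fi

# Current swap totals, straight from /proc/meminfo.
grep -E '^(SwapTotal|SwapFree|SwapCached):' /proc/meminfo
```

If swappiness is at its default and PSI shows no stalls, swap growth during those bursts is more likely proactive reclaim than genuine pressure.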

Hello,

Thanks for the response.

#1: That’s my question to you. Again, let me re-emphasize that this happened to every single machine, across different Docker images, different apps, different languages, different process groups, and different CPU groups (performance vs. shared), all at the same time. The only thing these machines have in common is that they were all in one org account in one region, NRT. So I highly doubt this has anything to do with my code or a particular Docker image.

#2: I’m not sure I understand the question. Are you asking why using swap would make my app slower than using RAM? The particular app mentioned in the original post was affected more significantly because it runs libSQL sqld for its database. I’m not sure if that answers your question, and I don’t think it’s too relevant, because again, the memory reallocation happened to every single machine in my org in the region.

Below is the p90/p95 of write operations to libSQL on one of the affected apps during the affected time.

My enquiry isn’t about the performance hit; if my app starts using swap instead of RAM, it will of course be slower. My enquiry is why all of the machines would suddenly decide to use swap when there is plenty of memory left. Total memory usage didn’t grow; it just reallocated to swap.
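For reference, the kernel records per-process swap usage in the VmSwap field of /proc/&lt;pid&gt;/status, so it is possible to see exactly which processes the swapped pages belong to. A sketch over standard procfs, run from inside a machine:

```shell
#!/bin/sh
# List processes with non-zero swap usage, largest first.
# VmSwap is reported in kB; processes that exit mid-scan are skipped.
for pid in /proc/[0-9]*; do
  awk -v p="$pid" '/^VmSwap:/ && $2 > 0 { print p, $2, $3 }' \
    "$pid/status" 2>/dev/null
done | sort -k2 -rn | head
```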

I just checked, and it is still happening even after a restart.

I would really appreciate any insight on this.

This is the only machine I did not restart.

meminfo

root@080e6e1fd66708:/app# cat /proc/meminfo
MemTotal:         985316 kB
MemFree:          732868 kB
MemAvailable:     708996 kB
Buffers:            1784 kB
Cached:            93652 kB
SwapCached:        52460 kB
Active:            26544 kB
Inactive:         135080 kB
Active(anon):      10528 kB
Inactive(anon):    59928 kB
Active(file):      16016 kB
Inactive(file):    75152 kB
Unevictable:        3072 kB
Mlocked:               0 kB
SwapTotal:        524284 kB
SwapFree:         313016 kB
Zswap:                 0 kB
Zswapped:              0 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:         65460 kB
Mapped:            43624 kB
Shmem:              4304 kB
KReclaimable:       3520 kB
Slab:              18184 kB
SReclaimable:       3520 kB
SUnreclaim:        14664 kB
KernelStack:        1516 kB
PageTables:         3388 kB
SecPageTables:         0 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     1016940 kB
Committed_AS:     417604 kB
VmallocTotal:   34359738367 kB
VmallocUsed:        8016 kB
VmallocChunk:          0 kB
Percpu:              268 kB
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:         0 kB
FilePmdMapped:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:       16384 kB
DirectMap2M:     1032192 kB
DirectMap1G:           0 kB
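Doing the arithmetic on the dump above: SwapTotal - SwapFree = 524284 - 313016 = 211268 kB sits in swap, while MemAvailable is still around 709 MB; SwapCached (52460 kB) is the portion of swap that is also still resident in RAM. A sketch that computes the same figures on a live machine, using only standard /proc/meminfo fields:

```shell
#!/bin/sh
# Summarise swap state from /proc/meminfo, the same way free(1) does:
# "used" swap is SwapTotal - SwapFree; SwapCached pages also live in RAM.
awk '
  /^SwapTotal:/    { total  = $2 }
  /^SwapFree:/     { free   = $2 }
  /^SwapCached:/   { cached = $2 }
  /^MemAvailable:/ { avail  = $2 }
  END {
    printf "swap used:   %d kB\n", total - free
    printf "swap cached: %d kB (also resident in RAM)\n", cached
    printf "mem avail:   %d kB\n", avail
  }
' /proc/meminfo
```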

I found out that all of these machines were (and still are) using
FROM denoland/deno:2.6.4 AS base
as their base image, so they were all on the same Docker image after all.
I apologise for the confusion. It’s much more likely this has to do with the application than with Fly machines.

It seems the behaviour has been back to normal since February 14th.

I have not made any changes to any of these machines so I’m still quite confused.

Regardless, we plan on updating our Docker image and hopefully won’t encounter this anymore.

Thanks for taking a look at it!

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.