Worker out of memory

Hi all,

I am looking for help with my Django-powered app, as I am experiencing performance issues that I cannot find a solution for. Here are the logs:

$ fly logs -a bl-app

Waiting for logs…

2025-06-10T11:46:21.539 app[148e276f172968] ams [info] [2025-06-10 11:46:21 +0000] [632] [CRITICAL] WORKER TIMEOUT (pid:639)

2025-06-10T11:46:21.542 app[148e276f172968] ams [info] [2025-06-10 11:46:21 +0000] [632] [CRITICAL] WORKER TIMEOUT (pid:640)

2025-06-10T11:46:21.619 app[148e276f172968] ams [info] [2025-06-10 13:46:21 +0200] [639] [INFO] Worker exiting (pid: 639)

2025-06-10T11:46:21.860 app[148e276f172968] ams [info] [2025-06-10 13:46:21 +0200] [640] [INFO] Worker exiting (pid: 640)

2025-06-10T11:46:23.219 app[148e276f172968] ams [info] [2025-06-10 11:46:23 +0000] [632] [ERROR] Worker (pid:640) was sent SIGKILL! Perhaps out of memory?

2025-06-10T11:46:23.459 app[148e276f172968] ams [info] [2025-06-10 11:46:23 +0000] [632] [ERROR] Worker (pid:639) was sent SIGKILL! Perhaps out of memory?

2025-06-10T11:46:23.460 app[148e276f172968] ams [info] [2025-06-10 11:46:23 +0000] [644] [INFO] Booting worker with pid: 644

2025-06-10T11:46:24.020 app[148e276f172968] ams [info] [2025-06-10 11:46:23 +0000] [645] [INFO] Booting worker with pid: 645

2025-06-10T11:46:57.655 app[148e276f172968] ams [info] [WSGI] Memory on startup: 39.02 MB

2025-06-10T11:46:57.655 app[148e276f172968] ams [info] [WSGI] Setting default DJANGO_SETTINGS_MODULE

2025-06-10T11:46:57.659 app[148e276f172968] ams [info] [WSGI] Calling get_wsgi_application()

2025-06-10T11:46:57.667 app[148e276f172968] ams [info] [WSGI] Memory on startup: 39.02 MB

2025-06-10T11:46:57.667 app[148e276f172968] ams [info] [WSGI] Setting default DJANGO_SETTINGS_MODULE

2025-06-10T11:46:57.667 app[148e276f172968] ams [info] [WSGI] Calling get_wsgi_application()

2025-06-10T11:47:54.423 app[148e276f172968] ams [info] [2025-06-10 11:47:53 +0000] [632] [CRITICAL] WORKER TIMEOUT (pid:644)

2025-06-10T11:47:54.424 app[148e276f172968] ams [info] [2025-06-10 11:47:54 +0000] [632] [CRITICAL] WORKER TIMEOUT (pid:645)

2025-06-10T11:47:54.579 app[148e276f172968] ams [info] [2025-06-10 13:47:54 +0200] [645] [INFO] Worker exiting (pid: 645)

2025-06-10T11:47:54.580 app[148e276f172968] ams [info] [2025-06-10 13:47:54 +0200] [644] [INFO] Worker exiting (pid: 644)

2025-06-10T11:47:56.098 app[148e276f172968] ams [info] [2025-06-10 11:47:56 +0000] [632] [ERROR] Worker (pid:644) was sent SIGKILL! Perhaps out of memory?

2025-06-10T11:47:56.099 app[148e276f172968] ams [info] [2025-06-10 11:47:56 +0000] [632] [ERROR] Worker (pid:645) was sent SIGKILL! Perhaps out of memory?

2025-06-10T11:47:56.183 app[148e276f172968] ams [info] [2025-06-10 11:47:56 +0000] [648] [INFO] Booting worker with pid: 648

2025-06-10T11:47:56.499 app[148e276f172968] ams [info] [2025-06-10 11:47:56 +0000] [649] [INFO] Booting worker with pid: 649

The application logs suggest a potential memory issue, but I’m skeptical, as even static pages occasionally fail to load. This intermittent behavior is key: the app functions perfectly at times, then unexpectedly produces the errors shown in the logs. I haven’t yet identified a pattern, though a common scenario is sudden unresponsiveness, even for simple pages.

To investigate the memory angle, I added a print statement to wsgi.py, which reports usage of only around 40 MB at startup. I’ve already attempted various fixes, including disabling Sentry, increasing the worker timeout to 90 seconds, and optimizing database queries with prefetch_related and select_related, but the problem keeps recurring at random.
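For reference, the print statement in wsgi.py boils down to something like this (a minimal sketch; the psutil call and the bl_project.settings module name are assumptions on my part, but the log wording matches the output above):

import os

import psutil  # assumption: any way of reading the process RSS would do
from django.core.wsgi import get_wsgi_application

# Report this worker's resident memory as it boots.
rss_mb = psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024)
print(f"[WSGI] Memory on startup: {rss_mb:.2f} MB", flush=True)

print("[WSGI] Setting default DJANGO_SETTINGS_MODULE", flush=True)
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "bl_project.settings")  # assumed module path

print("[WSGI] Calling get_wsgi_application()", flush=True)
application = get_wsgi_application()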

Thanks for your input. Do you have any further ideas?

Note: I had a similar issue about half a year ago, which I was able to resolve by optimizing the database queries. See here.

Your prior fixes are a bit confusing. Is Sentry also a backend agent? I tend to think of it as a JavaScript-based front-end error collator, so I am trying to work out how disabling it would save memory on the backend.

Are the workers Django listeners? How many does Django start up, what is the maximum, and how much RAM does your VM have? A 90-second timeout is far too long IMO; if resources are stretched so thinly that a connection is left unresolved for that long, the browser will have disconnected anyway.

What database are you using?

(Side note: I’m only one opinion, and maybe I’m getting old, but you don’t need AI text generation tools to talk to people. The output is unfortunately rather synthetic. I’d rather see grammar errors! :robot:)

Hi @halfer,

Many thanks for your reply. To answer your questions:

- Is Sentry also a backend agent? I have used Sentry to collect backend issues in my app, such as N+1 queries.
- Are the workers Django listeners? I am using Gunicorn to serve the app. The workers are Gunicorn workers.
- How many does Django start up? I’m running the app with Gunicorn using 2 workers, as specified in the Dockerfile:

CMD ["gunicorn", "--bind", ":8000", "--workers", "2", "--timeout", "90", "bl_project.wsgi"]

So Django doesn’t start any additional processes itself; it’s Gunicorn that manages request handling, and each of the 2 workers handles requests independently. I’m currently using the default synchronous worker class (a config-file equivalent is sketched after this list).

- How much RAM does your VM have? I am using two VMs with 1024 MB each.
- What database are you using? Postgres
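For completeness, the same Gunicorn settings could also live in a gunicorn.conf.py rather than CMD flags; this is just a sketch of the equivalent, and the commented-out max_requests lines are only an idea for recycling workers if a slow leak were suspected, not something I currently use:

# gunicorn.conf.py -- equivalent of the flags in the Dockerfile CMD above
bind = ":8000"
workers = 2
timeout = 90
worker_class = "sync"  # the default worker class, stated explicitly

# Optional memory hygiene (not enabled here): restart each worker after a
# number of requests so a slow leak cannot accumulate indefinitely.
# max_requests = 500
# max_requests_jitter = 50

Gunicorn picks up ./gunicorn.conf.py automatically, or the file can be passed explicitly with -c.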

One more addition: I have added a health check to my fly.toml file. That page simply returns a JSON response with a simple "ok" message, i.e. no database access, and even that page does not load at times; sometimes it does and sometimes it doesn’t.

# urls.py
from django.http import JsonResponse
from django.urls import path


def health_check(request):
    print("[Health] /health/ was called")
    return JsonResponse({"status": "ok"})


# Appended to the existing urlpatterns
urlpatterns.append(
    path("health/", health_check, name="health-check")
)

I thought that could be helpful additional information, as it makes the whole situation even more of a riddle for me.


Right! That is a much better post, thanks.

Does Gunicorn spin up more workers based on load? There are more than two in your logs, though I wonder if it is merely respawning replacements to keep two running as earlier ones exit or are killed.

[I am using] Postgres

Is the server running externally, e.g. as a managed service? I assume the db server itself is not running on one of the 1 GB RAM VMs.

One more addition: I have added a health check to my fly.toml file. That page simply returns a JSON response with a simple "ok" message, i.e. no database access, and even that page does not load at times; sometimes it does and sometimes it doesn’t.

Yes, that’s a good observation. I wonder if your workers are becoming unavailable. I am not sure why that would be.

Next I would add a process that logs memory every ten seconds (even the output of free -h would be super valuable). You can send that to stdout, so it appears interleaved in the same logs. The usual reason for a process being SIGKILLed is running out of memory, but it would be worth confirming that theory.
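As a rough sketch of what I mean (purely illustrative; the file name and how you launch it alongside Gunicorn are up to you):

# memlog.py -- print the output of free -h every ten seconds so it appears
# interleaved with the Gunicorn output in the Fly logs.
import subprocess
import time

while True:
    result = subprocess.run(["free", "-h"], capture_output=True, text=True)
    print("[memlog]\n" + result.stdout, flush=True)
    time.sleep(10)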