Celery Beat process hangs silently after several hours

Hi Fly.io Community,

I’m running into a persistent issue with Django + Celery Beat on Fly.io and would love to get input from others who might have experienced similar problems.

The Problem

My celery beat process occasionally hangs silently after several hours of operation. The symptoms are:

  • Fly.io machine shows as “live” in the web UI

  • No errors in the logs - beat process appears to be running

  • Celery beat stops scheduling tasks entirely (tasks should run every minute)
    e.g. each minute should show a log line like:
    2025-07-16 18:32:00.011 Scheduler: Sending due task stop_expired_sessions (apps.sessions.tasks.stop_expired_sessions_task)

  • Machine has spare resources (256MB memory with ~40MB unused, minimal CPU usage)

  • Only solution is a manual restart of the beat process, or a manual stop/start of the Machine

This is critical because my app relies on minute-by-minute task scheduling, and silent failures mean tasks just… stop happening.

Current fly.toml configuration (following other examples)

[processes]
app = "daphne -b 0.0.0.0 -p 8000 myApp.asgi:application"
worker = "celery -A myApp worker -l INFO"
beat = "celery -A myApp beat -l INFO"

My Workaround Approach

I’ve implemented a supervisord-based healthcheck system that:

  1. Detects hanging processes: checks celery beat's schedule file every 10 seconds

  2. Aggressive restart: if tasks are >15 seconds late, immediately restarts beat

  3. No custom HTTP endpoints: uses a Django management command for the healthcheck logic (rough sketch below)
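
Roughly, the management-command side of the check looks like this - the path, names and staleness threshold are illustrative, and it's a simplified stand-in for what I actually check, not my exact code:

# apps/core/management/commands/check_beat.py (illustrative path)
import os
import sys
import time

from django.core.management.base import BaseCommand


class Command(BaseCommand):
    help = "Exit non-zero if the celery beat schedule file looks stale."

    def add_arguments(self, parser):
        parser.add_argument("--schedule-file", default="celerybeat-schedule")
        # beat only syncs its schedule file periodically, so allow headroom
        parser.add_argument("--max-age", type=int, default=300,
                            help="Seconds of staleness tolerated before failing")

    def handle(self, *args, **options):
        path = options["schedule_file"]
        try:
            age = time.time() - os.path.getmtime(path)
        except OSError:
            self.stderr.write(f"schedule file missing: {path}")
            sys.exit(1)
        if age > options["max_age"]:
            self.stderr.write(f"beat looks stale ({age:.0f}s since last write)")
            sys.exit(1)
        self.stdout.write(f"beat OK ({age:.0f}s since last write)")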

Hopefully this will let me stick to the minute-level scheduling requirement.
BUT: it's yet more custom code, and I've never had to do anything like this in years of running Celery Beat in its own container/pod in Kubernetes, where it would run for months without issue. So I'm really confused about what's going on.

Questions for the Community

  1. Has anyone else experienced silent celery beat hangs on Fly.io?

  2. Are there Fly.io-specific configurations that might prevent this?

  3. Any recommendations for monitoring background processes beyond standard logs?

  4. Alternative approaches to ensure reliable task scheduling?

I’m particularly interested in whether this is a known issue with long-running background processes on Fly.io, or if there are platform-specific best practices I should be following.

Thanks for any insights!

Are you saying the machine is using ~216MB? At the min 256MB config, 40MB free isn’t that much since there’s a lot of overhead w/ the OS. Every time my machine reaches 190MB capacity, it starts to lag and the latency spikes.

There is at least 40-50MB of unused memory available to the container at the time this happens, and CPU usage is less than 5%. If you're suggesting that the metrics reported via Grafana are misleading, please link to where this is documented.
However even if the process ran out of memory, then I would expect to see OOM (out of memory) errors in the logs, which don’t appear.

It doesn’t OOM, it just stays in a zone where it has to share the last bit of memory which causes the big latency spikes.

That is unrelated then to my issue. I don’t see increased latency or anything similar, I see a failure from one minute to the next for the celery beat process to continue scheduling tasks.
And my query about your resources claim stands: if you have unused memory available, then your process is not starved of RAM and therefore something else is causing the issue.

Yea I’m just giving you my experience related to RAM issues. If that’s not your case, then something else is up. Good luck!

1 Like

Hi @driez17 ,

There’s nothing Fly does to selectively “break” celery beat schedules - it’s just a process running on a VM; it shouldn’t just fail to do its thing for no reason.

I’d suggest cranking up your log level (you’re on INFO; maybe try DEBUG?) to see what Celery itself says about the issue. It sounds like it might be having trouble reaching the message broker and being unable to put jobs there. It’s strange that it wouldn’t log anything in that case, but I’ve seen Celery in the wild, on a non-Fly machine with RabbitMQ as the message broker, stop working just as mysteriously - it suddenly stops “seeing” the message broker, so it doesn’t get any jobs, but it also doesn’t log anything about it.
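
e.g. in your fly.toml that would just be:

beat = "celery -A myApp beat -l DEBUG"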

Back then the ‘solution’ was similar to what you did: we had an alert that fired when the job queue grew beyond a certain limit, and then we’d go in and kick the Celery workers, which promptly resumed working. To be fair, if the scheduler or job runner is business-critical, having monitoring and automatic mitigation in place sounds like a good idea even when the processes are running reliably.
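
In our case that was RabbitMQ, but with a Redis broker the check behind such an alert can be as simple as measuring the length of the queue list - the Redis transport keeps pending messages in a list named after the queue ("celery" by default, "asyncwork" per your routing). A minimal sketch, with the URL, queue name and limit as placeholders:

# queue_depth_check.py - minimal sketch using redis-py
import sys

import redis

BROKER_URL = "redis://localhost:6379/0"  # your broker URL
QUEUE = "asyncwork"                      # queue list to measure
LIMIT = 100                              # alert threshold

r = redis.Redis.from_url(BROKER_URL)
depth = r.llen(QUEUE)
print(f"{QUEUE} depth: {depth}")
sys.exit(1 if depth > LIMIT else 0)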

It would be interesting to know which broker you’re using (I speculate redis, let me know if I’m wrong) and your config options for Celery (which likely apply to Beat as well).

  • Daniel
1 Like

I agree with this; if there’s only 40MB free in the 256MB container, I’d advocate going for a larger machine for a week and seeing if the problem recurs. It’s still a cheap machine, and it would rule out a class of errors.

I’d add, in the spirit of helpfulness, that the whole thread feels like a debugging problem. It isn’t feasible for every monitoring edge case to be documented. In general, if a process stops working, the reason is probably in the logs (either at the application or container level), and as Daniel says, sometimes one has to turn up the log level to see it.

1 Like

Thanks for the suggestions. I did try running celery beat in DEBUG, but as expected it didn’t yield anything new. The beat process simply stops firing from one minute to the next, e.g.:

2025-07-21 00:29:00.024 Scheduler: Sending due task stop_expired_sessions (apps.sessions.tasks.stop_expired_sessions_task)
2025-07-21 00:30:00.052 Scheduler: Sending due task stop_expired_sessions (apps.sessions.tasks.stop_expired_sessions_task)
2025-07-21 00:31:00.045 Scheduler: Sending due task stop_expired_sessions (apps.sessions.tasks.stop_expired_sessions_task)
<NO MORE LOG LINES AFTER THIS>

It’s certainly true that anything critical should be monitored and mitigated - I just wasn’t expecting to have to put such measures in place so early in the deployment. And it’s good to know what’s actually going on with your core processes!

I am indeed using Redis and the Celery config is fairly vanilla:

###################################################################
# CELERY SETTINGS
###################################################################
CELERY_BROKER_URL = REDIS_URL
CELERY_ACCEPT_CONTENT = ['json']
CELERY_TASK_SERIALIZER = 'json'
CELERY_RESULT_BACKEND = REDIS_URL
CELERY_RESULT_PERSISTENT = False
CELERY_RESULT_SERIALIZER = 'json'
CELERY_IMPORTS = [ 'apps.sessions.tasks' ]
CELERY_TIMEZONE = TIME_ZONE # Use Django's timezone
CELERY_BROKER_CONNECTION_RETRY_ON_STARTUP = True # Recommended for robust startup
# task_routes
CELERY_TASK_ROUTES = {
    'apps.sessions.tasks*': {'queue': 'asyncwork'},  # <-- glob pattern
}

CELERY_BEAT_SCHEDULE = {
    "stop_expired_sessions": {
        "task": "apps.sessions.tasks.stop_expired_sessions_task",
        "schedule": crontab(),  # every minute
    },
}

If you have any other thoughts, I’d love to hear them :slight_smile:

Having run containers in many different runtimes, including deployments similar to this one, I don’t think increasing RAM on a container that shows static memory usage over hours, with memory still available according to the many metrics supplied in Grafana (and thus no indication of a memory leak or similar behavior), would result in any solution.
And my point about resource-utilization reporting has nothing to do with any edge case, monitoring or otherwise. The point is that the container runtime being employed allocates a set amount of memory which is usable by the container, and that does not include whatever is used by the host OS actually running the container runtime.
If you can’t trust the metrics reported, then alerting becomes a real issue. That’s why I asked for clarity on the claim that “there’s a lot of overhead w/ the OS”, as it implied a conflicting understanding of how container resources are allocated by the runtime.
And as I suspected from my experience running celery beat in various environments, debug logs provided no further information on the issue, which makes this genuinely anomalous behavior.

So I think it’s a valid query to put to the community, as I’m sure others have or will run into this issue.

Yes, no harm in asking.

Probably not, in my view [citation needed] :zany_face:

I guess I find myself just wanting to save your time, by suggesting what other things you could look into after asking the question. I am reminded of the many times someone has gotten stuck on Stack Overflow “for weeks” because, after asking, they stopped their own avenues of research.

The other suggestion I’d make is asking whether you have a support contract with Fly. I’m just a customer, so I’ve nothing to gain by suggesting it; I think the support here from employees is on the pro bono side, but I think they’ll do more digging into your machines if you’ve a subscription.

1 Like

Thanks for the details!

Is this Upstash Redis or a self-deployed Redis? Could you try the other and see if it helps?

Do you have an estimate of how frequently beat stops working? (asking because I can try standing up a Celery Beat machine on my side and see if I get the same thing happening, just to get another data point - can’t make any promises but I’ll give it a try).

I googled for “celery beat stops” and there’s a good number of matches, so it seems it’s not an uncommon problem. It might be worth trying some of the suggestions they make there.

Just to reiterate that Fly machines are not containers. They’re full-blown VMs (using Amazon Firecracker, so they count as “microvms”, but for the purposes of a mostly server-side app the distinction is basically negligible). Save for something going very wrong platform-wide, the entirety of the resources you allocate is available for you to use. You can fly ssh console into the machine and explore like you would any other machine; you can use htop to see what’s going on, understand which processes are using the most memory, how much memory you have, etc. You’ll find a few custom processes (e.g. init is a Fly-managed process and hallpass is our SSH daemon) but otherwise there’s really not a ton of magic here.
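
For example (assuming a typical Debian-ish image; htop only if you’ve installed it in your image):

fly ssh console -a your-app-name
# then, inside the machine:
free -m                           # memory actually in use vs. available
ps aux --sort=-rss | head -n 15   # processes using the most memory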

  • Daniel
2 Likes

Thanks!

I’m using Upstash Redis, and in particular the one integrated with fly.io, as this seemed the obvious choice when setting up the service on Fly. I haven’t seen this behavior elsewhere, in years of running Django+celery on local docker-compose, local k8s and cloud-hosted k8s.

The celery beat process stops working after typically 4-8 hours of operation, after which I have to either restart the process or machine.

I had assumed that the scheduler output, i.e.

2025-07-21 00:29:00.024 Scheduler: Sending due task stop_expired_sessions (apps.sessions.tasks.stop_expired_sessions_task)

would occur regardless of Redis connectivity, but your point makes me question that. Perhaps that’s only output after a successful response from Redis. In which case I’m dealing with an intermittent network-connectivity issue, because the Upstash Redis shows no issues and is there ready to use whenever the celery beat process is restarted.

I appreciate your point re containers vs micro-vms and the resources available to processes.
I have done a bit of poking around already via fly ssh console, but I have yet to modify the image to include the various diagnosis tools. I can if necessary start digging in with strace and tcpdump and get very granular, but in my experience if a common usage pattern running default config is not working, there is something more fundamentally wrong.

Thanks for confirming! My suggestion stands: try starting Redis in a Fly app (a small machine suffices) and seeing if that gets you better stability, just to rule out Upstash Redis being an issue here.

app = "personal-valkey"
primary_region = "dfw"
kill_timeout = "5s"

[build]
    image = "valkey/valkey"
[processes]
app = 'valkey-server --save 60 1 --loglevel debug'

[[vm]]
  size = "shared-cpu-1x"
  memory = "512mb"

[mounts]
  source = "valkey_data"
  destination = "/data"
  initial_size = "3gb"
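
If you go this route, pointing Celery at it is just a broker URL change over Fly private networking - something like the line below, assuming the app name from the example config and no auth configured:

CELERY_BROKER_URL = "redis://personal-valkey.internal:6379/0"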

For what it’s worth, I did set up a trivial beat job running every minute on my Celery-powered app using Upstash Redis as the broker. I deployed this on Jul 21 2025 20:05 UTC, so it’s been going for about 36 hours and it has not stopped or had any problems scheduling jobs. I’m using Celery 5.5.3 if it matters.

It’s entirely possible that the current connection to the broker times out and beat doesn’t reconnect properly. Restarting beat fixes this, obviously - that would point to unreliable disconnect/reconnect behavior in beat, and maybe Celery itself. It’s something I’ve seen many times, where Celery workers disconnect from the broker silently (and I’ve seen this with RabbitMQ as well as Redis). I would expect this to show up in the logs - but as I’ve mentioned, it can happen silently, which is confusing and hard to debug.

Fair enough, but I’ll note that the default config for Celery workers enables mingle, gossip and heartbeat, and you’ll find many guides on the Internet recommending disabling those under most circumstances, particularly if you only have one worker and/or don’t need a lot of synchronization between workers.
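
Concretely, that would be something like the worker line below (flags per the Celery 5 worker CLI - worth double-checking against your version):

worker = "celery -A myApp worker -l INFO --without-mingle --without-gossip --without-heartbeat"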

Let me know if these are useful!

  • Daniel
2 Likes

Thanks again for these valuable suggestions.

I did a little further digging and found a similar-sounding issue with a user of DigitalOcean: https://www.reddit.com/r/django/comments/1lx0kte/celery_just_stops_running_tasks/
I followed their action of disabling Redis as a results backend for Celery, as I am not really using results thus far in my Celery codebase.
Since then I have not seen any further issues with the Celery beat process, and it has run for 16+ hours without missing any scheduling. I’m guessing that you did not enable Redis as a results backend? (I’m on Celery 5.5.2 currently btw)
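
For anyone following along, the change was just dropping the result backend from the settings shown earlier, i.e. roughly:

# previously: CELERY_RESULT_BACKEND = REDIS_URL
CELERY_RESULT_BACKEND = None  # disable result storage entirely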

If you have the inclination/time, you could test whether you also get a similar “silent hang” with celery beat and results enabled on Redis. This is looking more and more like an upstream Celery bug, but at least it can be a cautionary tale for Fly users with celery+redis who need Redis as a results backend.

Hi again,

You’re right, I’m not storing results because I don’t need them. I think this might be the key to what you’re seeing.

If you’re not consuming the results but didn’t add ignore_result=True to your task, they’re still being stored on the backend - depending on what your results are and how frequently tasks run, is it possible the backend is filling up and/or the main celery queue is being evicted or something similar, and that causes beat to stop working?
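
If you do end up needing results later, a couple of knobs worth knowing about - a sketch, assuming the usual CELERY_ settings namespace and a @shared_task-style task definition:

from celery import shared_task

# per-task: don't store a result for the beat-scheduled task
@shared_task(ignore_result=True)
def stop_expired_sessions_task():
    ...

# or globally in settings, plus let Redis clean up stored results:
CELERY_TASK_IGNORE_RESULT = True
CELERY_RESULT_EXPIRES = 3600  # seconds before stored results are deleted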

I’m going to start a copy of my app with result storage enabled and see how it behaves.

  • Daniel
1 Like

is it possible the backend is filling up and/or the main celery queue is being evicted or something similar, and that causes beat to stop working?

It looks like it will be something along these lines, doesn’t it? Or an equally subtle network-connection problem. I haven’t found an upstream issue acknowledging this behavior yet, although the DO report is very suggestive.