Sidekiq workers froze, now deployments are stuck

Hey guys,

Some time between Sunday 5/22 and today 5/25 (GMT), my Sidekiq workers froze. They had been humming along for months, with a few deployments in between. It's a Rails app on Ruby 3.0.2, with multiple process types in lax: the usual web process and a handful of separate worker processes.
Some of the workers run with 5 threads, since a good chunk of the time they are waiting on IO.
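
For context, the process layout in fly.toml looks roughly like this (a minimal sketch, not my actual config; process names, commands, and queue names are placeholders):

```toml
[processes]
  # One web process plus dedicated Sidekiq workers, each running
  # 5 threads (-c 5) since the jobs spend most of their time waiting on IO.
  web     = "bundle exec puma -C config/puma.rb"
  worker  = "bundle exec sidekiq -c 5 -q default"
  mailers = "bundle exec sidekiq -c 5 -q mailers"
```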

The apps are ifs-web-production and web-staging. They had been running nicely for months. About 2 months ago I upgraded the Postgres instances to handle the new workloads, and it's been smooth sailing ever since.

I thought it could be due to the VMs using shared-cpu-1x, even though my understanding was that each process runs on its own VM. I tried upgrading them to dedicated-cpu-1x with no luck.

Now I'm trying to deploy a version with fewer process types and decreased concurrency, grouping all workers into a single process with 1 thread consuming from all my queues.
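
Roughly what I'm aiming for now (again a sketch, queue names are placeholders):

```toml
[processes]
  web    = "bundle exec puma -C config/puma.rb"
  # A single worker process with concurrency 1, consuming every queue.
  worker = "bundle exec sidekiq -c 1 -q default -q mailers -q critical"
```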

Unfortunately, deployments don't seem to make it through; they get stuck at "Running release task (running)" and I'd like to ask for some help. I've been tailing fly logs but I can't see an error stacktrace, just the following output:

2022-05-25T12:02:59Z app[eb1add2b] lax [info]Starting init (commit: aa54f7d)...
2022-05-25T12:02:59Z app[eb1add2b] lax [info]Preparing to run: `launcher rails db:migrate data:migrate` as heroku
2022-05-25T12:02:59Z app[eb1add2b] lax [info]2022/05/25 12:02:59 listening on [fdaa:0:3197:a7b:85:eb1a:dd2b:2]:22 (DNS: [fdaa::3]:53)
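
For reference, the release command it hangs on is presumably the one configured in fly.toml, something along these lines (matching the `launcher rails db:migrate data:migrate` line in the log):

```toml
[deploy]
  # Runs schema and data migrations in a one-off release VM before the new version goes live.
  release_command = "rails db:migrate data:migrate"
```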

Any help would be appreciated!

Sooo, I'm back in business. Not a fancy or sophisticated solution here, but a short retrospective of sorts in case it helps someone in the future.

What happened

  1. Sidekiq workers froze and pg seemed to leave the connections open.
  2. Some instances started being killed with OOMs. Workers would pick up work, and then freeze.
  3. Rolling deployments would exacerbate the problem, exhausting pg connections: old and new VMs overlap for a while during a rolling deploy, so each Sidekiq process holds its own connection pool and the total number of open connections roughly doubles.

Then I:

  1. Restarted the pg instance.
  2. Redeployed my app, this time without any workers, just the web process (workers commented out in fly.toml, see the sketch below).
  3. Uncommented the worker process types in my fly.toml and brought them back to life.
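
The "workers off" deploy in step 2 was just the process list with the Sidekiq entries commented out, roughly (same caveats as above, names are placeholders):

```toml
[processes]
  web    = "bundle exec puma -C config/puma.rb"
  # Temporarily disabled while recovering; uncommented again in step 3.
  # worker = "bundle exec sidekiq -c 1 -q default -q mailers -q critical"
```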

A few interesting side effects are listed below. They aren't facts, just observations that caught my attention:

  1. It seems fly.io is a bit more protective of memory now, OOM-killing more often (compared to the previous few months, when my app could go above the 2GB limit for a little while).
  2. I was puzzled to see in the dashboards that my app was consuming 900MB out of 2GB and still getting OOM killed.
  3. One of my apps weirdly lost a couple of its secrets during the process.

:wave: :rocket: