Sooo I’m back in business. Not a fancy or sophisticated solution here, just a short retrospective of sorts in case it helps someone in the future.
What happened
- Sidekiq workers froze and pg seemed to leave the connections open.
- Some instances started being killed with OOMs. Workers would pick up work, and then freeze.
- Rolling deployments would exacerbate the problem, exhausting pg connections.
Then I
1. Restarted the pg instance.
2. Redeployed my app, this time without any workers, just the web process.
3. Uncommented the worker process types in my fly.toml and redeployed to bring them back to life.
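For anyone wanting to try the same thing, steps 2 and 3 could look roughly like this in fly.toml. This is only a sketch: the process names and the commands (`web`, `worker`, puma, sidekiq) are placeholders I'm assuming for a typical Rails + Sidekiq app, not my exact config.

```toml
# fly.toml (fragment) -- process names and commands are assumed placeholders
[processes]
  web = "bundle exec puma -C config/puma.rb"
  # Step 2: deploy with the worker process commented out, so only web runs.
  # Step 3: uncomment the line below and deploy again to bring workers back.
  # worker = "bundle exec sidekiq"
```

Deploying web first gives Postgres a chance to settle after the restart before the workers start opening connections again.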
A few interesting side effects listed below. They aren’t facts, just observations that caught my attention:
- It seems that fly.io is now a bit more protective of memory, OOM-killing more often (in the previous few months my app could go above the 2GB limit for a little while).
- I was puzzled to see on the dashboards that my app was using 900MB of its 2GB and still getting OOM killed.
- One of my apps weirdly lost a couple of its secrets during the process.