We just migrated our app’s Postgres database from DigitalOcean to Fly.io yesterday and have suddenly started seeing a large number of DBConnection and other Postgrex-related errors that we never encountered on DigitalOcean (even when the DB was still on DO and the app had already been migrated to Fly).
Nothing special about how we set it up, just followed the instructions in the docs:
▲ fly status -a urbanave-db
App
Name = urbanave-db
Owner = urban-avenue
Version = 2
Status = running
Hostname = urbanave-db.fly.dev
Instances
ID PROCESS VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
8c2183d4 app 2 dfw run running (replica) 3 total, 3 passing 0 2022-06-05T12:17:37Z
4047c06c app 2 dfw run running (leader) 3 total, 2 passing, 1 critical 0 2022-06-05T12:17:27
The database is being used by two Fly apps (a web service and a worker service). There isn’t a lot of usage either (we only have about 100 users at the moment). I’m not sure what the cause is or what I can do to resolve it.
Postgrex.Protocol (#PID<0.2221.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.20198.0> timed out because it queued and checked out the connection for longer than 15000ms
Postgrex.Protocol (#PID<0.2220.0>) disconnected: ** (Postgrex.Error) FATAL 57P01 (admin_shutdown) terminating connection due to administrator command
Note that we haven’t made any changes to the database since yesterday, yet the errors are all recent. It doesn’t look like an issue with the connection pool either.
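For what it’s worth, the 15000ms in the first error matches DBConnection’s default checkout `:timeout` of 15_000 ms. If queries are legitimately slow during failovers, the pool settings can be relaxed. A minimal sketch, assuming a standard Ecto repo — the `:urbanave` app and `Urbanave.Repo` module names here are guesses based on the app name, not taken from the actual codebase:

```elixir
# config/runtime.exs — hypothetical pool tuning; app/repo names are assumed
config :urbanave, Urbanave.Repo,
  url: System.get_env("DATABASE_URL"),
  pool_size: 10,
  # How long a client may hold a checked-out connection.
  # The default is 15_000 ms, which matches the "longer than 15000ms"
  # in the error above.
  timeout: 30_000,
  # How long checkouts may queue before the pool starts dropping requests
  # (DBConnection defaults: queue_target 50 ms, queue_interval 1_000 ms).
  queue_target: 200,
  queue_interval: 2_000
```

Raising these only hides slow queries, though; it won’t fix the underlying disconnects.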
Does this have the same amount of RAM as your DigitalOcean server? Those error messages make it look like something is crashing Postgres and causing a failover. This is most frequently an OOM symptom.
You can run fly vm status <id> against your Postgres instances to see if there have been errors.
This could also be a connection-limit problem. fly checks list will show you why that one instance is failing health checks.
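Concretely, using the app name and instance IDs from the status output above:

```shell
# Inspect a specific Postgres VM for recent errors/restarts
# (instance IDs come from `fly status -a urbanave-db`)
fly vm status 4047c06c -a urbanave-db

# See which health check is failing, and why
fly checks list -a urbanave-db
```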
After scaling both the VM and memory, and observing Sentry for the past 2 days, I can report that most of the errors are still occurring, though the most common ones now are:
Postgrex.Protocol (#PID<0.2221.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.21635.1> timed out because it queued and checked out the connection for longer than 15000ms
Postgrex.Protocol (#PID<0.2212.0>) disconnected: ** (Postgrex.Error) FATAL 57P01 (admin_shutdown) terminating connection due to administrator command
Again, note that we never got any of these errors on our previous database instance. The first one is especially curious: why would the database connection time out now?
For the second one, a quick Google search reveals two possible causes: stopping the instance (which we haven’t done) and inserting records into replicas (maybe because the DB is scaled to 2 instances?).
@sheharyar Did you ever figure out this problem? I’ve been experiencing some of the same problems, notably terminating connection due to administrator command and cannot execute UPDATE in a read-only transaction, when I shouldn’t be connecting to the read-only replica.
I scaled down the database to one instance (i.e. removed replicas) while keeping the CPU/RAM the same (dedicated-cpu-1x / 2GB) about 5 days ago, and that seems to have fixed most of the errors.
I no longer see the admin_shutdown errors but still see the timed out errors (although with much less frequency).
Thanks for the info. I think I’m going to have to scale down to one instance as well. Some of my non-multi-region apps keep connecting to the replica even though I want them to connect to the leader.
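If the apps should always hit the leader, the connection port matters. As I understand the stolon-based Fly Postgres setup, port 5432 is proxied to the current leader, while port 5433 connects directly to the local instance, which may be a read-only replica. A sketch of what the connection string should look like — hostname from the status output above, credentials and database name are placeholders:

```elixir
# Writes should go through port 5432, which routes to the leader
# (user, password, and database name below are placeholders):
# DATABASE_URL="postgres://user:password@urbanave-db.internal:5432/urbanave"
#
# Port 5433 connects to whichever instance is local — possibly a replica,
# which would explain "cannot execute UPDATE in a read-only transaction".

# config/runtime.exs — app/repo names are assumed for illustration
config :urbanave, Urbanave.Repo,
  url: System.get_env("DATABASE_URL")
```

Worth double-checking which port the failing apps’ DATABASE_URL actually uses before scaling replicas away.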