Postgres DBConnection and other errors on Fly

We just migrated our app’s Postgres database from DigitalOcean to Fly.io yesterday and have suddenly started seeing a large number of DBConnection and other Postgrex-related errors that we never encountered on DigitalOcean (even when the DB was still on DO and the app had already been migrated to Fly).


Nothing special about how we set it up, just followed the instructions in the docs:

▲ fly status -a urbanave-db
App
  Name     = urbanave-db
  Owner    = urban-avenue
  Version  = 2
  Status   = running
  Hostname = urbanave-db.fly.dev

Instances
ID      	PROCESS	VERSION	REGION	DESIRED	STATUS           	HEALTH CHECKS                 	RESTARTS	CREATED
8c2183d4	app    	2      	dfw   	run    	running (replica)	3 total, 3 passing            	0       	2022-06-05T12:17:37Z
4047c06c	app    	2      	dfw   	run    	running (leader) 	3 total, 2 passing, 1 critical	0       	2022-06-05T12:17:27

The database is being used by two Fly apps (a web service and a worker service). There isn’t a lot of usage either (we only have about 100 users at the moment). I’m not sure what the cause is or what I can do to resolve it.

Any thoughts?

Error Messages

Here’s a summary of the various error messages:

(DBConnection.ConnectionError tcp recv (idle): closed)
(DBConnection.ConnectionError tcp recv: closed)
(Postgrex.Error FATAL 57P03 (cannot_connect_now) the database system is in recovery mode)
(Postgrex.Error ERROR 25006 (read_only_sql_transaction) cannot execute UPDATE in a read-only transaction)
Postgrex.Protocol (#PID<0.2212.0>) disconnected: ** (DBConnection.ConnectionError) tcp recv (idle): closed
Postgrex.Protocol (#PID<0.2221.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.20198.0> timed out because it queued and checked out the connection for longer than 15000ms
Postgrex.Protocol (#PID<0.2220.0>) disconnected: ** (Postgrex.Error) FATAL 57P01 (admin_shutdown) terminating connection due to administrator command

Note that we haven’t made any changes to the database since yesterday, yet the errors are all recent. It doesn’t look like an issue with the connection pool either.
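
For context, the repo config is nothing exotic either, basically the stock Ecto setup from the Fly Elixir docs, roughly along these lines (the module name and pool size here are placeholders, not our exact values):

config :my_app, MyApp.Repo,
  url: System.get_env("DATABASE_URL"),
  # Fly private networking (.internal addresses) is IPv6-only
  socket_options: [:inet6],
  pool_size: 10
  # DBConnection defaults left untouched:
  # timeout: 15_000, queue_target: 50, queue_interval: 1_000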

Does this have the same amount of RAM as your DigitalOcean server? Those error messages make it look like something is crashing postgres and making it fail over. This is most frequently an OOM symptom.

You can run fly vm status <id> against your postgres instances to see if there have been errors.
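
For example, against the leader instance from your status output above:

▲ fly vm status 4047c06c -a urbanave-db

The recent events in that output should show restarts or OOM kills if postgres is getting killed.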

This could also be a connection limit problem. fly checks list will show you why that one instance is failing health checks.
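
You can also sanity-check usage against the server limit from the app side, e.g. from an IEx session (MyApp.Repo here stands in for whatever your Ecto repo module is called):

MyApp.Repo.query!(
  "SELECT count(*) AS used, current_setting('max_connections') AS max_conn FROM pg_stat_activity"
)

If used is anywhere near max_conn across both apps, that points at the pools rather than the VM.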

I’ll increase memory and check again, thanks. Here’s the checks list output btw:

▲ fly checks list
Health Checks for urbanave-db
  NAME | STATUS   | ALLOCATION | REGION | TYPE | LAST UPDATED | OUTPUT
-------*----------*------------*--------*------*--------------*----------------------------------------------------------------------------
  pg   | passing  | 8c2183d4   | dfw    | HTTP | 4m24s ago    | HTTP GET http://172.19.11.162:5500/flycheck/pg: 200 OK
       |          |            |        |      |              | [✓] transactions: readonly (688.96µs)
       |          |            |        |      |              | [✓] replication: syncing from fdaa:0:544a:a7b:12de:1:1639:2 (151.81µs)
       |          |            |        |      |              | [✓] connections: 7 used, 3 reserved, 300 max (5.43ms)
-------*----------*------------*--------*------*--------------*----------------------------------------------------------------------------
  vm   | passing  | 8c2183d4   | dfw    | HTTP | 6m18s ago    | HTTP GET http://172.19.11.162:5500/flycheck/vm: 200 OK
       |          |            |        |      |              | [✓] checkDisk: 8.91 GB (91.1%) free space on /data/ (38.17µs)
       |          |            |        |      |              | [✓] checkLoad: load averages: 0.07 0.05 0.02 (62.87µs)
       |          |            |        |      |              | [✓] memory: system spent 0s of the last 60s waiting on memory (31.71µs)
       |          |            |        |      |              | [✓] cpu: system spent 1.48s of the last 60s waiting on cpu (21.28µs)
       |          |            |        |      |              | [✓] io: system spent 12ms of the last 60s waiting on io (19.54µs)
-------*----------*------------*--------*------*--------------*----------------------------------------------------------------------------
  role | passing  | 8c2183d4   | dfw    | HTTP | 3h15m ago    | replica
-------*----------*------------*--------*------*--------------*----------------------------------------------------------------------------
  vm   | critical | 4047c06c   | dfw    | HTTP | 4m55s ago    | HTTP GET http://172.19.7.218:5500/flycheck/vm: 500 Internal Server Error
       |          |            |        |      |              | [✓] checkDisk: 8.86 GB (90.6%!)(MISSING) free space on /data/ (155.38µs)
       |          |            |        |      |              | [✓] checkLoad: load averages: 0.47 0.50 0.38 (168.56µs)
       |          |            |        |      |              | [✗] memory: system spent 1.1s of the last 10 seconds waiting on memory (48.43µs)
       |          |            |        |      |              | [✗] cpu: system spent 1.26s of the last 10 seconds waiting on cpu (41.3µs)
       |          |            |        |      |              | [✓] io: system spent 3.84s of the last 60s waiting on io (39.73µs)
-------*----------*------------*--------*------*--------------*----------------------------------------------------------------------------
  pg   | passing  | 4047c06c   | dfw    | HTTP | 6m13s ago    | HTTP GET http://172.19.7.218:5500/flycheck/pg: 200 OK
       |          |            |        |      |              | [✓] transactions: read/write (2.16ms)
       |          |            |        |      |              | [✓] replicationLag: fdaa:0:544a:a7b:1a:1:1638:2 is lagging 0s (100ns)
       |          |            |        |      |              | [✓] connections: 51 used, 3 reserved, 300 max (9.26ms)
-------*----------*------------*--------*------*--------------*----------------------------------------------------------------------------
  role | passing  | 4047c06c   | dfw    | HTTP | 1h19m ago    | leader

After scaling up both the VM (CPU) and its memory, and watching Sentry for the past two days, I can report that almost all of the errors are still occurring; the most common ones now are:

Postgrex.Protocol (#PID<0.2221.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.21635.1> timed out because it queued and checked out the connection for longer than 15000ms
Postgrex.Protocol (#PID<0.2212.0>) disconnected: ** (Postgrex.Error) FATAL 57P01 (admin_shutdown) terminating connection due to administrator command

Again, note that we did not get any of these errors on our previous database instance, so the first one is especially curious: why would the database connection time out now?
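
That 15000ms is just the default DBConnection checkout :timeout, so one thing we might try is giving queries more headroom while we keep investigating. A sketch (the numbers are guesses, not a recommendation):

config :my_app, MyApp.Repo,
  timeout: 30_000,      # let a query hold its connection longer before timing out
  queue_target: 200,    # tolerate a slower checkout queue before dropping requests
  queue_interval: 2_000

That would only paper over whatever is actually making the queries slow, though.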

For the second one, a quick Google search reveals two possible causes: stopping the instance (which we haven’t done) and inserting records into replicas (maybe because the DB is scaled to 2 instances?).
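
If it is the replica angle, one way to confirm which node a given connection is hitting would be to ask Postgres directly from an IEx session on the app (the repo module name is a placeholder):

MyApp.Repo.query!("SELECT pg_is_in_recovery()")
# rows: [[true]] would mean that connection is on a read-only replica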

@sheharyar Did you ever figure out this problem? I’ve been experiencing some of the same problems, notably the “terminating connection due to administrator command” and “cannot execute UPDATE in a read-only transaction” errors, when I shouldn’t be connecting to the read-only replica.

I scaled the database down to one instance (i.e. removed the replicas) about 5 days ago, while keeping the CPU/RAM the same (dedicated-cpu-1x / 2GB), and that seems to have fixed most of the errors.

I no longer see the admin_shutdown errors but still see the timed out errors (although with much less frequency).
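
For anyone else trying this, the scale-down itself was roughly just the command below (plus cleaning up the detached replica volume afterwards); double-check the current docs for your setup before running it:

▲ fly scale count 1 -a urbanave-db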

Thanks for the info. I think I’m going to have to scale down to one instance as well. Some of my non-multi-region apps keep connecting to the replica even though I want them to connect to the leader.

I get similar errors even with one instance from time to time:

(DBConnection.ConnectionError connection not available and request was dropped from queue after 1486ms)

Same here, a few db connection errors during a 5-10min window usually.

When I run fly checks list, the error seems to be related to IO:
“system spent 9.08s of the last 10 seconds waiting on io (28.35µs)”

Here is the full result of fly checks list:

NAME | STATUS   | MACHINE        | LAST UPDATED | OUTPUT
-------*----------*----------------*--------------*----------------------------------------------------------------------------
  pg   | passing  | 148e394a45dd89 | 3m2s ago     | [✓] transactions: read/write (300.58µs)
       |          |                |              | [✓] connections: 21 used, 3 reserved, 300 max (4.82ms)
-------*----------*----------------*--------------*----------------------------------------------------------------------------
  role | passing  | 148e394a45dd89 | 3m0s ago     | leader
-------*----------*----------------*--------------*----------------------------------------------------------------------------
  vm   | critical | 148e394a45dd89 | 9m4s ago     | 500 Internal Server Error
       |          |                |              | [✓] checkDisk: 782.25 MB (80.2%!)(MISSING) free space on /data/ (47.15µs)
       |          |                |              | [✓] checkLoad: load averages: 0.39 0.47 1.07 (67.89µs)
       |          |                |              | [✓] memory: system spent 0s of the last 60s waiting on memory (37.76µs)
       |          |                |              | [✓] cpu: system spent 750ms of the last 60s waiting on cpu (34.25µs)
       |          |                |              | [✗] io: system spent 9.08s of the last 10 seconds waiting on io (28.35µs)
-------*----------*----------------*--------------*----------------------------------------------------------------------------