A few hours ago my database app on Fly started failing again, but this time with a different error.
Here’s the output of fly status:
App
Name = urbanave-db
Owner = urban-avenue
Version = 10
Status = running
Hostname = urbanave-db.fly.dev
Instances
ID PROCESS VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
308b15e8 app 10 dfw run running (readonly) 3 total, 1 passing, 2 critical 0 2022-06-30T21:04:47Z
I tried restarting and scaling down and back up, as mentioned in my previous post, but that did not work this time; the VM status is always running (readonly) with 2 critical health checks.
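For reference, the restart and scale-down/up attempt was roughly the following flyctl commands (the instance ID and scale counts here are illustrative, not necessarily the exact values I used):

# restart the instance listed by fly status
fly vm restart 308b15e8 -a urbanave-db

# scale the app down and back up
fly scale count 0 -a urbanave-db
fly scale count 1 -a urbanave-db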
Here’s the output of fly vm status <instance> after restarting:
▲ fly vm status 2751e7ee -a urbanave-db
Instance
ID = 2751e7ee
Process =
Version = 15
Region = dfw
Desired = run
Status = running (replica)
Health Checks = 3 total, 1 passing, 2 critical
Restarts = 0
Created = 2m43s ago
Recent Events
TIMESTAMP TYPE MESSAGE
2022-07-18T21:37:36Z Received Task received by client
2022-07-18T21:37:36Z Task Setup Building Task Directory
2022-07-18T21:37:40Z Started Task started by client
Checks
ID SERVICE STATE OUTPUT
vm app critical HTTP GET http://172.19.20.130:5500/flycheck/vm: 500 Internal Server Error Output: "[✓] checkDisk: 7.95 GB (81.3%!)(MISSING) free space on /data/ (49.24µs)\n[✓] checkLoad: load averages: 0.02 0.07 0.28 (135.87µs)\n[✓] memory: system spent 0s of the last 60s waiting on memory (165.06µs)\n[✗] cpu: system spent 1.54s of the last 10 seconds waiting on cpu (74.61µs)\n[✓] io: system spent 1.08s of the last 60s waiting on io (29.47µs)"
role app passing replica
pg app critical HTTP GET http://172.19.20.130:5500/flycheck/pg: 500 Internal Server Error Output: "failed to connect to proxy: context deadline exceeded"
Recent Logs
2022-07-18T21:37:46.000 [info] proxy | [WARNING] 198/213746 (566) : Backup Server bk_db/pg is DOWN, reason: Layer7 timeout, check duration: 5001ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2022-07-18T21:37:47.000 [info] checking stolon status
2022-07-18T21:37:47.000 [info] proxy | [WARNING] 198/213747 (566) : Server bk_db/pg2 is DOWN, reason: Layer7 timeout, check duration: 5000ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2022-07-18T21:37:47.000 [info] keeper | 2022-07-18T21:37:47.444Z ERROR cmd/keeper.go:719 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2022-07-18T21:37:48.000 [info] checking stolon status
2022-07-18T21:37:48.000 [info] keeper is healthy, db is healthy, role: standby
2022-07-18T21:37:49.000 [info] proxy | [WARNING] 198/213749 (566) : Server bk_db/pg1 is DOWN, reason: Layer7 timeout, check duration: 5001ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2022-07-18T21:37:49.000 [info] proxy | [NOTICE] 198/213749 (566) : haproxy version is 2.2.9-2+deb11u3
2022-07-18T21:37:49.000 [info] proxy | [NOTICE] 198/213749 (566) : path to executable is /usr/sbin/haproxy
2022-07-18T21:37:49.000 [info] proxy | [ALERT] 198/213749 (566) : backend 'bk_db' has no server available!
2022-07-18T21:37:49.000 [info] keeper | 2022-07-18T21:37:49.944Z ERROR cmd/keeper.go:719 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2022-07-18T21:37:52.000 [info] keeper | 2022-07-18T21:37:52.448Z ERROR cmd/keeper.go:719 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2022-07-18T21:37:52.000 [info] keeper | 2022-07-18 21:37:52.753 UTC [619] LOG: starting PostgreSQL 14.4 (Debian 14.4-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2022-07-18T21:37:52.000 [info] keeper | 2022-07-18 21:37:52.754 UTC [619] LOG: listening on IPv6 address "fdaa:0:544a:a7b:1a:1:1638:2", port 5433
2022-07-18T21:37:52.000 [info] keeper | 2022-07-18 21:37:52.754 UTC [619] LOG: listening on Unix socket "/tmp/.s.PGSQL.5433"
2022-07-18T21:37:52.000 [info] keeper | 2022-07-18 21:37:52.757 UTC [620] LOG: database system was shut down in recovery at 2022-06-30 21:03:52 UTC
2022-07-18T21:37:52.000 [info] keeper | 2022-07-18 21:37:52.758 UTC [620] LOG: entering standby mode
2022-07-18T21:37:52.000 [info] keeper | 2022-07-18 21:37:52.761 UTC [620] LOG: redo starts at 1/A1254378
2022-07-18T21:37:52.000 [info] keeper | 2022-07-18 21:37:52.827 UTC [620] LOG: consistent recovery state reached at 1/A1947F58
2022-07-18T21:37:52.000 [info] keeper | 2022-07-18 21:37:52.828 UTC [619] LOG: database system is ready to accept read-only connections
2022-07-18T21:38:02.000 [info] exporter | INFO[0022] Established new database connection to "fdaa:0:544a:a7b:1a:1:1638:2:5433". source="postgres_exporter.go:970"
2022-07-18T21:38:02.000 [info] exporter | INFO[0022] Semantic Version Changed on "fdaa:0:544a:a7b:1a:1:1638:2:5433": 0.0.0 -> 14.4.0 source="postgres_exporter.go:1539"
2022-07-18T21:38:02.000 [info] exporter | INFO[0022] Established new database connection to "fdaa:0:544a:a7b:1a:1:1638:2:5433". source="postgres_exporter.go:970"
2022-07-18T21:38:02.000 [info] exporter | INFO[0022] Semantic Version Changed on "fdaa:0:544a:a7b:1a:1:1638:2:5433": 0.0.0 -> 14.4.0 source="postgres_exporter.go:1539"
2022-07-18T21:38:10.000 [info] keeper | 2022-07-18 21:38:10.571 UTC [745] LOG: started streaming WAL from primary at 2/5A000000 on timeline 29