Unable to restart postgres

My Postgres database suddenly stopped working, and I can’t restart it no matter what I do.

h1bjobs/db  $ fly status
ID              STATE   ROLE    REGION  CHECKS                  IMAGE                           CREATED                 UPDATED
287444da0d7738  started error   ewr     3 total, 3 critical     flyio/postgres:14 (v0.0.41)     2024-01-16T22:51:24Z    2024-01-21T00:37:36Z

Restarting does not work:

h1bjobs/db  $ flyctl machine restart 287444da0d7738
Restarting machine 287444da0d7738
  Waiting for 287444da0d7738 to become healthy (started, 1/3)
Error: failed to restart machine 287444da0d7738: failed to wait for health checks to pass: context deadline exceeded

Hi… The following should tell you more:

fly checks list -a db-app-name
fly logs        -a db-app-name

This is what I see:

fly checks list

h1bjobs/db  $ fly checks list -a h1bjobs-db
Health Checks for h1bjobs-db
  NAME | STATUS   | MACHINE        | LAST UPDATED | OUTPUT
-------*----------*----------------*--------------*--------------------------------------------------------------------------
  pg   | critical | 287444da0d7738 | 2h0m ago     | 500 Internal Server Error
       |          |                |              | failed to connect to proxy: context deadline exceeded
-------*----------*----------------*--------------*--------------------------------------------------------------------------
  role | critical | 287444da0d7738 | 1h29m ago    | 500 Internal Server Error
       |          |                |              | failed to connect to local node: context deadline exceeded
-------*----------*----------------*--------------*--------------------------------------------------------------------------
  vm   | critical | 287444da0d7738 | 2h0m ago     | 500 Internal Server Error
       |          |                |              | [✗] checkDisk: 53.82 MB (5.5%) free space on /data/ (36.3µs)
       |          |                |              | [✓] checkLoad: load averages: 0.00 0.00 0.00 (82.1µs)
       |          |                |              | [✓] memory: system spent 0s of the last 60s waiting on memory (38.99µs)
       |          |                |              | [✓] cpu: system spent 360ms of the last 60s waiting on cpu (17.74µs)
       |          |                |              | [✓] io: system spent 360ms of the last 60s waiting on io (48.57µs)
-------*----------*----------------*--------------*--------------------------------------------------------------------------

fly logs

2024-01-21T02:16:40Z app[287444da0d7738] ewr [info]exporter | ERRO[0024] Error opening connection to database (postgresql://flypgadmin:PASSWORD_REMOVED@[fdaa:5:6ce7:a7b:ce:fd17:60d7:2]:5433/postgres?sslmode=disable): dial tcp [fdaa:5:6ce7:a7b:ce:fd17:60d7:2]:5433: connect: connection refused  source="postgres_exporter.go:1658"
2024-01-21T02:16:41Z app[287444da0d7738] ewr [info]keeper   | 2024-01-21T02:16:41.956Z  ERROR   cmd/keeper.go:719       cannot get configured pg parameters     {"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2024-01-21T02:16:42Z app[287444da0d7738] ewr [info]sentinel | 2024-01-21T02:16:42.575Z  ERROR   cmd/sentinel.go:1018    no eligible masters
2024-01-21T02:16:42Z app[287444da0d7738] ewr [info]keeper   | 2024-01-21 02:16:42.800 UTC [439] LOG:  starting PostgreSQL 14.6 (Debian 14.6-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2024-01-21T02:16:42Z app[287444da0d7738] ewr [info]keeper   | 2024-01-21 02:16:42.801 UTC [439] LOG:  listening on IPv6 address "fdaa:5:6ce7:a7b:ce:fd17:60d7:2", port 5433
2024-01-21T02:16:42Z app[287444da0d7738] ewr [info]keeper   | 2024-01-21 02:16:42.801 UTC [439] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5433"
2024-01-21T02:16:42Z app[287444da0d7738] ewr [info]keeper   | 2024-01-21 02:16:42.803 UTC [440] LOG:  database system shutdown was interrupted; last known up at 2024-01-21 02:16:37 UTC
2024-01-21T02:16:42Z app[287444da0d7738] ewr [info]keeper   | 2024-01-21 02:16:42.874 UTC [440] LOG:  database system was not properly shut down; automatic recovery in progress
2024-01-21T02:16:42Z app[287444da0d7738] ewr [info]keeper   | 2024-01-21 02:16:42.876 UTC [440] LOG:  redo starts at 15/D7000028
2024-01-21T02:16:42Z app[287444da0d7738] ewr [info]keeper   | 2024-01-21 02:16:42.876 UTC [440] LOG:  redo done at 15/D7000110 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2024-01-21T02:16:42Z app[287444da0d7738] ewr [info]keeper   | 2024-01-21 02:16:42.884 UTC [440] PANIC:  could not write to file "pg_wal/xlogtemp.440": No space left on device
2024-01-21T02:16:42Z app[287444da0d7738] ewr [info]keeper   | 2024-01-21 02:16:42.885 UTC [439] LOG:  startup process (PID 440) was terminated by signal 6: Aborted
2024-01-21T02:16:42Z app[287444da0d7738] ewr [info]keeper   | 2024-01-21 02:16:42.885 UTC [439] LOG:  aborting startup due to startup process failure
2024-01-21T02:16:42Z app[287444da0d7738] ewr [info]keeper   | 2024-01-21 02:16:42.887 UTC [439] LOG:  database system is shut down
2024-01-21T02:16:42Z app[287444da0d7738] ewr [info]error connecting to local postgres context deadline exceeded
2024-01-21T02:16:42Z app[287444da0d7738] ewr [info]checking stolon status
2024-01-21T02:16:42Z app[287444da0d7738] ewr [info]keeper   | 2024-01-21T02:16:42.987Z  ERROR   cmd/keeper.go:1526      failed to start postgres        {"error": "postgres exited unexpectedly"}
2024-01-21T02:16:43Z app[287444da0d7738] ewr [info]checking stolon status
2024-01-21T02:16:44Z app[287444da0d7738] ewr [info]keeper   | 2024-01-21T02:16:44.457Z  ERROR   cmd/keeper.go:719       cannot get configured pg parameters     {"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2024-01-21T02:16:44Z app[287444da0d7738] ewr [info]checking stolon status
2024-01-21T02:16:45Z app[287444da0d7738] ewr [info]checking stolon status
2024-01-21T02:16:46Z app[287444da0d7738] ewr [info]checking stolon status
2024-01-21T02:16:46Z app[287444da0d7738] ewr [info]keeper   | 2024-01-21T02:16:46.958Z  ERROR   cmd/keeper.go:719       cannot get configured pg parameters     {"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
error.message="could not proxy TCP data to/from instance: failed to copy (direction=client->server, error=Transport endpoint is not connected (os error 107))" 2024-01-21T02:16:47Z proxy[287444da0d7738] ewr [error]
2024-01-21T02:16:47Z app[287444da0d7738] ewr [info]checking stolon status
2024-01-21T02:16:47Z app[287444da0d7738] ewr [info]sentinel | 2024-01-21T02:16:47.663Z  ERROR   cmd/sentinel.go:1018    no eligible masters
2024-01-21T02:16:48Z app[287444da0d7738] ewr [info]keeper   | 2024-01-21 02:16:48.049 UTC [472] LOG:  starting PostgreSQL 14.6 (Debian 14.6-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2024-01-21T02:16:48Z app[287444da0d7738] ewr [info]keeper   | 2024-01-21 02:16:48.049 UTC [472] LOG:  listening on IPv6 address "fdaa:5:6ce7:a7b:ce:fd17:60d7:2", port 5433
2024-01-21T02:16:48Z app[287444da0d7738] ewr [info]keeper   | 2024-01-21 02:16:48.050 UTC [472] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5433"
2024-01-21T02:16:48Z app[287444da0d7738] ewr [info]keeper   | 2024-01-21 02:16:48.052 UTC [473] LOG:  database system shutdown was interrupted; last known up at 2024-01-21 02:16:42 UTC
2024-01-21T02:16:48Z app[287444da0d7738] ewr [info]keeper   | 2024-01-21 02:16:48.111 UTC [473] LOG:  database system was not properly shut down; automatic recovery in progress
2024-01-21T02:16:48Z app[287444da0d7738] ewr [info]keeper   | 2024-01-21 02:16:48.112 UTC [473] LOG:  redo starts at 15/D7000028
2024-01-21T02:16:48Z app[287444da0d7738] ewr [info]keeper   | 2024-01-21 02:16:48.112 UTC [473] LOG:  redo done at 15/D7000110 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2024-01-21T02:16:48Z app[287444da0d7738] ewr [info]keeper   | 2024-01-21 02:16:48.119 UTC [473] PANIC:  could not write to file "pg_wal/xlogtemp.473": No space left on device
2024-01-21T02:16:48Z app[287444da0d7738] ewr [info]keeper   | 2024-01-21 02:16:48.119 UTC [472] LOG:  startup process (PID 473) was terminated by signal 6: Aborted
2024-01-21T02:16:48Z app[287444da0d7738] ewr [info]keeper   | 2024-01-21 02:16:48.119 UTC [472] LOG:  aborting startup due to startup process failure
2024-01-21T02:16:48Z app[287444da0d7738] ewr [info]keeper   | 2024-01-21 02:16:48.121 UTC [472] LOG:  database system is shut down
2024-01-21T02:16:48Z app[287444da0d7738] ewr [info]keeper   | 2024-01-21T02:16:48.234Z  ERROR   cmd/keeper.go:1526      failed to start postgres        {"error": "postgres exited unexpectedly"}
2024-01-21T02:16:48Z app[287444da0d7738] ewr [info]checking stolon status
error.message="could not proxy TCP data to/from instance: failed to copy (direction=client->server, error=Transport endpoint is not connected (os error 107))" 2024-01-21T02:16:49Z proxy[287444da0d7738] ewr [error]
2024-01-21T02:16:49Z app[287444da0d7738] ewr [info]keeper   | 2024-01-21T02:16:49.459Z  ERROR   cmd/keeper.go:719       cannot get configured pg parameters     {"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
error.message="could not proxy TCP data to/from instance: failed to copy (direction=client->server, error=Transport endpoint is not connected (os error 107))" 2024-01-21T02:16:49Z proxy[287444da0d7738] ewr [error]
2024-01-21T02:16:49Z app[287444da0d7738] ewr [info]checking stolon status
2024-01-21T02:16:50Z app[287444da0d7738] ewr [info]checking stolon status
error.message="could not proxy TCP data to/from instance: failed to copy (direction=client->server, error=Transport endpoint is not connected (os error 107))" 2024-01-21T02:16:51Z proxy[287444da0d7738] ewr [error]
2024-01-21T02:16:59Z app[287444da0d7738] ewr [info]keeper   | 2024-01-21T02:16:59.463Z  ERROR   cmd/keeper.go:719       cannot get configured pg parameters     {"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2024-01-21T02:16:59Z app[287444da0d7738] ewr [info]checking stolon status
2024-01-21T02:17:00Z proxy[287444da0d7738] ewr [error]could not proxy TCP data to/from instance: failed to copy (direction=client->server, error=Transport endpoint is not connected (os error 107))
2024-01-21T02:17:00Z proxy[287444da0d7738] ewr [error]could not proxy TCP data to/from instance: failed to copy (direction=client->server, error=Transport endpoint is not connected (os error 107))
2024-01-21T02:17:00Z app[287444da0d7738] ewr [info]checking stolon status

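This is a common cause of sudden Postgres failures: you’ve exhausted the 1GB volume. The vm check reports only 53.82 MB free on /data/, and the logs show Postgres aborting startup with “No space left on device” while writing to pg_wal.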

Has this only been running for 4 days?
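
As a rough sanity check on the 1GB figure: the checkDisk output gives both the free space (53.82 MB) and the free percentage (5.5%), so the total size can be estimated by dividing one by the other. The sketch below does that in shell. The function name `estimate_total_mb` is made up for illustration, and because the reported percentage is rounded, the result is only approximate.

```shell
#!/bin/sh
# Values copied from the checkDisk line in the health check output.
FREE_MB=53.82
FREE_PCT=5.5

# total = free / (pct / 100); awk does the floating-point math
# and rounds to the nearest whole MB.
estimate_total_mb() {
  awk -v free="$1" -v pct="$2" 'BEGIN { printf "%.0f\n", free / (pct / 100) }'
}

TOTAL_MB=$(estimate_total_mb "$FREE_MB" "$FREE_PCT")
echo "estimated volume size: ${TOTAL_MB} MB"
```

The estimate comes out just under 1000 MB, which is what you would expect from a 1GB volume, presumably with a little space lost to filesystem overhead.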


Sometimes WAL files fill up the device because of a configuration problem, for example a replication slot or WAL archiving setup that keeps old segments from being recycled.
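
Whatever filled the disk, the PANIC in the logs shows that Postgres cannot finish recovery until it has room to write to pg_wal, so the usual first step is to give the volume more space and then look at what is using it. Here is a minimal sketch of that sequence. Several things in it are assumptions rather than anything from this thread: the volume ID is a placeholder (take the real one from `fly volumes list`), the new size of 3 GB is arbitrary, and `/data/postgres/pg_wal` is a guess at where this image keeps its WAL directory. By default the script only prints the commands; set `DRY_RUN=0` to execute them.

```shell
#!/bin/sh
APP="h1bjobs-db"                        # app name from the thread
MACHINE_ID="287444da0d7738"             # machine ID from the thread
VOLUME_ID="${VOLUME_ID:-vol_REPLACE_ME}" # placeholder, not a real volume ID
NEW_SIZE_GB="${NEW_SIZE_GB:-3}"          # arbitrary example size

# Print each command instead of running it unless DRY_RUN=0.
# Note: the printed form does not preserve quoting.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# 1. Find the volume attached to the Postgres app.
run fly volumes list -a "$APP"

# 2. Grow the volume (size is in GB).
run fly volumes extend "$VOLUME_ID" -s "$NEW_SIZE_GB" -a "$APP"

# 3. Restart the machine so Postgres can retry recovery.
run fly machine restart "$MACHINE_ID" -a "$APP"

# 4. Check how much of the disk is WAL (path is an assumption).
run fly ssh console -a "$APP" -C "du -sh /data/postgres/pg_wal"
```

If pg_wal turns out to account for most of the usage, that points toward the configuration problem described above rather than ordinary data growth.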
