Database is down (and our APM don't like that)

Hi,

today all of a sudden our staging db went down:



022-10-25T16:20:40.690 app[7067fe0c] fra [info] exporter | Stopping interrupt...
2022-10-25T16:20:40.690 app[7067fe0c] fra [info] keeper | Stopping interrupt...
2022-10-25T16:20:40.690 app[7067fe0c] fra [info] sentinel | Stopping interrupt...
2022-10-25T16:20:40.690 app[7067fe0c] fra [info] proxy | Stopping interrupt...
2022-10-25T16:20:40.690 app[7067fe0c] fra [info] exporter | signal: interrupt
2022-10-25T16:20:40.690 app[7067fe0c] fra [info] keeper | 2022-10-25 16:20:40.688 UTC [602] LOG: received fast shutdown request
2022-10-25T16:20:40.691 app[7067fe0c] fra [info] proxy | [NOTICE] 297/162040 (545) : haproxy version is 2.2.9-2+deb11u3
2022-10-25T16:20:40.691 app[7067fe0c] fra [info] proxy | [NOTICE] 297/162040 (545) : path to executable is /usr/sbin/haproxy
2022-10-25T16:20:40.691 app[7067fe0c] fra [info] proxy | [ALERT] 297/162040 (545) : Current worker #1 (569) exited with code 130 (Interrupt)
2022-10-25T16:20:40.691 app[7067fe0c] fra [info] proxy | [WARNING] 297/162040 (545) : All workers exited. Exiting... (130)
2022-10-25T16:20:40.693 app[7067fe0c] fra [info] sentinel | 2022-10-25T16:20:40.691Z INFO cmd/sentinel.go:1816 stopping stolon sentinel
2022-10-25T16:20:40.699 app[7067fe0c] fra [info] sentinel | Process exited 0
2022-10-25T16:20:40.700 app[7067fe0c] fra [info] keeper | 2022-10-25T16:20:40.694Z INFO postgresql/postgresql.go:384 stopping database
2022-10-25T16:20:40.701 app[7067fe0c] fra [info] keeper | waiting for server to shut down....2022-10-25 16:20:40.701 UTC [602] LOG: aborting any active transactions
2022-10-25T16:20:40.705 app[7067fe0c] fra [info] keeper | 2022-10-25 16:20:40.705 UTC [602] LOG: background worker "logical replication launcher" (PID 609) exited with exit code 1
2022-10-25T16:20:40.706 app[7067fe0c] fra [info] keeper | 2022-10-25 16:20:40.705 UTC [604] LOG: shutting down
2022-10-25T16:20:40.706 app[7067fe0c] fra [info] proxy | exit status 130
2022-10-25T16:20:40.717 app[7067fe0c] fra [info] keeper | 2022-10-25 16:20:40.717 UTC [602] LOG: database system is shut down
2022-10-25T16:20:40.801 app[7067fe0c] fra [info] keeper | done
2022-10-25T16:20:40.801 app[7067fe0c] fra [info] keeper | server stopped
2022-10-25T16:20:40.829 app[7067fe0c] fra [info] keeper | Process exited 0
2022-10-25T16:20:41.684 app[7067fe0c] fra [info] Starting clean up.
2022-10-25T16:20:41.699 app[7067fe0c] fra [info] Umounting /dev/vdc from /data
2022-10-25T16:20:47.716 runner[7067fe0c] fra [info] Starting instance
2022-10-25T16:20:50.640 runner[7067fe0c] fra [info] Configuring virtual machine
2022-10-25T16:20:50.642 runner[7067fe0c] fra [info] Pulling container image
2022-10-25T16:20:57.553 runner[7067fe0c] fra [info] Unpacking image
2022-10-25T16:20:57.567 runner[7067fe0c] fra [info] Preparing kernel init
2022-10-25T16:20:57.734 runner[7067fe0c] fra [info] Setting up volume 'pg_data'
2022-10-25T16:20:57.872 runner[7067fe0c] fra [info] Configuring firecracker
2022-10-25T16:20:57.993 runner[7067fe0c] fra [info] Starting virtual machine
2022-10-25T16:20:58.214 app[7067fe0c] fra [info] Starting init (commit: 249766e)...
2022-10-25T16:20:58.235 app[7067fe0c] fra [info] Mounting /dev/vdc at /data w/ uid: 0, gid: 0 and chmod 0755
2022-10-25T16:20:58.250 app[7067fe0c] fra [info] Preparing to run: `docker-entrypoint.sh start` as root
2022-10-25T16:20:58.277 app[7067fe0c] fra [info] 2022/10/25 16:20:58 listening on [fdaa:0:6924:a7b:23c4:1:855:2]:22 (DNS: [fdaa::3]:53)
2022-10-25T16:20:58.386 app[7067fe0c] fra [info] cluster spec filename /fly/cluster-spec.json
2022-10-25T16:20:58.387 app[7067fe0c] fra [info] system memory: 256mb vcpu count: 1
2022-10-25T16:20:58.387 app[7067fe0c] fra [info] {
2022-10-25T16:20:58.387 app[7067fe0c] fra [info] "initMode": "existing",
2022-10-25T16:20:58.387 app[7067fe0c] fra [info] "existingConfig": {
2022-10-25T16:20:58.387 app[7067fe0c] fra [info] "keeperUID": "23c418552"
2022-10-25T16:20:58.387 app[7067fe0c] fra [info] },
2022-10-25T16:20:58.387 app[7067fe0c] fra [info] "pgParameters": {
2022-10-25T16:20:58.387 app[7067fe0c] fra [info] "effective_cache_size": "192MB",
2022-10-25T16:20:58.387 app[7067fe0c] fra [info] "effective_io_concurrency": "200",
2022-10-25T16:20:58.387 app[7067fe0c] fra [info] "maintenance_work_mem": "64MB",
2022-10-25T16:20:58.387 app[7067fe0c] fra [info] "max_connections": "300",
2022-10-25T16:20:58.387 app[7067fe0c] fra [info] "max_parallel_workers": "8",
2022-10-25T16:20:58.387 app[7067fe0c] fra [info] "max_parallel_workers_per_gather": "2",
2022-10-25T16:20:58.387 app[7067fe0c] fra [info] "max_worker_processes": "8",
2022-10-25T16:20:58.387 app[7067fe0c] fra [info] "random_page_cost": "1.1",
2022-10-25T16:20:58.387 app[7067fe0c] fra [info] "shared_buffers": "64MB",
2022-10-25T16:20:58.387 app[7067fe0c] fra [info] "wal_compression": "on",
2022-10-25T16:20:58.387 app[7067fe0c] fra [info] "work_mem": "4MB"
2022-10-25T16:20:58.387 app[7067fe0c] fra [info] },
2022-10-25T16:20:58.387 app[7067fe0c] fra [info] "maxStandbysPerSender": 50
2022-10-25T16:20:58.387 app[7067fe0c] fra [info] }
2022-10-25T16:20:58.387 app[7067fe0c] fra [info] generated new config
2022-10-25T16:20:58.389 app[7067fe0c] fra [info] keeper | Running...
2022-10-25T16:20:58.392 app[7067fe0c] fra [info] sentinel | Running...
2022-10-25T16:20:58.394 app[7067fe0c] fra [info] proxy | Running...
2022-10-25T16:20:58.397 app[7067fe0c] fra [info] exporter | Running...
2022-10-25T16:20:58.488 app[7067fe0c] fra [info] exporter | INFO[0000] Starting Server: :9187 source="postgres_exporter.go:1837"
2022-10-25T16:20:58.507 app[7067fe0c] fra [info] proxy | [WARNING] 297/162058 (545) : parsing [/fly/haproxy.cfg:37]: Missing LF on last line, file might have been truncated at position 171. This will become a hard error in HAProxy 2.3.
2022-10-25T16:20:58.520 app[7067fe0c] fra [info] proxy | [NOTICE] 297/162058 (545) : New worker #1 (569) forked
2022-10-25T16:20:58.593 app[7067fe0c] fra [info] keeper | 2022-10-25T16:20:58.592Z INFO cmd/keeper.go:2090 exclusive lock on data dir taken
2022-10-25T16:20:58.593 app[7067fe0c] fra [info] sentinel | 2022-10-25T16:20:58.592Z INFO cmd/sentinel.go:2000 sentinel uid {"uid": "3e646279"}
2022-10-25T16:20:59.388 app[7067fe0c] fra [info] checking stolon status
2022-10-25T16:20:59.945 app[7067fe0c] fra [info] exporter | INFO[0001] Established new database connection to "fdaa:0:6924:a7b:23c4:1:855:2:5433". source="postgres_exporter.go:970"
2022-10-25T16:21:00.698 app[7067fe0c] fra [info] keeper | 2022-10-25T16:21:00.695Z INFO cmd/keeper.go:568 keeper uid {"uid": "23c418552"}
2022-10-25T16:21:00.707 app[7067fe0c] fra [info] keeper is healthy, db is healthy, role: master
2022-10-25T16:21:00.716 app[7067fe0c] fra [info] sentinel | 2022-10-25T16:21:00.714Z INFO cmd/sentinel.go:82 Trying to acquire sentinels leadership
2022-10-25T16:21:00.721 app[7067fe0c] fra [info] keeper | 2022-10-25T16:21:00.719Z INFO cmd/keeper.go:1093 our db boot UID is different than the cluster data one, waiting for it to be updated {"bootUUID": "c31e5e00-fe1b-4c76-ac9a-b09ae2bc71c9", "clusterBootUUID": "6d160c46-0d9b-4302-99ec-b8dc1493ceff"}
2022-10-25T16:21:00.730 app[7067fe0c] fra [info] keeper | 2022-10-25T16:21:00.719Z ERROR cmd/keeper.go:719 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2022-10-25T16:21:00.761 app[7067fe0c] fra [info] sentinel | 2022-10-25T16:21:00.761Z INFO cmd/sentinel.go:89 sentinel leadership acquired
2022-10-25T16:21:00.950 app[7067fe0c] fra [info] exporter | ERRO[0002] Error opening connection to database (postgresql://flypgadmin:PASSWORD_REMOVED@[fdaa:0:6924:a7b:23c4:1:855:2]:5433/postgres?sslmode=disable): dial tcp [fdaa:0:6924:a7b:23c4:1:855:2]:5433: connect: connection refused source="postgres_exporter.go:1658"
2022-10-25T16:21:03.224 app[7067fe0c] fra [info] keeper | 2022-10-25T16:21:03.223Z ERROR cmd/keeper.go:719 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2022-10-25T16:21:05.709 app[7067fe0c] fra [info] error connecting to local postgres context deadline exceeded
2022-10-25T16:21:05.709 app[7067fe0c] fra [info] checking stolon status
2022-10-25T16:21:05.732 app[7067fe0c] fra [info] keeper | 2022-10-25T16:21:05.726Z ERROR cmd/keeper.go:719 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2022-10-25T16:21:05.758 app[7067fe0c] fra [info] keeper | 2022-10-25T16:21:05.757Z INFO cmd/keeper.go:1093 our db boot UID is different than the cluster data one, waiting for it to be updated {"bootUUID": "c31e5e00-fe1b-4c76-ac9a-b09ae2bc71c9", "clusterBootUUID": "6d160c46-0d9b-4302-99ec-b8dc1493ceff"}
2022-10-25T16:21:05.866 app[7067fe0c] fra [info] keeper is healthy, db is healthy, role: master
2022-10-25T16:21:08.228 app[7067fe0c] fra [info] keeper | 2022-10-25T16:21:08.228Z ERROR cmd/keeper.go:719 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2022-10-25T16:21:10.731 app[7067fe0c] fra [info] keeper | 2022-10-25T16:21:10.729Z ERROR cmd/keeper.go:719 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2022-10-25T16:21:10.866 app[7067fe0c] fra [info] error connecting to local postgres context deadline exceeded
2022-10-25T16:21:10.866 app[7067fe0c] fra [info] checking stolon status
2022-10-25T16:21:10.916 app[7067fe0c] fra [info] keeper | 2022-10-25T16:21:10.916Z INFO cmd/keeper.go:1504 our db requested role is master
2022-10-25T16:21:10.948 app[7067fe0c] fra [info] keeper | 2022-10-25T16:21:10.945Z INFO postgresql/postgresql.go:319 starting database
2022-10-25T16:21:10.991 app[7067fe0c] fra [info] keeper | 2022-10-25 16:21:10.990 UTC [603] LOG: starting PostgreSQL 14.2 (Debian 14.2-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2022-10-25T16:21:10.991 app[7067fe0c] fra [info] keeper | 2022-10-25 16:21:10.990 UTC [603] LOG: listening on IPv6 address "fdaa:0:6924:a7b:23c4:1:855:2", port 5433
2022-10-25T16:21:10.992 app[7067fe0c] fra [info] keeper | 2022-10-25 16:21:10.991 UTC [603] LOG: listening on Unix socket "/tmp/.s.PGSQL.5433"
2022-10-25T16:21:10.996 app[7067fe0c] fra [info] keeper | 2022-10-25 16:21:10.995 UTC [604] LOG: database system was shut down at 2022-10-25 16:20:40 UTC
2022-10-25T16:21:11.003 app[7067fe0c] fra [info] keeper | 2022-10-25 16:21:11.002 UTC [603] LOG: database system is ready to accept connections
2022-10-25T16:21:11.098 app[7067fe0c] fra [info] keeper is healthy, db is healthy, role: master
2022-10-25T16:21:11.128 app[7067fe0c] fra [info] configuring operator
2022-10-25T16:21:11.149 app[7067fe0c] fra [info] operator password does not match config, changing
2022-10-25T16:21:11.157 app[7067fe0c] fra [info] operator ready!
2022-10-25T16:21:11.157 app[7067fe0c] fra [info] configuring repluser
2022-10-25T16:21:11.157 app[7067fe0c] fra [info] repluser password does not match config, changing
2022-10-25T16:21:11.167 app[7067fe0c] fra [info] keeper | 2022-10-25T16:21:11.166Z INFO cmd/keeper.go:1542 already master
2022-10-25T16:21:11.168 app[7067fe0c] fra [info] replication ready!
2022-10-25T16:21:11.197 app[7067fe0c] fra [info] keeper | 2022-10-25T16:21:11.197Z INFO cmd/keeper.go:1675 postgres parameters not changed
2022-10-25T16:21:11.198 app[7067fe0c] fra [info] keeper | 2022-10-25T16:21:11.197Z INFO cmd/keeper.go:1702 postgres hba entries not changed
2022-10-25T16:21:14.944 app[7067fe0c] fra [info] exporter | INFO[0016] Established new database connection to "fdaa:0:6924:a7b:23c4:1:855:2:5433". source="postgres_exporter.go:970"
2022-10-25T16:21:14.956 app[7067fe0c] fra [info] exporter | INFO[0016] Established new database connection to "fdaa:0:6924:a7b:23c4:1:855:2:5433". source="postgres_exporter.go:970"
2022-10-25T16:21:14.982 app[7067fe0c] fra [info] exporter | INFO[0016] Semantic Version Changed on "fdaa:0:6924:a7b:23c4:1:855:2:5433": 0.0.0 -> 14.2.0 source="postgres_exporter.go:1539"
2022-10-25T16:21:14.983 app[7067fe0c] fra [info] exporter | INFO[0016] Semantic Version Changed on "fdaa:0:6924:a7b:23c4:1:855:2:5433": 0.0.0 -> 14.2.0 source="postgres_exporter.go:1539"
2022-10-25T16:21:16.301 app[7067fe0c] fra [info] keeper | 2022-10-25T16:21:16.301Z INFO cmd/keeper.go:1504 our db requested role is master
2022-10-25T16:21:16.302 app[7067fe0c] fra [info] keeper | 2022-10-25T16:21:16.302Z INFO cmd/keeper.go:1542 already master
2022-10-25T16:21:16.333 app[7067fe0c] fra [info] keeper | 2022-10-25T16:21:16.333Z INFO cmd/keeper.go:1675 postgres parameters not changed
2022-10-25T16:21:16.333 app[7067fe0c] fra [info] keeper | 2022-10-25T16:21:16.333Z INFO cmd/keeper.go:1702 postgres hba entries not changed
2022-10-25T16:21:21.449 app[7067fe0c] fra [info] keeper | 2022-10-25T16:21:21.449Z INFO cmd/keeper.go:1504 our db requested role is master
2022-10-25T16:21:21.450 app[7067fe0c] fra [info] keeper | 2022-10-25T16:21:21.450Z INFO cmd/keeper.go:1542 already master
2022-10-25T16:21:21.483 app[7067fe0c] fra [info] keeper | 2022-10-25T16:21:21.482Z INFO cmd/keeper.go:1675 postgres parameters not changed
2022-10-25T16:21:21.483 app[7067fe0c] fra [info] keeper | 2022-10-25T16:21:21.483Z INFO cmd/keeper.go:1702 postgres hba entries not changed
2022-10-25T16:21:26.577 app[7067fe0c] fra [info] keeper | 2022-10-25T16:21:26.576Z INFO cmd/keeper.go:1504 our db requested role is master
2022-10-25T16:21:26.577 app[7067fe0c] fra [info] keeper | 2022-10-25T16:21:26.577Z INFO cmd/keeper.go:1542 already master
2022-10-25T16:21:26.602 app[7067fe0c] fra [info] keeper | 2022-10-25T16:21:26.601Z INFO cmd/keeper.go:1675 postgres parameters not changed
2022-10-25T16:21:26.602 app[7067fe0c] fra [info] keeper | 2022-10-25T16:21:26.601Z INFO cmd/keeper.go:1702 postgres hba entries not changed
2022-10-25T16:21:31.658 app[7067fe0c] fra [info] keeper | 2022-10-25T16:21:31.658Z INFO cmd/keeper.go:1504 our db requested role is master
2022-10-25T16:21:31.659 app[7067fe0c] fra [info] keeper | 2022-10-25T16:21:31.659Z INFO cmd/keeper.go:1542 already master
2022-10-25T16:21:31.693 app[7067fe0c] fra [info] keeper | 2022-10-25T16:21:31.692Z INFO cmd/keeper.go:1675 postgres parameters not changed
2022-10-25T16:21:31.693 app[7067fe0c] fra [info] keeper | 2022-10-25T16:21:31.693Z INFO cmd/keeper.go:1702 postgres hba entries not changed
2022-10-25T16:21:36.778 app[7067fe0c] fra [info] keeper | 2022-10-25T16:21:36.778Z INFO cmd/keeper.go:1504 our db requested role is master
2022-10-25T16:21:36.779 app[7067fe0c] fra [info] keeper | 2022-10-25T16:21:36.779Z INFO cmd/keeper.go:1542 already master
2022-10-25T16:21:36.809 app[7067fe0c] fra [info] keeper | 2022-10-25T16:21:36.809Z INFO cmd/keeper.go:1675 postgres parameters not changed
2022-10-25T16:21:36.810 app[7067fe0c] fra [info] keeper | 2022-10-25T16:21:36.809Z INFO cmd/keeper.go:1702 postgres hba entries not changed
2022-10-25T16:21:41.875 app[7067fe0c] fra [info] keeper | 2022-10-25T16:21:41.875Z INFO cmd/keeper.go:1504 our db requested role is master
2022-10-25T16:21:41.876 app[7067fe0c] fra [info] keeper | 2022-10-25T16:21:41.876Z INFO cmd/keeper.go:1542 already master
2022-10-25T16:21:41.923 app[7067fe0c] fra [info] keeper | 2022-10-25T16:21:41.922Z INFO cmd/keeper.go:1675 postgres parameters not changed
2022-10-25T16:21:41.923 app[7067fe0c] fra [info] keeper | 2022-10-25T16:21:41.923Z INFO cmd/keeper.go:1702 postgres hba entries not changed
2022-10-25T16:21:46.998 app[7067fe0c] fra [info] keeper | 2022-10-25T16:21:46.997Z INFO cmd/keeper.go:1504 our db requested role is master
2022-10-25T16:21:46.999 app[7067fe0c] fra [info] keeper | 2022-10-25T16:21:46.998Z INFO cmd/keeper.go:1542 already master
2022-10-25T16:21:47.038 app[7067fe0c] fra [info] keeper | 2022-10-25T16:21:47.037Z INFO cmd/keeper.go:1675 postgres parameters not changed
2022-10-25T16:21:47.038 app[7067fe0c] fra [info] keeper | 2022-10-25T16:21:47.038Z INFO cmd/keeper.go:1702 postgres hba entries not changed

I didn’t see a specific error in the logs and lurked a bit around the forum and found other people reporting similar (kind of) errors and one of the reasons was running out of disk space.

I doubled the disk size from 1G to 2G, but that didn’t help. Then I doubled memory from 256mb to 512mb, but the change wasn’t picked up, so I redeployed it.

fly deploy . --app app-name-db --image flyio/postgres:14.2

and now it just can’t start:

022-10-25T16:37:26.933 app[00dfab56] ord [info] proxy | exit status 1
2022-10-25T16:37:26.933 app[00dfab56] ord [info] proxy | restarting in 1s [attempt 169]
2022-10-25T16:37:27.933 app[00dfab56] ord [info] proxy | Running...
2022-10-25T16:37:27.937 app[00dfab56] ord [info] proxy | [NOTICE] 297/163727 (1478) : haproxy version is 2.2.9-2+deb11u3
2022-10-25T16:37:27.937 app[00dfab56] ord [info] proxy | [NOTICE] 297/163727 (1478) : path to executable is /usr/sbin/haproxy
2022-10-25T16:37:27.937 app[00dfab56] ord [info] proxy | [ALERT] 297/163727 (1478) : parsing [/fly/haproxy.cfg:37] : Can't create DNS resolution for server '(null)'
2022-10-25T16:37:27.937 app[00dfab56] ord [info] proxy | [WARNING] 297/163727 (1478) : parsing [/fly/haproxy.cfg:37]: Missing LF on last line, file might have been truncated at position 171. This will become a hard error in HAProxy 2.3.
2022-10-25T16:37:27.937 app[00dfab56] ord [info] proxy | [ALERT] 297/163727 (1478) : Error(s) found in configuration file : /fly/haproxy.cfg
2022-10-25T16:37:27.937 app[00dfab56] ord [info] proxy | [ALERT] 297/163727 (1478) : Fatal errors found in configuration.

Now everything is in shambles.

Next I will try to scale it to two instances so hopefully we won’t lose staging, but if it won’t work I’ll just drop it.

Is there anything else that I can do?

UPDT:

I tried to follow the Fly Postgres · Fly Docs to setup a database replica but suddenly realised that there are no machines to clone.

However I can login into one :man_shrugging:

root@353e7f69:/# netstat -nl
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:5432            0.0.0.0:*               LISTEN
tcp6       0      0 :::5500                 :::*                    LISTEN
tcp6       0      0 :::9187                 :::*                    LISTEN
tcp6       0      0 :::8404                 :::*                    LISTEN
tcp6       0      0 fdaa:0:6924:a7b:23c4:22 :::*                    LISTEN
tcp6       0      0 :::5432                 :::*                    LISTEN
tcp6       0      0 fdaa:0:6924:a7b:23:5433 :::*                    LISTEN
Active UNIX domain sockets (only servers)
Proto RefCnt Flags       Type       State         I-Node   Path
unix  2      [ ACC ]     STREAM     LISTENING     4507     /run/haproxy/haproxy.sock.545.tmp
unix  2      [ ACC ]     STREAM     LISTENING     4772     /tmp/.s.PGSQL.5433
root@353e7f69:/# psql -h localhost
psql: error: connection to server at "localhost" (::1), port 5432 failed: server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
1 Like

I’ve got the same problem…