EWR Postgres instances down?

I’ve been seeing issues with our Postgres instances that are apparently down for about 3 hours now. Errors in the logs that I’m seeing are:

2022-11-02T20:12:54Z runner[aeca3e4c] ewr [info]Starting instance
2022-11-02T20:12:55Z runner[aeca3e4c] ewr [info]Configuring virtual machine
2022-11-02T20:12:55Z runner[aeca3e4c] ewr [info]Preparing kernel init
2022-11-02T20:12:56Z runner[aeca3e4c] ewr [info]Configuring firecracker
2022-11-02T20:12:56Z app[aeca3e4c] ewr [info]Mounting /dev/vdc at /data w/ uid: 0, gid: 0 and chmod 0755
2022-11-02T20:12:56Z app[aeca3e4c] ewr [info]2022/11/02 20:12:56 listening on [fdaa:0:31ff:a7b:ab2:1:e5ec:2]:22 (DNS: [fdaa::3]:53)
2022-11-02T20:12:56Z app[aeca3e4c] ewr [info][    0.181986] EXT4-fs error (device vdc): ext4_lookup:1709: inode #811: comm chown: deleted inode referenced: 914
2022-11-02T20:12:56Z app[aeca3e4c] ewr [info][    0.196010] EXT4-fs error (device vdc): htree_dirblock_to_tree:1003: inode #1477: comm chown: Directory block failed checksum
2022-11-02T20:12:57Z app[aeca3e4c] ewr [info]Starting clean up.
2022-11-02T20:12:57Z app[aeca3e4c] ewr [info]Umounting /dev/vdc from /data
2022-11-02T20:13:03Z runner[aeca3e4c] ewr [info]Starting instance
2022-11-02T20:13:04Z runner[aeca3e4c] ewr [info]Configuring virtual machine
2022-11-02T20:13:04Z runner[aeca3e4c] ewr [info]Unpacking image
2022-11-02T20:13:04Z runner[aeca3e4c] ewr [info]Preparing kernel init
2022-11-02T20:13:05Z runner[aeca3e4c] ewr [info]Setting up volume 'pg_data'
2022-11-02T20:13:05Z runner[aeca3e4c] ewr [info]Configuring firecracker
2022-11-02T20:13:06Z app[aeca3e4c] ewr [info]Starting init (commit: ce4cf1b)...
2022-11-02T20:13:06Z app[aeca3e4c] ewr [info]Mounting /dev/vdc at /data w/ uid: 0, gid: 0 and chmod 0755
2022-11-02T20:13:06Z app[aeca3e4c] ewr [info]Preparing to run: `docker-entrypoint.sh start` as root
2022-11-02T20:13:06Z app[aeca3e4c] ewr [info]2022/11/02 20:13:06 listening on [fdaa:0:31ff:a7b:ab2:1:e5ec:2]:22 (DNS: [fdaa::3]:53)
2022-11-02T20:13:06Z app[aeca3e4c] ewr [info][    0.241713] EXT4-fs error (device vdc): htree_dirblock_to_tree:1003: inode #1482: comm chown: Directory block failed checksum
2022-11-02T20:13:06Z app[aeca3e4c] ewr [info]panic: exit status 1
2022-11-02T20:13:07Z app[aeca3e4c] ewr [info]Starting clean up.
2022-11-02T20:13:12Z app[c12ee776] ewr [info]sentinel | 2022-11-02T20:13:12.180Z	WARN	cmd/sentinel.go:276	no keeper info available{"db": "d4b50630", "keeper": "ab21e5ec2"}

I’m not seeing any outage on the status page, but maybe there’s an outage that just hasn’t been reported yet?

Hello! it looks like your postgres volume is reaching full capacity.

use fly volume list and fly volume extend to increase its size.

I doubled the volume size, and it’s still crashing. Current logs:

2022-11-02T21:06:38Z runner[c12ee776] ewr [info]Starting instance
2022-11-02T21:06:40Z runner[c12ee776] ewr [info]Configuring virtual machine
2022-11-02T21:06:40Z runner[c12ee776] ewr [info]Pulling container image
2022-11-02T21:06:40Z runner[c12ee776] ewr [info]Unpacking image
2022-11-02T21:06:40Z runner[c12ee776] ewr [info]Preparing kernel init
2022-11-02T21:06:40Z runner[c12ee776] ewr [info]Setting up volume 'pg_data'
2022-11-02T21:06:40Z runner[c12ee776] ewr [info]Configuring firecracker
2022-11-02T21:06:40Z runner[c12ee776] ewr [info]Starting virtual machine
2022-11-02T21:06:41Z app[c12ee776] ewr [info]Starting init (commit: ce4cf1b)...
2022-11-02T21:06:41Z app[c12ee776] ewr [info]Mounting /dev/vdc at /data w/ uid: 0, gid: 0 and chmod 0755
2022-11-02T21:06:41Z app[c12ee776] ewr [info]Preparing to run: `docker-entrypoint.sh start` as root
2022-11-02T21:06:41Z app[c12ee776] ewr [info]2022/11/02 21:06:41 listening on [fdaa:0:31ff:a7b:ab3:1:e5ed:2]:22 (DNS: [fdaa::3]:53)
2022-11-02T21:06:41Z app[c12ee776] ewr [info]cluster spec filename /fly/cluster-spec.json
2022-11-02T21:06:41Z app[c12ee776] ewr [info]system memory: 1024mb vcpu count: 1
2022-11-02T21:06:41Z app[c12ee776] ewr [info]{
2022-11-02T21:06:41Z app[c12ee776] ewr [info]    "initMode": "existing",
2022-11-02T21:06:41Z app[c12ee776] ewr [info]    "existingConfig": {
2022-11-02T21:06:41Z app[c12ee776] ewr [info]        "keeperUID": "ab31e5ed2"
2022-11-02T21:06:41Z app[c12ee776] ewr [info]    },
2022-11-02T21:06:41Z app[c12ee776] ewr [info]sentinel | Running...
2022-11-02T21:06:41Z app[c12ee776] ewr [info]exporter | INFO[0000] Starting Server: :9187                        source="postgres_exporter.go:1837"
2022-11-02T21:06:41Z app[c12ee776] ewr [info]proxy    | [WARNING] 305/210641 (546) : parsing [/fly/haproxy.cfg:38]: Missing LF on last line, file might have been truncated at position 96. This will become a hard error in HAProxy 2.3.
2022-11-02T21:06:41Z app[c12ee776] ewr [info]proxy    | [NOTICE] 305/210641 (546) : New worker #1 (570) forked
2022-11-02T21:06:42Z app[c12ee776] ewr [info]checking stolon status
2022-11-02T21:06:43Z app[c12ee776] ewr [info]proxy    | [WARNING] 305/210643 (570) : bk_db/pg1 changed its IP from (none) to fdaa:0:31ff:a7b:ab2:1:e5ec:2 by flydns/dns1.
2022-11-02T21:06:43Z app[c12ee776] ewr [info]proxy    | [WARNING] 305/210643 (570) : Server bk_db/pg1 ('ewr.treblehq-db.internal') is UP/READY (resolves again).
2022-11-02T21:06:43Z app[c12ee776] ewr [info]proxy    | [WARNING] 305/210643 (570) : Server bk_db/pg1 administratively READY thanks to valid DNS answer.
2022-11-02T21:06:43Z app[c12ee776] ewr [info]proxy    | [WARNING] 305/210643 (570) : bk_db/pg2 changed its IP from (none) to fdaa:0:31ff:a7b:ab3:1:e5ed:2 by DNS cache.
2022-11-02T21:06:43Z app[c12ee776] ewr [info]proxy    | [WARNING] 305/210643 (570) : Server bk_db/pg2 ('ewr.treblehq-db.internal') is UP/READY (resolves again).
2022-11-02T21:06:43Z app[c12ee776] ewr [info]proxy    | [WARNING] 305/210643 (570) : Server bk_db/pg2 administratively READY thanks to valid DNS answer.
2022-11-02T21:06:43Z app[c12ee776] ewr [info]keeper is healthy, db is healthy, role: master
2022-11-02T21:06:43Z app[c12ee776] ewr [info]keeper   | 2022-11-02T21:06:43.793Z	ERROR	cmd/keeper.go:719	cannot get configured pg parameters	{"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2022-11-02T21:06:47Z app[c12ee776] ewr [info]proxy    | [WARNING] 305/210647 (570) : Server bk_db/pg1 is DOWN, reason: Layer4 timeout, check duration: 2002ms. 1 active and 1 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2022-11-02T21:06:48Z app[c12ee776] ewr [info]proxy    | [WARNING] 305/210648 (570) : Backup Server bk_db/pg is DOWN, reason: Layer7 timeout, check duration: 5001ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2022-11-02T21:06:48Z app[c12ee776] ewr [info]proxy    | [WARNING] 305/210648 (570) : Server bk_db/pg2 is DOWN, reason: Layer7 timeout, check duration: 5000ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2022-11-02T21:06:48Z app[c12ee776] ewr [info]proxy    | [NOTICE] 305/210648 (570) : haproxy version is 2.2.9-2+deb11u3
2022-11-02T21:06:48Z app[c12ee776] ewr [info]proxy    | [NOTICE] 305/210648 (570) : path to executable is /usr/sbin/haproxy
2022-11-02T21:06:48Z app[c12ee776] ewr [info]proxy    | [ALERT] 305/210648 (570) : backend 'bk_db' has no server available!
2022-11-02T21:06:48Z app[c12ee776] ewr [info]error connecting to local postgres context deadline exceeded
2022-11-02T21:06:48Z app[c12ee776] ewr [info]checking stolon status
2022-11-02T21:06:48Z app[c12ee776] ewr [info]keeper   | 2022-11-02T21:06:48.794Z	ERROR	cmd/keeper.go:719	cannot get configured pg parameters	{"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2022-11-02T21:06:49Z app[c12ee776] ewr [info]keeper is healthy, db is healthy, role: master
2022-11-02T21:06:49Z app[c12ee776] ewr [info]sentinel | 2022-11-02T21:06:49.559Z	WARN	cmd/sentinel.go:276	no keeper info available{"db": "d4b50630", "keeper": "ab21e5ec2"}
2022-11-02T21:06:51Z app[c12ee776] ewr [info]keeper   | 2022-11-02T21:06:51.295Z	ERROR	cmd/keeper.go:719	cannot get configured pg parameters	{"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2022-11-02T21:06:53Z app[c12ee776] ewr [info]keeper   | 2022-11-02T21:06:53.795Z	ERROR	cmd/keeper.go:719	cannot get configured pg parameters	{"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2022-11-02T21:06:54Z app[c12ee776] ewr [info]keeper   | 2022-11-02 21:06:54.200 UTC [599] LOG:  starting PostgreSQL 14.4 (Debian 14.4-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2022-11-02T21:06:54Z app[c12ee776] ewr [info]keeper   | 2022-11-02 21:06:54.200 UTC [599] LOG:  listening on IPv6 address "fdaa:0:31ff:a7b:ab3:1:e5ed:2", port 5433
2022-11-02T21:06:54Z app[c12ee776] ewr [info]keeper   | 2022-11-02 21:06:54.201 UTC [599] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5433"
2022-11-02T21:06:54Z app[c12ee776] ewr [info]keeper   | 2022-11-02 21:06:54.204 UTC [600] LOG:  database system was shut down at 2022-11-02 21:06:31 UTC
2022-11-02T21:06:54Z app[c12ee776] ewr [info]keeper   | 2022-11-02 21:06:54.211 UTC [599] LOG:  database system is ready to accept connections
2022-11-02T21:06:54Z app[c12ee776] ewr [info]error connecting to local postgres context deadline exceeded
2022-11-02T21:06:54Z app[c12ee776] ewr [info]checking stolon status
2022-11-02T21:06:54Z app[c12ee776] ewr [info]keeper is healthy, db is healthy, role: master
2022-11-02T21:06:54Z app[c12ee776] ewr [info]configuring operator
2022-11-02T21:06:54Z app[c12ee776] ewr [info]operator password does not match config, changing
2022-11-02T21:06:54Z app[c12ee776] ewr [info]operator ready!
2022-11-02T21:06:54Z app[c12ee776] ewr [info]configuring repluser
2022-11-02T21:06:54Z app[c12ee776] ewr [info]repluser password does not match config, changing
2022-11-02T21:06:54Z app[c12ee776] ewr [info]replication ready!
2022-11-02T21:06:55Z app[c12ee776] ewr [info]sentinel | 2022-11-02T21:06:55.511Z	WARN	cmd/sentinel.go:276	no keeper info available{"db": "d4b50630", "keeper": "ab21e5ec2"}
2022-11-02T21:07:01Z app[c12ee776] ewr [info]sentinel | 2022-11-02T21:07:01.555Z	WARN	cmd/sentinel.go:276	no keeper info available{"db": "d4b50630", "keeper": "ab21e5ec2"}
2022-11-02T21:07:36Z app[c12ee776] ewr [info]sentinel | 2022-11-02T21:07:36.228Z	WARN	cmd/sentinel.go:276	no keeper info available{"db": "d4b50630", "keeper": "ab21e5ec2"}
2022-11-02T21:07:38Z app[c12ee776] ewr [info]exporter | INFO[0056] Semantic Version Changed on "fdaa:0:31ff:a7b:ab3:1:e5ed:2:5433": 0.0.0 -> 14.4.0  source="postgres_exporter.go:1539"
2022-11-02T21:07:38Z app[c12ee776] ewr [info]exporter | INFO[0056] Established new database connection to "fdaa:0:31ff:a7b:ab3:1:e5ed:2:5433".  source="postgres_exporter.go:970"
2022-11-02T21:07:59Z app[c12ee776] ewr [info]sentinel | 2022-11-02T21:07:59.698Z	WARN	cmd/sentinel.go:276	no keeper info available{"db": "d4b50630", "keeper": "ab21e5ec2"}
2022-11-02T21:08:11Z app[c12ee776] ewr [info]sentinel | 2022-11-02T21:08:11.074Z	WARN	cmd/sentinel.go:276	no keeper info available{"db": "d4b50630", "keeper": "ab21e5ec2"}
2022-11-02T21:08:16Z app[c12ee776] ewr [info]sentinel | 2022-11-02T21:08:16.525Z	WARN	cmd/sentinel.go:276	no keeper info available{"db": "d4b50630", "keeper": "ab21e5ec2"}
2022-11-02T21:08:28Z app[c12ee776] ewr [info]sentinel | 2022-11-02T21:08:28.049Z	WARN	cmd/sentinel.go:276	no keeper info available{"db": "d4b50630", "keeper": "ab21e5ec2"}
2022-11-02T21:08:40Z app[c12ee776] ewr [info]sentinel | 2022-11-02T21:08:40.031Z	WARN	cmd/sentinel.go:276	no keeper info available{"db": "d4b50630", "keeper": "ab21e5ec2"}
2022-11-02T21:08:45Z app[c12ee776] ewr [info]sentinel | 2022-11-02T21:08:45.534Z	WARN	cmd/sentinel.go:276	no keeper info available{"db": "d4b50630", "keeper": "ab21e5ec2"}
2022-11-02T21:08:51Z app[c12ee776] ewr [info]sentinel | 2022-11-02T21:08:51.002Z	WARN	cmd/sentinel.go:276	no keeper info available{"db": "d4b50630", "keeper": "ab21e5ec2"}
2022-11-02T21:08:56Z app[c12ee776] ewr [info]sentinel | 2022-11-02T21:08:56.270Z	WARN	cmd/sentinel.go:276	no keeper info available{"db": "d4b50630", "keeper": "ab21e5ec2"}
2022-11-02T21:09:01Z app[c12ee776] ewr [info]sentinel | 2022-11-02T21:09:01.947Z	WARN	cmd/sentinel.go:276	no keeper info available{"db": "d4b50630", "keeper": "ab21e5ec2"}
2022-11-02T21:09:07Z app[c12ee776] ewr [info]sentinel | 2022-11-02T21:09:07.491Z	WARN	cmd/sentinel.go:276	no keeper info available{"db": "d4b50630", "keeper": "ab21e5ec2"}
2022-11-02T21:09:19Z app[c12ee776] ewr [info]sentinel | 2022-11-02T21:09:19.203Z	WARN	cmd/sentinel.go:276	no keeper info available{"db": "d4b50630", "keeper": "ab21e5ec2"}
2022-11-02T21:09:24Z app[c12ee776] ewr [info]sentinel | 2022-11-02T21:09:24.747Z	WARN	cmd/sentinel.go:276	no keeper info available{"db": "d4b50630", "keeper": "ab21e5ec2"}
2022-11-02T21:09:35Z app[c12ee776] ewr [info]sentinel | 2022-11-02T21:09:35.759Z	WARN	cmd/sentinel.go:276	no keeper info available{"db": "d4b50630", "keeper": "ab21e5ec2"}
2022-11-02T21:09:40Z app[c12ee776] ewr [info]proxy    | [WARNING] 305/210940 (570) : Server bk_db/pg1 was DOWN and now enters maintenance (No IP for server ).
2022-11-02T21:09:41Z app[c12ee776] ewr [info]sentinel | 2022-11-02T21:09:41.246Z	WARN	cmd/sentinel.go:276	no keeper info available{"db": "d4b50630", "keeper": "ab21e5ec2"}
2022-11-02T21:09:46Z app[c12ee776] ewr [info]sentinel | 2022-11-02T21:09:46.970Z	WARN	cmd/sentinel.go:276	no keeper info available{"db": "d4b50630", "keeper": "ab21e5ec2"}
2022-11-02T21:09:53Z app[c12ee776] ewr [info]sentinel | 2022-11-02T21:09:53.019Z	WARN	cmd/sentinel.go:276	no keeper info available{"db": "d4b50630", "keeper": "ab21e5ec2"}

The filesystem was still resizing after the restart. I also recreated the replica and the cluster looks healthy again. cya!