PostgreSQL down, not starting

This morning, after upgrading PostgreSQL image in one of my apps (tapiceriasmarcos-db) it didn’t start again.

It shows this error:

 2022-11-13T09:47:44.304 runner[02b7b4e8] mad [info] Starting virtual machine

2022-11-13T09:47:44.493 app[02b7b4e8] mad [info] Starting init (commit: 81d5330)...

2022-11-13T09:47:44.512 app[02b7b4e8] mad [info] Mounting /dev/vdc at /data w/ uid: 0, gid: 0 and chmod 0755

2022-11-13T09:47:44.515 app[02b7b4e8] mad [info] Preparing to run: `docker-entrypoint.sh start` as root

2022-11-13T09:47:44.531 app[02b7b4e8] mad [info] 2022/11/13 09:47:44 listening on [fdaa:0:903a:a7b:2658:1:af59:2]:22 (DNS: [fdaa::3]:53)

2022-11-13T09:47:44.584 app[02b7b4e8] mad [info] cluster spec filename /fly/cluster-spec.json

2022-11-13T09:47:44.585 app[02b7b4e8] mad [info] system memory: 256mb vcpu count: 1

2022-11-13T09:47:44.585 app[02b7b4e8] mad [info] {

2022-11-13T09:47:44.585 app[02b7b4e8] mad [info] "initMode": "existing",

2022-11-13T09:47:44.585 app[02b7b4e8] mad [info] "existingConfig": {

2022-11-13T09:47:44.585 app[02b7b4e8] mad [info] "keeperUID": "26581af592"

2022-11-13T09:47:44.585 app[02b7b4e8] mad [info] },

2022-11-13T09:47:44.585 app[02b7b4e8] mad [info] "pgParameters": {

2022-11-13T09:47:44.585 app[02b7b4e8] mad [info] "archive_command": "if [ $ENABLE_WALG ]; then /usr/local/bin/wal-g wal-push \"%p\"; fi",

2022-11-13T09:47:44.585 app[02b7b4e8] mad [info] "archive_mode": "on",

2022-11-13T09:47:44.585 app[02b7b4e8] mad [info] "archive_timeout": "60",

2022-11-13T09:47:44.585 app[02b7b4e8] mad [info] "effective_cache_size": "192MB",

2022-11-13T09:47:44.585 app[02b7b4e8] mad [info] "effective_io_concurrency": "200",

2022-11-13T09:47:44.585 app[02b7b4e8] mad [info] "maintenance_work_mem": "64MB",

2022-11-13T09:47:44.585 app[02b7b4e8] mad [info] "max_connections": "300",

2022-11-13T09:47:44.585 app[02b7b4e8] mad [info] "max_parallel_workers": "8",

2022-11-13T09:47:44.585 app[02b7b4e8] mad [info] "max_parallel_workers_per_gather": "2",

2022-11-13T09:47:44.585 app[02b7b4e8] mad [info] "max_worker_processes": "8",

2022-11-13T09:47:44.585 app[02b7b4e8] mad [info] "random_page_cost": "1.1",

2022-11-13T09:47:44.585 app[02b7b4e8] mad [info] "shared_buffers": "64MB",

2022-11-13T09:47:44.585 app[02b7b4e8] mad [info] "wal_compression": "on",

2022-11-13T09:47:44.585 app[02b7b4e8] mad [info] "work_mem": "4MB"

2022-11-13T09:47:44.585 app[02b7b4e8] mad [info] },

2022-11-13T09:47:44.585 app[02b7b4e8] mad [info] "maxStandbysPerSender": 50,

2022-11-13T09:47:44.585 app[02b7b4e8] mad [info] "deadKeeperRemovalInterval": "1h"

2022-11-13T09:47:44.585 app[02b7b4e8] mad [info] }

2022-11-13T09:47:44.585 app[02b7b4e8] mad [info] generated new config

2022-11-13T09:47:44.586 app[02b7b4e8] mad [info] keeper | Running...

2022-11-13T09:47:44.587 app[02b7b4e8] mad [info] sentinel | Running...

2022-11-13T09:47:44.588 app[02b7b4e8] mad [info] proxy | Running...

2022-11-13T09:47:44.590 app[02b7b4e8] mad [info] exporter | Running...

2022-11-13T09:47:44.657 app[02b7b4e8] mad [info] exporter | INFO[0000] Starting Server: :9187 source="postgres_exporter.go:1837"

2022-11-13T09:47:44.671 app[02b7b4e8] mad [info] proxy | [WARNING] 316/094744 (546) : parsing [/fly/haproxy.cfg:38]: Missing LF on last line, file might have been truncated at position 96. This will become a hard error in HAProxy 2.3.

2022-11-13T09:47:44.681 app[02b7b4e8] mad [info] proxy | [NOTICE] 316/094744 (546) : New worker #1 (570) forked

2022-11-13T09:47:45.586 app[02b7b4e8] mad [info] checking stolon status

2022-11-13T09:47:46.738 app[02b7b4e8] mad [info] proxy | [WARNING] 316/094746 (570) : bk_db/pg1 changed its IP from (none) to fdaa:0:903a:a7b:2658:1:af59:2 by flydns/dns1.

2022-11-13T09:47:46.738 app[02b7b4e8] mad [info] proxy | [WARNING] 316/094746 (570) : Server bk_db/pg1 ('mad.tapiceriasmarcos-db.internal') is UP/READY (resolves again).

2022-11-13T09:47:46.738 app[02b7b4e8] mad [info] proxy | [WARNING] 316/094746 (570) : Server bk_db/pg1 administratively READY thanks to valid DNS answer.

2022-11-13T09:47:47.071 app[02b7b4e8] mad [info] panic: error checking stolon status: cannot get cluster data: invalid character '\x1d' in string literal

2022-11-13T09:47:47.071 app[02b7b4e8] mad [info] : exit status 1

2022-11-13T09:47:47.071 app[02b7b4e8] mad [info] goroutine 9 [running]:

2022-11-13T09:47:47.071 app[02b7b4e8] mad [info] main.main.func2(0xc0000d0000, 0xc000084a80)

2022-11-13T09:47:47.071 app[02b7b4e8] mad [info] /go/src/github.com/fly-examples/postgres-ha/cmd/start/main.go:81 +0x72c

2022-11-13T09:47:47.071 app[02b7b4e8] mad [info] created by main.main

2022-11-13T09:47:47.071 app[02b7b4e8] mad [info] /go/src/github.com/fly-examples/postgres-ha/cmd/start/main.go:72 +0x43b

2022-11-13T09:47:47.528 app[02b7b4e8] mad [info] Umorting clean up.

2022-11-13T09:47:47.528 app[02b7b4e8] mad [info] Umounting /dev/vdc from /data 

But the error itself looks like this:

2022-11-13T09:47:47.071 app[02b7b4e8] mad [info] panic: error checking stolon status: cannot get cluster data: invalid character '\x1d' in string literal

I tis a completely normal app, with low traffic and a small DB.
Any idea what is happening? The site is down (as expected)

I ahve been able to create a new DB instance from latest snapshot and manually attach it to the running instance.

Some aspects failing when doing it:

  • fly postgres detach did not work because the original isntance was down.
  • fly postgres attach did not work because the DB and user it was creating already existed (as it is a restore).

Finally I had to unset DATABASE_URL and set it back manually, but I’m not sure if the old instance keeps attached internally in some way to the app.

I will leave it online for a few days in case anyone from Fly.io wants to take a look at what happened.