Postgres failing to start with Stolon error

The Postgres app had been running fine for months, then suddenly stopped and is now failing to start. I have tried restarting, upgrading the image, scaling, and adding regions; all produce the same result:

2022-10-10T02:34:52.593 app[99f02c4f] sjc [info] panic: error checking stolon status: cannot get cluster data: invalid character '\x02' in string literal
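For reference, the restart, image upgrade, and scaling attempts were done with the usual flyctl commands, roughly like this (the app name is a placeholder, not the real one):

    fly apps restart my-postgres-app
    fly image update -a my-postgres-app
    fly scale count 2 -a my-postgres-app
    fly regions add lax -a my-postgres-app   # region is illustrative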

Full log:

2022-10-10T02:34:34.201 runner[99f02c4f] sjc [info] Starting instance
2022-10-10T02:34:35.294 runner[99f02c4f] sjc [info] Configuring virtual machine
2022-10-10T02:34:35.295 runner[99f02c4f] sjc [info] Pulling container image
2022-10-10T02:34:41.466 runner[99f02c4f] sjc [info] Unpacking image
2022-10-10T02:34:49.302 runner[99f02c4f] sjc [info] Preparing kernel init
2022-10-10T02:34:49.475 runner[99f02c4f] sjc [info] Setting up volume 'pg_data'
2022-10-10T02:34:49.614 runner[99f02c4f] sjc [info] Configuring firecracker
2022-10-10T02:34:49.695 runner[99f02c4f] sjc [info] Starting virtual machine
2022-10-10T02:34:49.933 app[99f02c4f] sjc [info] Starting init (commit: bc249e4)...
2022-10-10T02:34:49.953 app[99f02c4f] sjc [info] Mounting /dev/vdc at /data w/ uid: 0, gid: 0 and chmod 0755
2022-10-10T02:34:49.956 app[99f02c4f] sjc [info] Preparing to run: `docker-entrypoint.sh start` as root
2022-10-10T02:34:49.972 app[99f02c4f] sjc [info] 2022/10/10 02:34:49 listening on [fdaa:0:xxx:a7b:x:2:xx:2]:22 (DNS: [fdaa::3]:53)
2022-10-10T02:34:50.019 app[99f02c4f] sjc [info] cluster spec filename /fly/cluster-spec.json
2022-10-10T02:34:50.019 app[99f02c4f] sjc [info] system memory: 256mb vcpu count: 1
2022-10-10T02:34:50.019 app[99f02c4f] sjc [info] {
2022-10-10T02:34:50.019 app[99f02c4f] sjc [info] "initMode": "new",
2022-10-10T02:34:50.019 app[99f02c4f] sjc [info] "existingConfig": {},
2022-10-10T02:34:50.019 app[99f02c4f] sjc [info] "pgParameters": {
2022-10-10T02:34:50.019 app[99f02c4f] sjc [info] "archive_command": "if [ $ENABLE_WALG ]; then /usr/local/bin/wal-g wal-push \"%p\"; fi",
2022-10-10T02:34:50.019 app[99f02c4f] sjc [info] "archive_mode": "on",
2022-10-10T02:34:50.019 app[99f02c4f] sjc [info] "archive_timeout": "60",
2022-10-10T02:34:50.019 app[99f02c4f] sjc [info] "effective_cache_size": "192MB",
2022-10-10T02:34:50.019 app[99f02c4f] sjc [info] "effective_io_concurrency": "200",
2022-10-10T02:34:50.019 app[99f02c4f] sjc [info] "maintenance_work_mem": "64MB",
2022-10-10T02:34:50.019 app[99f02c4f] sjc [info] "max_connections": "300",
2022-10-10T02:34:50.019 app[99f02c4f] sjc [info] "max_parallel_workers": "8",
2022-10-10T02:34:50.019 app[99f02c4f] sjc [info] "max_parallel_workers_per_gather": "2",
2022-10-10T02:34:50.019 app[99f02c4f] sjc [info] "max_wal_size": "1001MB",
2022-10-10T02:34:50.019 app[99f02c4f] sjc [info] "max_worker_processes": "8",
2022-10-10T02:34:50.019 app[99f02c4f] sjc [info] "min_wal_size": "250MB",
2022-10-10T02:34:50.019 app[99f02c4f] sjc [info] "random_page_cost": "1.1",
2022-10-10T02:34:50.019 app[99f02c4f] sjc [info] "shared_buffers": "64MB",
2022-10-10T02:34:50.019 app[99f02c4f] sjc [info] "wal_compression": "on",
2022-10-10T02:34:50.019 app[99f02c4f] sjc [info] "wal_keep_segments": "0",
2022-10-10T02:34:50.019 app[99f02c4f] sjc [info] "work_mem": "4MB"
2022-10-10T02:34:50.019 app[99f02c4f] sjc [info] },
2022-10-10T02:34:50.019 app[99f02c4f] sjc [info] "maxStandbysPerSender": 50,
2022-10-10T02:34:50.019 app[99f02c4f] sjc [info] "deadKeeperRemovalInterval": "1h"
2022-10-10T02:34:50.019 app[99f02c4f] sjc [info] }
2022-10-10T02:34:50.019 app[99f02c4f] sjc [info] generated new config
2022-10-10T02:34:50.020 app[99f02c4f] sjc [info] keeper | Running...
2022-10-10T02:34:50.041 app[99f02c4f] sjc [info] sentinel | Running...
2022-10-10T02:34:50.042 app[99f02c4f] sjc [info] proxy | Running...
2022-10-10T02:34:50.045 app[99f02c4f] sjc [info] exporter | Running...
2022-10-10T02:34:50.134 app[99f02c4f] sjc [info] exporter | INFO[0000] Starting Server: :9187 source="postgres_exporter.go:1837"
2022-10-10T02:34:50.151 app[99f02c4f] sjc [info] proxy | [WARNING] 282/023450 (553) : parsing [/fly/haproxy.cfg:38]: Missing LF on last line, file might have been truncated at position 96. This will become a hard error in HAProxy 2.3.
2022-10-10T02:34:50.168 app[99f02c4f] sjc [info] proxy | [NOTICE] 282/023450 (553) : New worker #1 (569) forked
2022-10-10T02:34:51.020 app[99f02c4f] sjc [info] checking stolon status
2022-10-10T02:34:52.238 app[99f02c4f] sjc [info] proxy | [WARNING] 282/023452 (569) : bk_db/pg1 changed its IP from (none) to fdaa:0:xx:a7b:xx:0:b268:2 by flydns/dns1.
2022-10-10T02:34:52.238 app[99f02c4f] sjc [info] proxy | [WARNING] 282/023452 (569) : Server bk_db/pg1 ('lax.xxxxx.internal') is UP/READY (resolves again).
2022-10-10T02:34:52.238 app[99f02c4f] sjc [info] proxy | [WARNING] 282/023452 (569) : Server bk_db/pg1 administratively READY thanks to valid DNS answer.
2022-10-10T02:34:52.593 app[99f02c4f] sjc [info] panic: error checking stolon status: cannot get cluster data: invalid character '\x02' in string literal
2022-10-10T02:34:52.593 app[99f02c4f] sjc [info] : exit status 1
2022-10-10T02:34:52.594 app[99f02c4f] sjc [info] goroutine 9 [running]:
2022-10-10T02:34:52.594 app[99f02c4f] sjc [info] main.main.func2(0xc0000ce000, 0xc000082a80)
2022-10-10T02:34:52.594 app[99f02c4f] sjc [info] /go/src/github.com/fly-examples/postgres-ha/cmd/start/main.go:81 +0x72c
2022-10-10T02:34:52.594 app[99f02c4f] sjc [info] created by main.main
2022-10-10T02:34:52.594 app[99f02c4f] sjc [info] /go/src/github.com/fly-examples/postgres-ha/cmd/start/main.go:72 +0x43b

It looks like this went into an out-of-memory crash loop, which ended up corrupting the stolon store.

I increased it to 512MB of memory and cleared the stolon state, and it's running just fine now. This DB looks like it probably needs 1GB+ of memory. I would also highly recommend running two instances if this is a DB you need maximum uptime on; it would have behaved better with a second instance.
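For the record, the resize and the second instance can be done with flyctl roughly like this (app name, volume size, and region are placeholders based on the log above):

    fly scale memory 512 -a my-postgres-app                            # or 1024+ given the workload
    fly volumes create pg_data --region sjc --size 10 -a my-postgres-app
    fly scale count 2 -a my-postgres-app                               # second instance attaches to the new volume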

@kurt Thanks!

Yes, a side project that got a bit of traction and now needs some attention. Added the second instance (Fly makes that soooo easy!) and will up the RAM shortly.

I think I’ve run into the same issue:

2023-01-11T11:37:53.235Z ERROR cmd/keeper.go:1041 error retrieving cluster data {"error": "invalid character '\u008d' after object key:value pair"}

I get the same error if I SSH onto the host and run stolonctl status or stolonctl spec. How do I clear the stolon state? Do I need to reinitialize the whole cluster?
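For context, I'm invoking stolonctl roughly like this after fly ssh console (the STOLONCTL_* values are my guesses at how the Fly image wires up its store, so treat them as assumptions):

    # stolon also accepts these as flags: --cluster-name, --store-backend, --store-endpoints
    export STOLONCTL_CLUSTER_NAME=my-postgres-app
    export STOLONCTL_STORE_BACKEND=consul              # assumption: Consul-backed store
    export STOLONCTL_STORE_ENDPOINTS=<consul endpoint> # assumption: whatever the image configures
    stolonctl status
    stolonctl spec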