Postgres looping with panic: error checking stolon status

Yesterday my webapp stopped working and it looks like its because the db calls are failing, investigating further I see this

 2023-01-20T19:21:17.493 app[73287355cdd285] yyz [info] sentinel | 2023-01-20T19:21:17.492Z ERROR	cmd/sentinel.go:1852 error retrieving cluster data {"error": "invalid character '\\x1f' in string literal"}

2023-01-20T19:21:17.789 app[73287355cdd285] yyz [info] checking stolon status

2023-01-20T19:21:18.003 app[73287355cdd285] yyz [info] panic: error checking stolon status: cannot get cluster data: invalid character '\x1f' in string literal

2023-01-20T19:21:18.003 app[73287355cdd285] yyz [info] : exit status 1

2023-01-20T19:21:18.003 app[73287355cdd285] yyz [info] goroutine 9 [running]:

2023-01-20T19:21:18.003 app[73287355cdd285] yyz [info] main.main.func2(0xc0000d0000, 0xc000084a00)

2023-01-20T19:21:18.003 app[73287355cdd285] yyz [info] /go/src/github.com/fly-examples/postgres-ha/cmd/start/main.go:81 +0x72c

2023-01-20T19:21:18.003 app[73287355cdd285] yyz [info] created by main.main

2023-01-20T19:21:18.003 app[73287355cdd285] yyz [info] /go/src/github.com/fly-examples/postgres-ha/cmd/start/main.go:72 +0x43b

2023-01-20T19:21:23.203 app[73287355cdd285] yyz [info] cluster spec filename /fly/cluster-spec.json

2023-01-20T19:21:23.204 app[73287355cdd285] yyz [info] system memory: 256mb vcpu count: 1

2023-01-20T19:21:23.204 app[73287355cdd285] yyz [info] {

2023-01-20T19:21:23.204 app[73287355cdd285] yyz [info] "initMode": "existing",

2023-01-20T19:21:23.204 app[73287355cdd285] yyz [info] "existingConfig": {

2023-01-20T19:21:23.204 app[73287355cdd285] yyz [info] "keeperUID": "8a0f1f0552"

2023-01-20T19:21:23.204 app[73287355cdd285] yyz [info] },

2023-01-20T19:21:23.204 app[73287355cdd285] yyz [info] "pgParameters": {

2023-01-20T19:21:23.204 app[73287355cdd285] yyz [info] "archive_command": "if [ $ENABLE_WALG ]; then /usr/local/bin/wal-g wal-push \"%p\"; fi",

2023-01-20T19:21:23.204 app[73287355cdd285] yyz [info] "archive_mode": "on",

2023-01-20T19:21:23.204 app[73287355cdd285] yyz [info] "archive_timeout": "60",

2023-01-20T19:21:23.204 app[73287355cdd285] yyz [info] "effective_cache_size": "192MB",

2023-01-20T19:21:23.204 app[73287355cdd285] yyz [info] "effective_io_concurrency": "200",

2023-01-20T19:21:23.204 app[73287355cdd285] yyz [info] "maintenance_work_mem": "64MB",

2023-01-20T19:21:23.204 app[73287355cdd285] yyz [info] "max_connections": "300",

2023-01-20T19:21:23.204 app[73287355cdd285] yyz [info] "max_parallel_workers": "8",

2023-01-20T19:21:23.204 app[73287355cdd285] yyz [info] "max_parallel_workers_per_gather": "2",

2023-01-20T19:21:23.204 app[73287355cdd285] yyz [info] "max_worker_processes": "8",

2023-01-20T19:21:23.204 app[73287355cdd285] yyz [info] "random_page_cost": "1.1",

2023-01-20T19:21:23.204 app[73287355cdd285] yyz [info] "shared_buffers": "64MB",

2023-01-20T19:21:23.204 app[73287355cdd285] yyz [info] "wal_compression": "on",

2023-01-20T19:21:23.204 app[73287355cdd285] yyz [info] "work_mem": "4MB"

2023-01-20T19:21:23.204 app[73287355cdd285] yyz [info] },

2023-01-20T19:21:23.204 app[73287355cdd285] yyz [info] "maxStandbysPerSender": 50,

2023-01-20T19:21:23.204 app[73287355cdd285] yyz [info] "deadKeeperRemovalInterval": "1h"

2023-01-20T19:21:23.204 app[73287355cdd285] yyz [info] }

2023-01-20T19:21:23.204 app[73287355cdd285] yyz [info] generated new config

2023-01-20T19:21:23.205 app[73287355cdd285] yyz [info] keeper | Running...

2023-01-20T19:21:23.208 app[73287355cdd285] yyz [info] sentinel | Running...

2023-01-20T19:21:23.208 app[73287355cdd285] yyz [info] proxy | Running...

2023-01-20T19:21:23.212 app[73287355cdd285] yyz [info] exporter | Running...

2023-01-20T19:21:23.370 app[73287355cdd285] yyz [info] exporter | INFO[0000] Starting Server: :9187 source="postgres_exporter.go:1837"

2023-01-20T19:21:23.393 app[73287355cdd285] yyz [info] proxy | [WARNING] 019/192123 (538) : parsing [/fly/haproxy.cfg:38]: Missing LF on last line, file might have been truncated at position 96. This will become a hard error in HAProxy 2.3.

2023-01-20T19:21:23.412 app[73287355cdd285] yyz [info] proxy | [NOTICE] 019/192123 (538) : New worker #1 (562) forked

2023-01-20T19:21:23.419 app[73287355cdd285] yyz [info] proxy | [WARNING] 019/192123 (562) : bk_db/pg1 changed its IP from (none) to fdaa:0:9dba:a7b:8a0f:1839:2b2d:2 by flydns/dns1.

2023-01-20T19:21:23.419 app[73287355cdd285] yyz [info] proxy | [WARNING] 019/192123 (562) : Server bk_db/pg1 ('yyz.stuffy-site-pg.internal') is UP/READY (resolves again).

2023-01-20T19:21:23.419 app[73287355cdd285] yyz [info] proxy | [WARNING] 019/192123 (562) : Server bk_db/pg1 administratively READY thanks to valid DNS answer.

2023-01-20T19:21:23.811 app[73287355cdd285] yyz [info] keeper | 2023-01-20T19:21:23.811Z ERROR	cmd/keeper.go:811 error retrieving cluster data {"error": "invalid character '\\x1f' in string literal"}

2023-01-20T19:21:23.822 app[73287355cdd285] yyz [info] keeper | 2023-01-20T19:21:23.821Z ERROR	cmd/keeper.go:719 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}

2023-01-20T19:21:23.839 app[73287355cdd285] yyz [info] sentinel | 2023-01-20T19:21:23.836Z ERROR	cmd/sentinel.go:1852 error retrieving cluster data {"error": "invalid character '\\x1f' in string literal"}

2023-01-20T19:21:23.839 app[73287355cdd285] yyz [info] keeper | 2023-01-20T19:21:23.839Z ERROR	cmd/keeper.go:1041 error retrieving cluster data {"error": "invalid character '\\x1f' in string literal"}

2023-01-20T19:21:23.992 app[73287355cdd285] yyz [info] exporter | INFO[0000] Established new database connection to "fdaa:0:9dba:a7b:8a0f:1839:2b2d:2:5433". source="postgres_exporter.go:970"

2023-01-20T19:21:24.205 app[73287355cdd285] yyz [info] checking stolon status

2023-01-20T19:21:24.391 app[73287355cdd285] yyz [info] panic: error checking stolon status: cannot get cluster data: invalid character '\x1f' in string literal

2023-01-20T19:21:24.391 app[73287355cdd285] yyz [info] : exit status 1

2023-01-20T19:21:24.391 app[73287355cdd285] yyz [info] goroutine 9 [running]:

2023-01-20T19:21:24.391 app[73287355cdd285] yyz [info] main.main.func2(0xc0000d0000, 0xc000084a00)

2023-01-20T19:21:24.391 app[73287355cdd285] yyz [info] /go/src/github.com/fly-examples/postgres-ha/cmd/start/main.go:81 +0x72c

2023-01-20T19:21:24.391 app[73287355cdd285] yyz [info] created by main.main

2023-01-20T19:21:24.391 app[73287355cdd285] yyz [info] /go/src/github.com/fly-examples/postgres-ha/cmd/start/main.go:72 +0x43b

2023-01-20T19:21:29.578 app[73287355cdd285] yyz [info] cluster spec filename /fly/cluster-spec.json

2023-01-20T19:21:29.579 app[73287355cdd285] yyz [info] system memory: 256mb vcpu count: 1

2023-01-20T19:21:29.579 app[73287355cdd285] yyz [info] {

2023-01-20T19:21:29.579 app[73287355cdd285] yyz [info] "initMode": "existing",

2023-01-20T19:21:29.579 app[73287355cdd285] yyz [info] "existingConfig": {

2023-01-20T19:21:29.579 app[73287355cdd285] yyz [info] "keeperUID": "8a0f1f0552"

2023-01-20T19:21:29.579 app[73287355cdd285] yyz [info] },

2023-01-20T19:21:29.579 app[73287355cdd285] yyz [info] "pgParameters": {

2023-01-20T19:21:29.579 app[73287355cdd285] yyz [info] "archive_command": "if [ $ENABLE_WALG ]; then /usr/local/bin/wal-g wal-push \"%p\"; fi",

2023-01-20T19:21:29.579 app[73287355cdd285] yyz [info] "archive_mode": "on",

2023-01-20T19:21:29.579 app[73287355cdd285] yyz [info] "archive_timeout": "60",

2023-01-20T19:21:29.579 app[73287355cdd285] yyz [info] "effective_cache_size": "192MB",

2023-01-20T19:21:29.579 app[73287355cdd285] yyz [info] "effective_io_concurrency": "200",

2023-01-20T19:21:29.579 app[73287355cdd285] yyz [info] "maintenance_work_mem": "64MB",

2023-01-20T19:21:29.579 app[73287355cdd285] yyz [info] "max_connections": "300",

2023-01-20T19:21:29.579 app[73287355cdd285] yyz [info] "max_parallel_workers": "8",

2023-01-20T19:21:29.579 app[73287355cdd285] yyz [info] "max_parallel_workers_per_gather": "2",

2023-01-20T19:21:29.579 app[73287355cdd285] yyz [info] "max_worker_processes": "8",

2023-01-20T19:21:29.579 app[73287355cdd285] yyz [info] "random_page_cost": "1.1",

2023-01-20T19:21:29.579 app[73287355cdd285] yyz [info] "shared_buffers": "64MB",

2023-01-20T19:21:29.579 app[73287355cdd285] yyz [info] "wal_compression": "on",

2023-01-20T19:21:29.579 app[73287355cdd285] yyz [info] "work_mem": "4MB"

2023-01-20T19:21:29.579 app[73287355cdd285] yyz [info] },

2023-01-20T19:21:29.579 app[73287355cdd285] yyz [info] "maxStandbysPerSender": 50,

2023-01-20T19:21:29.579 app[73287355cdd285] yyz [info] "deadKeeperRemovalInterval": "1h"

2023-01-20T19:21:29.579 app[73287355cdd285] yyz [info] }

2023-01-20T19:21:29.579 app[73287355cdd285] yyz [info] generated new config

2023-01-20T19:21:29.580 app[73287355cdd285] yyz [info] keeper | Running...

2023-01-20T19:21:29.583 app[73287355cdd285] yyz [info] sentinel | Running...

2023-01-20T19:21:29.583 app[73287355cdd285] yyz [info] proxy | Running...

2023-01-20T19:21:29.586 app[73287355cdd285] yyz [info] exporter | Running...

2023-01-20T19:21:29.737 app[73287355cdd285] yyz [info] exporter | INFO[0000] Starting Server: :9187 source="postgres_exporter.go:1837"

2023-01-20T19:21:29.769 app[73287355cdd285] yyz [info] proxy | [WARNING] 019/192129 (539) : parsing [/fly/haproxy.cfg:38]: Missing LF on last line, file might have been truncated at position 96. This will become a hard error in HAProxy 2.3.

2023-01-20T19:21:29.788 app[73287355cdd285] yyz [info] proxy | [NOTICE] 019/192129 (539) : New worker #1 (563) forked

2023-01-20T19:21:29.791 app[73287355cdd285] yyz [info] proxy | [WARNING] 019/192129 (563) : bk_db/pg1 changed its IP from (none) to fdaa:0:9dba:a7b:8a0f:1839:2b2d:2 by flydns/dns1.

2023-01-20T19:21:29.791 app[73287355cdd285] yyz [info] proxy | [WARNING] 019/192129 (563) : Server bk_db/pg1 ('yyz.stuffy-site-pg.internal') is UP/READY (resolves again).

2023-01-20T19:21:29.791 app[73287355cdd285] yyz [info] proxy | [WARNING] 019/192129 (563) : Server bk_db/pg1 administratively READY thanks to valid DNS answer.

2023-01-20T19:21:30.012 app[73287355cdd285] yyz [info] keeper | 2023-01-20T19:21:30.010Z ERROR	cmd/keeper.go:811 error retrieving cluster data {"error": "invalid character '\\x1f' in string literal"}

2023-01-20T19:21:30.024 app[73287355cdd285] yyz [info] keeper | 2023-01-20T19:21:30.022Z ERROR	cmd/keeper.go:719 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}

2023-01-20T19:21:30.049 app[73287355cdd285] yyz [info] sentinel | 2023-01-20T19:21:30.048Z ERROR	cmd/sentinel.go:1852 error retrieving cluster data {"error": "invalid character '\\x1f' in string literal"}

2023-01-20T19:21:30.059 app[73287355cdd285] yyz [info] keeper | 2023-01-20T19:21:30.059Z ERROR	cmd/keeper.go:1041 error retrieving cluster data {"error": "invalid character '\\x1f' in string literal"}

2023-01-20T19:21:30.580 app[73287355cdd285] yyz [info] checking stolon status

2023-01-20T19:21:30.780 app[73287355cdd285] yyz [info] panic: error checking stolon status: cannot get cluster data: invalid character '\x1f' in string literal

2023-01-20T19:21:30.780 app[73287355cdd285] yyz [info] : exit status 1

2023-01-20T19:21:30.780 app[73287355cdd285] yyz [info] goroutine 9 [running]:

2023-01-20T19:21:30.780 app[73287355cdd285] yyz [info] main.main.func2(0xc0000d0000, 0xc000084a00)

2023-01-20T19:21:30.780 app[73287355cdd285] yyz [info] /go/src/github.com/fly-examples/postgres-ha/cmd/start/main.go:81 +0x72c

2023-01-20T19:21:30.780 app[73287355cdd285] yyz [info] created by main.main

2023-01-20T19:21:30.780 app[73287355cdd285] yyz [info] /go/src/github.com/fly-examples/postgres-ha/cmd/start/main.go:72 +0x43b 

I also see that the admin bot deployed it yesterday?
image

Hello Steven!

The Fly Admin Bot events are from our migration scripts that moved your PG app to Fly Apps V2. We have started with the free postgres apps for now and yours was one of them.

That said, the error started before the migration and was caused by corrupted Stolon ClusterData stored in the Consul. I managed to restore your postgres app by reconstructing the clusterdata and updating its value in Consul.

There was no data loss in the process.
Cheers!

1 Like

@dangra, I’m affected by this. I have a database that has been running for six months with no issues, and I’m not a free user, I pay more than $16 per month on average. My database was updated to v2, and after running every command I’ve found in the forums, I haven’t been able to get it to come up again. It is stuck in stolon status. Also fly pg restart shows “Error no active leader found”. I can’t access my application at all.

In the future will we receive notice before the upgrades? Or did we receive one that I have missed?

Hey @khalid,

Can you let us know what database was impacted?

Thank you!