postgres db down

One of our DBs in Sydney has started spitting out the following errors:

syd [info]keeper   | 2022-03-10T10:24:57.113Z	ERROR	cmd/keeper.go:720	cannot get configured pg parameters
syd [info]exporter | ERRO[0086] Error opening connection to database (postgresql://flypgadmin connection refused

I’ve tried restarting the DB, and I’ve also fully suspended and resumed it, but the issue persists.

The errors above have stopped, but the DB is still unreachable.

New error:

sentinel | 2022-03-10T10:30:42.700Z	ERROR	cmd/sentinel.go:1893	failed to get proxies info	{"error": "unexpected end of JSON input"}

fly status is showing 2 health checks passing, 1 critical.

Seeing a similar issue… I cannot connect to the postgres app. Here are the logs from the postgres app:

2022-03-10T15:18:03Z app[12940ef8] lax [info]proxy    | 2022-03-10T15:18:03.765Z        INFO    cmd/proxy.go:124        Starting proxying
2022-03-10T15:18:03Z app[12940ef8] lax [info]proxy    | 2022-03-10T15:18:03.767Z        INFO    cmd/proxy.go:268        master address  {"address": "[fdaa:0:266b:a7b:87:0:1c97:2]:5433"}
2022-03-10T15:18:04Z app[12940ef8] lax [info]keeper   | 2022-03-10T15:18:04.667Z        ERROR   cmd/keeper.go:720       cannot get configured pg parameters     {"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2022-03-10T15:18:04Z app[12940ef8] lax [info]keeper is healthy, db is healthy, role: master
2022-03-10T15:18:07Z app[12940ef8] lax [info]keeper   | 2022-03-10T15:18:07.168Z        ERROR   cmd/keeper.go:720       cannot get configured pg parameters     {"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2022-03-10T15:18:08Z app[12940ef8] lax [info]exporter | INFO[0056] Established new database connection to "fdaa:0:266b:a7b:87:0:1c97:2:5433".  source="postgres_exporter.go:970"
2022-03-10T15:18:09Z app[12940ef8] lax [info]exporter | ERRO[0057] Error opening connection to database (postgresql://flypgadmin:PASSWORD_REMOVED@[fdaa:0:266b:a7b:87:0:1c97:2]:5433/postgres?sslmode=disable): dial tcp [fdaa:0:266b:a7b:87:0:1c97:2]:5433: connect: connection refused  source="postgres_exporter.go:1658"
2022-03-10T15:18:09Z app[12940ef8] lax [info]keeper   | 2022-03-10T15:18:09.669Z        ERROR   cmd/keeper.go:720       cannot get configured pg parameters     {"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2022-03-10T15:18:09Z app[12940ef8] lax [info]error connecting to local postgres context deadline exceeded
2022-03-10T15:18:09Z app[12940ef8] lax [info]checking stolon status
2022-03-10T15:18:10Z app[12940ef8] lax [info]proxy    | 2022-03-10T15:18:10.638Z        INFO    cmd/proxy.go:304        check timeout timer fired
2022-03-10T15:18:10Z app[12940ef8] lax [info]proxy    | 2022-03-10T15:18:10.639Z        INFO    cmd/proxy.go:158        Stopping listening
2022-03-10T15:18:11Z app[12940ef8] lax [info]keeper   | 2022-03-10T15:18:11.375Z        INFO    cmd/keeper.go:1094      our db boot UID is different than the cluster data one, waiting for it to be updated    {"bootUUID": "4ee0eff1-f5bf-474b-a41e-d7ff55460b0b", "clusterBootUUID": "d7d98195-1335-47cd-9106-234cf8b6cc94"}
2022-03-10T15:18:12Z app[12940ef8] lax [info]keeper   | 2022-03-10T15:18:12.171Z        ERROR   cmd/keeper.go:720       cannot get configured pg parameters     {"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2022-03-10T15:18:13Z app[12940ef8] lax [info]proxy    | 2022-03-10T15:18:13.061Z        INFO    cmd/proxy.go:286        proxying to master address      {"address": "[fdaa:0:266b:a7b:87:0:1c97:2]:5433"}
2022-03-10T15:18:14Z app[12940ef8] lax [info]keeper   | 2022-03-10T15:18:14.672Z        ERROR   cmd/keeper.go:720       cannot get configured pg parameters     {"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2022-03-10T15:18:16Z app[12940ef8] lax [info]keeper is healthy, db is healthy, role: master
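
For anyone triaging a similar log dump, here is a rough sketch of how the stolon-style lines above could be grouped by component, level, and message, so that a repeated error such as `cannot get configured pg parameters` stands out. The regular expression and the `summarize` helper are illustrative assumptions based only on the line layout pasted above; they are not part of flyctl or stolon, and lines in other layouts (the exporter lines, the plain health messages) are simply skipped.

```python
import re
from collections import Counter

# Hypothetical pattern for stolon-style lines as pasted above, e.g.
#   ... [info]keeper   | 2022-03-10T15:18:04.667Z  ERROR  cmd/keeper.go:720  <message>  {"error": "..."}
LINE_RE = re.compile(
    r"\[info\](?P<component>\w+)\s*\|\s*"  # component name before the pipe
    r"\S+\s+"                              # inner timestamp
    r"(?P<level>[A-Z]+)\s+"                # log level (INFO, ERROR, ...)
    r"(?P<source>\S+\.go:\d+)\s+"          # Go source file and line
    r"(?P<message>.*?)"                    # message text
    r"(?:\s+\{.*\})?\s*$"                  # optional trailing JSON payload
)

def summarize(lines):
    """Count (component, level, message) triples for lines matching LINE_RE."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[(match["component"], match["level"], match["message"])] += 1
    return counts

if __name__ == "__main__":
    import sys
    # Usage (assumed): pipe saved log output into this script.
    for key, count in summarize(sys.stdin).most_common():
        print(count, *key)
```

Piping the output of `fly logs` into this script would then print the most frequent messages first.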

Hey there, I’m taking a look at this now.

@mikeb1

Looks like you’re running a very old version of our image.

You can see what version you’re on by running:

fly image show --app <app-name>

I would try updating and see if that addresses your issue.

@shawn Could you let me know which app you’re having problems with?

I’m seeing this:

2022-03-10T16:45:25Z app[215cf724] lhr [info]sentinel | 2022-03-10T16:45:25.814Z      ERROR   cmd/sentinel.go:102     election loop error     {"error": "Put \"https://consul-fra.fly-shared.net/v1/session/create?wait=5000ms\": EOF"}
2022-03-10T16:45:29Z app[215cf724] lhr [info]sentinel | 2022-03-10T16:45:29.637Z        ERROR   cmd/sentinel.go:1843    error retrieving cluster data   {"error": "Get \"https://consul-fra.fly-shared.net/v1/kv/my-app-ejpon178o369dgr4/my-app/clusterdata?consistent=&wait=5000ms\": EOF"}
2022-03-10T16:45:35Z app[215cf724] lhr [info]sentinel | 2022-03-10T16:45:35.815Z        INFO    cmd/sentinel.go:82      Trying to acquire sentinels leadership
2022-03-10T16:45:39Z app[215cf724] lhr [info]keeper   | 2022-03-10T16:45:39.248Z        ERROR   cmd/keeper.go:1041      error retrieving cluster data   {"error": "Get \"https://consul-fra.fly-shared.net/v1/kv/my-app-ejpon178o369dgr4/my-app/clusterdata?consistent=&wait=5000ms\": EOF"}

@scytale Since that region is lhr, it may be related to this:

I’m getting my requests timing out for an app that’s in three regions, but I’m closest to lhr so I guess my requests are always routed there. And failing.

@shaun

It shows:
Deployment Status
Registry = registry.fly.io
Repository =
Tag = deployment-1646088842
Version = N/A
Digest = sha256:3430fd5f7a97313bc242a4ce16e2c77e39bc4b7e63cff3f82927113e82a211f8
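
As a small aid for reading output like the above, the sketch below splits the `key = value` lines into a dictionary and lists the fields that come back empty or as `N/A`. Both helper names are hypothetical, and the parsing assumes only the layout pasted in this post; it does not say what an empty `Repository` or an `N/A` version means for a given app.

```python
def parse_image_show(text):
    """Split 'Key = value' lines into a dict; lines without '=' are ignored."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

def missing_fields(fields):
    """Return the names of fields that are empty or reported as 'N/A'."""
    return sorted(name for name, value in fields.items() if value in ("", "N/A"))

# Sample copied from the output pasted above.
output = """Deployment Status
Registry = registry.fly.io
Repository =
Tag = deployment-1646088842
Version = N/A
Digest = sha256:3430fd5f7a97313bc242a4ce16e2c77e39bc4b7e63cff3f82927113e82a211f8
"""

fields = parse_image_show(output)
print(missing_fields(fields))
```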

The log is different now, and the DB appears to be up, but I still cannot connect to it, and the dashboard won’t load either.

@shaun Looks like the DB is back (upgraded it) and I can get to it externally. Now trying to work around the sea packet loss issue… we moved the app out of sea but that’s having issues, so guessing it will come good once the packet loss resolves. Thanks for the help!

Did you find anything? It was back up for a bit but is down again now. I did an image update and got the same errors as before, plus some new ones:

syd [info]proxy    | [ALERT] 068/215147 (564) : backend 'bk_db' has no server available!
app[7540697e] syd [info]panic: error checking stolon status: cannot create kv store: Unexpected response code: 500 (No cluster leader)

It just keeps restarting now.

It’s back up now.

Any idea why image update is failing for me?

$ flyctl image update --app production-db
? Update `production-db` from flyio/postgres:13.5 v0.0.9 to flyio/postgres:13.5 v0.0.16? Yes
Release v14 created

You can detach the terminal anytime without stopping the update
==> Monitoring deployment

 2 desired, 1 placed, 0 healthy, 1 unhealthy [health checks: 3 total]
Failed Instances

Failure #1

Instance
ID      	PROCESS	VERSION	REGION	DESIRED	STATUS	HEALTH CHECKS	RESTARTS	CREATED
628d8681	       	14     	iad   	run    	failed	3 total      	0       	20s ago

Recent Events
TIMESTAMP           	TYPE           	MESSAGE                                           
2022-03-10T22:08:50Z	Received       	Task received by client                          	
2022-03-10T22:09:07Z	Task Setup     	Building Task Directory                          	
2022-03-10T22:09:09Z	Started        	Task started by client                           	
2022-03-10T22:09:11Z	Terminated     	Exit Code: 2                                     	
2022-03-10T22:09:11Z	Not Restarting 	Policy allows no restarts                        	
2022-03-10T22:09:11Z	Alloc Unhealthy	Unhealthy because of failed task                 	
2022-03-10T22:09:12Z	Killing        	Sent interrupt. Waiting 5m0s before force killing	

2022-03-10T22:09:07Z   [info]Starting instance
2022-03-10T22:09:07Z   [info]Configuring virtual machine
2022-03-10T22:09:07Z   [info]Unpacking image
2022-03-10T22:09:08Z   [info]Setting up volume 'pg_data'
2022-03-10T22:09:09Z   [info]Configuring firecracker
2022-03-10T22:09:09Z   [info]Starting virtual machine
2022-03-10T22:09:09Z   [info]Starting init (commit: 0c50bff)...
2022-03-10T22:09:09Z   [info]Preparing to run: `docker-entrypoint.sh start` as root
2022-03-10T22:09:09Z   [info]2022/03/10 22:09:09 listening on [fdaa:0:309a:a7b:ab9:0:30e5:2]:22 (DNS: [fdaa::3]:53)
2022-03-10T22:09:10Z   [info]Main child exited normally with code: 2
2022-03-10T22:09:10Z   [info]Starting clean up.
2022-03-10T22:09:10Z   [info]Umounting /dev/vdc from /data
--> v14 failed - Failed due to unhealthy allocations and deploying as v15 

--> Troubleshooting guide at https://fly.io/docs/getting-started/troubleshooting/
Error abort
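
When watching a rollout like this from a script, the allocation summary line is the quickest signal. The sketch below pulls the counts out of a line such as `2 desired, 1 placed, 0 healthy, 1 unhealthy`. The helper names are hypothetical, and the rule "healthy only when every desired instance is healthy" is an assumption made for illustration, not flyctl's own definition of a successful deploy.

```python
import re

# Matches pairs like "2 desired" or "1 unhealthy" in the summary line.
SUMMARY_RE = re.compile(r"(\d+)\s+(desired|placed|healthy|unhealthy)\b")

def parse_alloc_summary(line):
    """Return the counts from the summary line, e.g. {'desired': 2, ...}."""
    return {name: int(count) for count, name in SUMMARY_RE.findall(line)}

def looks_healthy(summary):
    # Assumption: the rollout counts as healthy only when at least one
    # instance is desired and all desired instances report healthy.
    desired = summary.get("desired", 0)
    return desired > 0 and summary.get("healthy", 0) == desired

# Summary line copied from the output pasted above.
summary = parse_alloc_summary(
    " 2 desired, 1 placed, 0 healthy, 1 unhealthy [health checks: 3 total]"
)
print(summary, looks_healthy(summary))
```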

@enaia Looks like it’s good now. Sorry for the trouble.

@shaun I think the latest deployment failed as well.

@shaun @kurt - we only have one running instance. Can you look into this?

Hey everyone we’re totally down. Can someone help me?

I’ve been trying to get some help since yesterday. Can someone look into the issue we’re experiencing? See: postgres db down - #16 by enaia

This should be good now. We’ve moved you off of an old etcd coordinator and onto a Consul coordinator. It seems like the image upgrade stopped the etcd setup from working properly.

Thanks @kurt. Any update on email support?