Hi. Last night my LiveView app lost its connection to the Postgres DB and is now stuck in a perpetual “pending” state, even after restarting the Postgres instance. How can I fix this?
Can you take a look at your app logs to identify any specific issues?
fly logs -a your-app-name
Phoenix is usually very helpful about letting users know what’s happening.
Feel free to paste it here if there’s no sensitive data
Thanks for the quick response.
This is a relevant excerpt from the Elixir app log:
2022-04-09T20:29:57Z app[ed47e1cc] ams [info]20:29:57.665 [error] Postgrex.Protocol (#PID<0.1910.2>) failed to connect: ** (DBConnection.ConnectionError) tcp connect (intersplat-db.internal:5432): host is unreachable - :ehostunreach
2022-04-09T20:35:20Z app[ed47e1cc] ams [info]20:35:20.043 [error] Postgrex.Protocol (#PID<0.1907.2>) failed to connect: ** (DBConnection.ConnectionError) tcp connect (intersplat-db.internal:5432): timeout
My database (intersplat-db) does not return anything with the log command.
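For context, the Repo builds its connection from the DATABASE_URL secret in config/runtime.exs; what I have is essentially the stock Phoenix-on-Fly shape, sketched below with placeholder app/module names:

# config/runtime.exs -- placeholder app/module names, not the real app
import Config

if config_env() == :prod do
  database_url =
    System.get_env("DATABASE_URL") ||
      raise "environment variable DATABASE_URL is missing"

  config :my_app, MyApp.Repo,
    url: database_url,
    # *.internal hostnames resolve over Fly's IPv6 private network,
    # so Postgrex needs the :inet6 socket option
    socket_options: [:inet6],
    pool_size: String.to_integer(System.get_env("POOL_SIZE") || "10")
end

So the intersplat-db.internal host in those errors comes from that URL rather than anything hard-coded in the app.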
Hello! It looks like your DB crashed. It’s running an older version of our Postgres setup that uses etcd as a backing store instead of Consul. I just fixed that. Can you see if it’s working now?
When you get a chance, it’s worth running fly image show -a intersplat-db and then fly image update -a intersplat-db.
I ran the commands and the update deployed successfully. However, my Elixir instance is now having trouble connecting to the DB: FATAL: password authentication failed for user "dark_bush_7960_g4k3rxg1e0x0qznj". Did something change configuration-wise for Elixir apps?
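The failing user comes from the DATABASE_URL secret the app was attached with, so as a quick sanity check (just a sketch), I can inspect which credentials the running app actually sees from a remote IEx session:

System.get_env("DATABASE_URL")
|> URI.parse()
|> Map.take([:host, :port, :userinfo])

If the userinfo there doesn’t match what the recovered database expects, that would explain the authentication failure.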
Looking! That’s definitely a problem.
Sorry to jump on norseboat’s post, but I’m having exactly the same issue here. My DB crashed at some point today. I just tried to update the image as per this suggestion and it didn’t go well:
2022-04-09T21:36:30Z [info]proxy | [WARNING] 098/213630 (565) : bk_db/pg1 changed its IP from (none) to fdaa:0:3118:a7b:a9a:0:386c:2 by flydns/dns1.
2022-04-09T21:36:30Z [info]proxy | [WARNING] 098/213630 (565) : Server bk_db/pg1 ('lhr.chrx-db1.internal') is UP/READY (resolves again).
2022-04-09T21:36:30Z [info]proxy | [WARNING] 098/213630 (565) : Server bk_db/pg1 administratively READY thanks to valid DNS answer.
2022-04-09T21:36:31Z [info]exporter | ERRO[0002] Error opening connection to database (postgresql://flypgadmin:PASSWORD_REMOVED@[fdaa:0:3118:a7b:a9a:0:386c:2]:5433/postgres?sslmode=disable): dial tcp [fdaa:0:3118:a7b:a9a:0:386c:2]:5433: connect: connection refused source="postgres_exporter.go:1658"
2022-04-09T21:36:34Z [info]exporter | INFO[0006] Established new database connection to "fdaa:0:3118:a7b:a9a:0:386c:2:5433". source="postgres_exporter.go:970"
2022-04-09T21:36:35Z [info]exporter | ERRO[0007] Error opening connection to database (postgresql://flypgadmin:PASSWORD_REMOVED@[fdaa:0:3118:a7b:a9a:0:386c:2]:5433/postgres?sslmode=disable): dial tcp [fdaa:0:3118:a7b:a9a:0:386c:2]:5433: connect: connection refused source="postgres_exporter.go:1658"
2022-04-09T21:36:37Z [info]proxy | [WARNING] 098/213637 (565) : Server bk_db/pg1 is DOWN, reason: Layer7 timeout, check duration: 5000ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2022-04-09T21:36:37Z [info]proxy | [NOTICE] 098/213637 (565) : haproxy version is 2.2.9-2+deb11u3
2022-04-09T21:36:37Z [info]proxy | [NOTICE] 098/213637 (565) : path to executable is /usr/sbin/haproxy
2022-04-09T21:36:37Z [info]proxy | [ALERT] 098/213637 (565) : backend 'bk_db' has no server available!
2022-04-09T21:36:42Z [info]keeper | {"level":"warn","ts":"2022-04-09T21:36:42.347Z","logger":"etcd-client","caller":"v3@v3.5.0/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0002bddc0/#initially=[etcd-na.fly-shared.net:443]","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
2022-04-09T21:36:42Z [info]sentinel | {"level":"warn","ts":"2022-04-09T21:36:42.349Z","logger":"etcd-client","caller":"v3@v3.5.0/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0002bfdc0/#initially=[etcd-na.fly-shared.net:443]","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
2022-04-09T21:36:42Z [info]sentinel | 2022-04-09T21:36:42.349Z FATAL cmd/sentinel.go:2021 cannot create sentinel: cannot create store: cannot create kv store: etcdserver: request timed out
2022-04-09T21:36:42Z [info]keeper | 2022-04-09T21:36:42.351Z FATAL cmd/keeper.go:2118 cannot create keeper: cannot create store: cannot create kv store: etcdserver: request timed out
2022-04-09T21:36:42Z [info]panic: error checking stolon status: {"level":"warn","ts":"2022-04-09T21:36:42.348Z","logger":"etcd-client","caller":"v3@v3.5.0/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0004281c0/#initially=[etcd-na.fly-shared.net:443]","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
2022-04-09T21:36:42Z [info]cannot create kv store: etcdserver: request timed out
2022-04-09T21:36:42Z [info]: exit status 1
2022-04-09T21:36:42Z [info]goroutine 9 [running]:
2022-04-09T21:36:42Z [info]main.main.func2(0xc0000d0000, 0xc000075710)
2022-04-09T21:36:42Z [info] /go/src/github.com/fly-examples/postgres-ha/cmd/start/main.go:81 +0x72c
2022-04-09T21:36:42Z [info]created by main.main
2022-04-09T21:36:42Z [info] /go/src/github.com/fly-examples/postgres-ha/cmd/start/main.go:72 +0x43b
2022-04-09T21:36:42Z [info]Main child exited normally with code: 2
2022-04-09T21:36:42Z [info]Reaped child process with pid: 541 and signal: SIGKILL, core dumped? false
2022-04-09T21:36:42Z [info]Reaped child process with pid: 544 and signal: SIGKILL, core dumped? false
2022-04-09T21:36:42Z [info]Reaped child process with pid: 538, exit code: 1
2022-04-09T21:36:42Z [info]Reaped child process with pid: 536, exit code: 1
2022-04-09T21:36:42Z [info]Reaped child process with pid: 565, exit code: 1
2022-04-09T21:36:42Z [info]Starting clean up.
2022-04-09T21:36:42Z [info]Umounting /dev/vdc from /data
The database is chrx-db1.
@norseboat there is something up with replication on this DB. We have it running properly with a single node, but adding replicas is causing problems. We’re going to keep working on it, but you should be good to go right now.
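A quick way to confirm from the app side, purely as a sketch with a placeholder repo module, is to run a trivial query from a remote IEx session on the app instance:

Ecto.Adapters.SQL.query!(MyApp.Repo, "SELECT 1")

If that returns a %Postgrex.Result{} instead of raising, the connection path is healthy again.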
@kurt Thank you so much! It seems to be working as intended now. Excellent service, as always.
@aaronrussell I recovered your DB and moved it off the etcd coordinator we were using. It had the same issue and should be much more reliable now.
thank you