I am running a simple Elixir app with Oban, so Oban is periodically checking Postgres for new jobs. However, every 1-2 days Postgres seems to be down for 2-3 minutes in FRA. The log says:
Postgrex.Protocol (#PID<0.2810.0>) failed to connect: ** (DBConnection.ConnectionError) tcp recv: closed
It’s not critical, but I am wondering what the reason for that is.
Will you run fly status the next time that happens and see whether the instances are passing health checks? That error looks like the proxy that runs alongside Postgres either isn't accepting connections or can't connect to the leader.
We just shipped some tweaks to the Consul service it's (probably) using for leader detection, which should help, so we might've accidentally fixed this for you.
One thing we've been seeing is fly status showing a VM with an rpc error. These nodes seem to be wedged, and if your app tries to connect to one of them it'll get errors like this. The fix, right now, is to run fly vm stop <id> on the offending node.
This is something we can automate, but it’s going to take a bit.
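If it helps, the manual recovery looks roughly like this (a sketch; the app name and instance ID are placeholders for your own Postgres app):

# List the instances of your Postgres app and look for one that's
# failing health checks or showing an rpc error
fly status -a <your-pg-app>

# Stop the wedged VM; the platform will restart it and it should
# rejoin the cluster cleanly
fly vm stop <instance-id> -a <your-pg-app>

# Confirm the health checks are passing again
fly status -a <your-pg-app>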
So I finally got this error again while I was at a computer:
Instances
ID VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
9f772526 22 fra run running (leader) 3 total, 2 passing, 1 critical 5 2021-08-12T00:38:17Z
In the logs:
2021-08-19T06:21:16.407433730Z app[68efc1bf] fra [info] proxy | 2021-08-19T06:21:16.404Z INFO cmd/proxy.go:304 check timeout timer fired
2021-08-19T06:21:20.460734231Z app[68efc1bf] fra [info] keeper | 2021-08-19T06:21:20.459Z INFO cmd/keeper.go:1505 our db requested role is master
2021-08-19T06:21:20.462585431Z app[68efc1bf] fra [info] keeper | 2021-08-19T06:21:20.461Z INFO cmd/keeper.go:1543 already master
2021-08-19T06:21:20.479908062Z app[68efc1bf] fra [info] keeper | 2021-08-19T06:21:20.478Z INFO cmd/keeper.go:1676 postgres parameters not changed
2021-08-19T06:21:20.481352544Z app[68efc1bf] fra [info] keeper | 2021-08-19T06:21:20.480Z INFO cmd/keeper.go:1703 postgres hba entries not changed
2021-08-19T06:21:25.061008070Z app[68efc1bf] fra [info] proxy | 2021-08-19T06:21:25.057Z INFO cmd/proxy.go:124 Starting proxying
2021-08-19T06:21:25.066124161Z app[68efc1bf] fra [info] proxy | 2021-08-19T06:21:25.063Z INFO cmd/proxy.go:268 master address {"address": "[fdaa:0:2d8e:a7b:67:0:2457:2]:5433"}
2021-08-19T06:21:25.705055024Z app[68efc1bf] fra [info] keeper | 2021-08-19T06:21:25.703Z INFO cmd/keeper.go:1505 our db requested role is master
2021-08-19T06:21:25.707542410Z app[68efc1bf] fra [info] keeper | 2021-08-19T06:21:25.706Z INFO cmd/keeper.go:1543 already master
2021-08-19T06:21:25.726919569Z app[68efc1bf] fra [info] keeper | 2021-08-19T06:21:25.725Z INFO cmd/keeper.go:1676 postgres parameters not changed
2021-08-19T06:21:25.728865648Z app[68efc1bf] fra [info] keeper | 2021-08-19T06:21:25.727Z INFO cmd/keeper.go:1703 postgres hba entries not changed
2021-08-19T06:21:31.138967284Z app[68efc1bf] fra [info] keeper | 2021-08-19T06:21:31.137Z INFO cmd/keeper.go:1505 our db requested role is master
2021-08-19T06:21:31.140955713Z app[68efc1bf] fra [info] keeper | 2021-08-19T06:21:31.139Z INFO cmd/keeper.go:1543 already master
2021-08-19T06:21:31.161345707Z app[68efc1bf] fra [info] keeper | 2021-08-19T06:21:31.159Z INFO cmd/keeper.go:1676 postgres parameters not changed
2021-08-19T06:21:31.162889438Z app[68efc1bf] fra [info] keeper | 2021-08-19T06:21:31.161Z INFO cmd/keeper.go:1703 postgres hba entries not changed
2021-08-19T06:21:34.311606644Z app[68efc1bf] fra [info] sentinel | 2021-08-19T06:21:34.308Z ERROR cmd/sentinel.go:1893 failed to get proxies info {"error": "unexpected end of JSON input"}
2021-08-19T06:21:36.315402335Z app[68efc1bf] fra [info] keeper | 2021-08-19T06:21:36.313Z INFO cmd/keeper.go:1505 our db requested role is master
2021-08-19T06:21:36.317502583Z app[68efc1bf] fra [info] keeper | 2021-08-19T06:21:36.316Z INFO cmd/keeper.go:1543 already master
2021-08-19T06:21:36.337821387Z app[68efc1bf] fra [info] keeper | 2021-08-19T06:21:36.336Z INFO cmd/keeper.go:1676 postgres parameters not changed
2021-08-19T06:21:36.339462958Z app[68efc1bf] fra [info] keeper | 2021-08-19T06:21:36.338Z INFO cmd/keeper.go:1703 postgres hba entries not changed
2021-08-19T06:21:39.853021691Z app[68efc1bf] fra [info] sentinel | 2021-08-19T06:21:39.849Z ERROR cmd/sentinel.go:1893 failed to get proxies info {"error": "unexpected end of JSON input"}
2021-08-19T06:21:41.470153280Z app[68efc1bf] fra [info] keeper | 2021-08-19T06:21:41.468Z INFO cmd/keeper.go:1505 our db requested role is master
2021-08-19T06:21:41.472015000Z app[68efc1bf] fra [info] keeper | 2021-08-19T06:21:41.470Z INFO cmd/keeper.go:1543 already master
2021-08-19T06:21:41.488493235Z app[68efc1bf] fra [info] keeper | 2021-08-19T06:21:41.487Z INFO cmd/keeper.go:1676 postgres parameters not changed
2021-08-19T06:21:41.490187186Z app[68efc1bf] fra [info] keeper | 2021-08-19T06:21:41.488Z INFO cmd/keeper.go:1703 postgres hba entries not changed
2021-08-19T06:21:45.462944078Z app[68efc1bf] fra [info] sentinel | 2021-08-19T06:21:45.459Z ERROR cmd/sentinel.go:1893 failed to get proxies info {"error": "unexpected end of JSON input"}
Actually, this problem has been ongoing since around 5:30 UTC…
Can I bump this? I have a Phoenix application and am seeing a similar thing. The DB goes down on a regular basis (every 1-2 days). It seems to self-heal, but I've been seeking feedback on an idea and this is biting me hard.
I sent it to a friend and he reported it down at 20:55 (BST). When I went to check this morning the DB was up, but the instances had been created 12 hours ago. (The app has been deployed for far longer. I manually killed the instances the first time, so at least there is progress.)
@tjcberman Due to an Etcd bug pertaining to auth token renewals, some PG clusters were getting thrown into a state where Stolon was no longer able to authenticate. Unfortunately, your PG was one of them, and we had to rebuild the VMs yesterday to get them back in order.
The underlying bug is still being worked out here:
In the meantime, we transitioned our Etcd to use stateless auth tokens (JWT), which lets us work around the issue.
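(For anyone curious what "stateless auth tokens" means here: etcd can be switched from its default server-side "simple" tokens to self-contained JWTs via its --auth-token flag. The lines below are purely illustrative, not our exact configuration, and the key paths are placeholders.)

# Default: "simple" tokens, whose validity depends on server-side token state
etcd --auth-token simple

# Stateless JWT tokens: each token is signed and self-contained, so clients
# like Stolon don't depend on server-side token state across renewals
# (placeholder key paths)
etcd --auth-token jwt,pub-key=/etc/etcd/jwt.pub,priv-key=/etc/etcd/jwt.key,sign-method=RS256,ttl=10m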
It's no problem (yet), as I'm in early user testing. I guess the best thing is to keep an eye on the PR for a full fix? (I'm reluctant to send it out as-is, given the reliability issues.)
Thanks for all the good work. I'm impressed with Fly, and once this is sorted I'll be even more so.
We’re seeing the same issue. Two different PostgreSQL clusters, running on different instance types, both in LHR. They seem to fail every week or so now, starting about 3 weeks ago.
Our backend is implemented in Haskell, so I doubt this has anything to do with the app platform.
Any update on this issue? Our DB has been unreliable for 3 days now with a mix of different errors; I assume it's all related to the above?
Current error:
Version = 0
Status = running
Hostname = chrx-db1.fly.dev
Instances
ID VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
96f31500 0 lhr run running (context dead) 3 total, 3 critical 4 2021-08-18T22:18:08Z
f45afaef 0 lhr run running (context dead) 3 total, 3 critical 6 2021-08-18T14:13:42Z
VM status:
Instance
ID = f45afaef
Version = 0
Region = lhr
Desired = run
Status = running (rpc error: c)
Health Checks = 3 total, 3 critical
Restarts = 6
Created = 2021-08-18T14:13:42Z
Recent Events
TIMESTAMP TYPE MESSAGE
2021-08-18T14:13:38Z Received Task received by client
2021-08-18T14:13:38Z Task Setup Building Task Directory
2021-08-18T14:13:40Z Started Task started by client
2021-08-18T22:17:41Z Restart Signaled User requested restart
2021-08-18T22:17:44Z Terminated Exit Code: 0
2021-08-18T22:17:44Z Restarting Task restarting in 0s
2021-08-18T22:17:46Z Started Task started by client
2021-08-19T18:38:39Z Restart Signaled User requested restart
2021-08-19T18:38:42Z Terminated Exit Code: 0
2021-08-19T18:38:42Z Restarting Task restarting in 0s
2021-08-19T18:38:44Z Started Task started by client
2021-08-19T19:33:25Z Restart Signaled User requested restart
2021-08-19T19:38:30Z Terminated Exit Code: 0
2021-08-19T19:38:30Z Restarting Task restarting in 0s
2021-08-19T19:38:32Z Started Task started by client
2021-08-19T19:38:41Z Restart Signaled User requested restart
2021-08-19T19:38:44Z Terminated Exit Code: 0
2021-08-19T19:38:44Z Restarting Task restarting in 0s
2021-08-19T19:38:46Z Started Task started by client
2021-08-20T08:33:06Z Restart Signaled User requested restart
2021-08-20T08:33:12Z Terminated Exit Code: 0
2021-08-20T08:33:12Z Restarting Task restarting in 0s
2021-08-20T08:33:13Z Started Task started by client
2021-08-20T12:39:50Z Restart Signaled User requested restart
2021-08-20T12:39:55Z Terminated Exit Code: 0
2021-08-20T12:39:56Z Restarting Task restarting in 0s
2021-08-20T12:39:57Z Started Task started by client
Checks
ID SERVICE STATE OUTPUT
role app critical rpc error: code = Unknown desc = Post "http://unix/v1/exec": read unix @->control.sock: read: connection reset by peer
vm app critical rpc error: code = Unknown desc = Post "http://unix/v1/exec": read unix @->control.sock: read: connection reset by peer
pg app critical rpc error: code = Unknown desc = Post "http://unix/v1/exec": read unix @->control.sock: read: connection reset by peer
Longer update coming, but if you run fly vm stop <id> on each of those instances they should recover, and we think we've corrected most of the issues that get them into that state.
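For the two instances listed above, that would be something like this (assuming the Postgres app name matches the chrx-db1 hostname shown in your status output):

# Stop each wedged instance so it gets restarted cleanly
fly vm stop 96f31500 -a chrx-db1
fly vm stop f45afaef -a chrx-db1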