Postgres fails constantly every 1-2 days for 2-3 min

Hello,

I am running a simple Elixir app with Oban, which periodically checks for new jobs. However, every 1-2 days Postgres seems to be down for 2-3 minutes in FRA. The log says:

Postgrex.Protocol (#PID<0.2810.0>) failed to connect: ** (DBConnection.ConnectionError) tcp recv: closed

It’s not critical, but I am wondering what the reason for that is.

Thank you.

Will you run fly status the next time that happens and see if the instances are passing health checks? It looks like the proxy that runs alongside Postgres either isn't accepting connections or can't connect to the leader.
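For example, run it against the Postgres app (the app name here is a placeholder):

fly status -a <postgres-app-name>

The HEALTH CHECKS column will show how many checks are passing versus critical.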

We just shipped some tweaks to the Consul service it's (probably) using to detect a leader; those should help, so we might've accidentally fixed this for you.

I have been seeing something similar almost every day (once a day) and thought it might be some automated app cycling. It's definitely cluttering up my error reporting.

I will monitor and report back if I see this continue.

One thing we've been seeing is fly status showing a VM with an rpc error. These nodes seem to be wedged, and if your app tries to connect to them, it'll get errors like this. The fix, right now, is to run fly vm stop <id> on the offending node.
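For example, using the VM ID from the fly status output (the app name here is a placeholder):

fly vm stop <id> -a <postgres-app-name>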

This is something we can automate, but it’s going to take a bit.

I finally got this error again while I was at a computer:

Instances
ID       VERSION REGION DESIRED STATUS           HEALTH CHECKS                  RESTARTS CREATED
9f772526 22      fra    run     running (leader) 3 total, 2 passing, 1 critical 5        2021-08-12T00:38:17Z

In logs:

2021-08-19T06:21:16.407433730Z app[68efc1bf] fra [info] proxy    | 2021-08-19T06:21:16.404Z	INFO	cmd/proxy.go:304	check timeout timer fired
2021-08-19T06:21:20.460734231Z app[68efc1bf] fra [info] keeper   | 2021-08-19T06:21:20.459Z	INFO	cmd/keeper.go:1505	our db requested role is master
2021-08-19T06:21:20.462585431Z app[68efc1bf] fra [info] keeper   | 2021-08-19T06:21:20.461Z	INFO	cmd/keeper.go:1543	already master
2021-08-19T06:21:20.479908062Z app[68efc1bf] fra [info] keeper   | 2021-08-19T06:21:20.478Z	INFO	cmd/keeper.go:1676	postgres parameters not changed
2021-08-19T06:21:20.481352544Z app[68efc1bf] fra [info] keeper   | 2021-08-19T06:21:20.480Z	INFO	cmd/keeper.go:1703	postgres hba entries not changed
2021-08-19T06:21:25.061008070Z app[68efc1bf] fra [info] proxy    | 2021-08-19T06:21:25.057Z	INFO	cmd/proxy.go:124	Starting proxying
2021-08-19T06:21:25.066124161Z app[68efc1bf] fra [info] proxy    | 2021-08-19T06:21:25.063Z	INFO	cmd/proxy.go:268	master address	{"address": "[fdaa:0:2d8e:a7b:67:0:2457:2]:5433"}
2021-08-19T06:21:25.705055024Z app[68efc1bf] fra [info] keeper   | 2021-08-19T06:21:25.703Z	INFO	cmd/keeper.go:1505	our db requested role is master
2021-08-19T06:21:25.707542410Z app[68efc1bf] fra [info] keeper   | 2021-08-19T06:21:25.706Z	INFO	cmd/keeper.go:1543	already master
2021-08-19T06:21:25.726919569Z app[68efc1bf] fra [info] keeper   | 2021-08-19T06:21:25.725Z	INFO	cmd/keeper.go:1676	postgres parameters not changed
2021-08-19T06:21:25.728865648Z app[68efc1bf] fra [info] keeper   | 2021-08-19T06:21:25.727Z	INFO	cmd/keeper.go:1703	postgres hba entries not changed
2021-08-19T06:21:31.138967284Z app[68efc1bf] fra [info] keeper   | 2021-08-19T06:21:31.137Z	INFO	cmd/keeper.go:1505	our db requested role is master
2021-08-19T06:21:31.140955713Z app[68efc1bf] fra [info] keeper   | 2021-08-19T06:21:31.139Z	INFO	cmd/keeper.go:1543	already master
2021-08-19T06:21:31.161345707Z app[68efc1bf] fra [info] keeper   | 2021-08-19T06:21:31.159Z	INFO	cmd/keeper.go:1676	postgres parameters not changed
2021-08-19T06:21:31.162889438Z app[68efc1bf] fra [info] keeper   | 2021-08-19T06:21:31.161Z	INFO	cmd/keeper.go:1703	postgres hba entries not changed
2021-08-19T06:21:34.311606644Z app[68efc1bf] fra [info] sentinel | 2021-08-19T06:21:34.308Z	ERROR	cmd/sentinel.go:1893	failed to get proxies info	{"error": "unexpected end of JSON input"}
2021-08-19T06:21:36.315402335Z app[68efc1bf] fra [info] keeper   | 2021-08-19T06:21:36.313Z	INFO	cmd/keeper.go:1505	our db requested role is master
2021-08-19T06:21:36.317502583Z app[68efc1bf] fra [info] keeper   | 2021-08-19T06:21:36.316Z	INFO	cmd/keeper.go:1543	already master
2021-08-19T06:21:36.337821387Z app[68efc1bf] fra [info] keeper   | 2021-08-19T06:21:36.336Z	INFO	cmd/keeper.go:1676	postgres parameters not changed
2021-08-19T06:21:36.339462958Z app[68efc1bf] fra [info] keeper   | 2021-08-19T06:21:36.338Z	INFO	cmd/keeper.go:1703	postgres hba entries not changed
2021-08-19T06:21:39.853021691Z app[68efc1bf] fra [info] sentinel | 2021-08-19T06:21:39.849Z	ERROR	cmd/sentinel.go:1893	failed to get proxies info	{"error": "unexpected end of JSON input"}
2021-08-19T06:21:41.470153280Z app[68efc1bf] fra [info] keeper   | 2021-08-19T06:21:41.468Z	INFO	cmd/keeper.go:1505	our db requested role is master
2021-08-19T06:21:41.472015000Z app[68efc1bf] fra [info] keeper   | 2021-08-19T06:21:41.470Z	INFO	cmd/keeper.go:1543	already master
2021-08-19T06:21:41.488493235Z app[68efc1bf] fra [info] keeper   | 2021-08-19T06:21:41.487Z	INFO	cmd/keeper.go:1676	postgres parameters not changed
2021-08-19T06:21:41.490187186Z app[68efc1bf] fra [info] keeper   | 2021-08-19T06:21:41.488Z	INFO	cmd/keeper.go:1703	postgres hba entries not changed
2021-08-19T06:21:45.462944078Z app[68efc1bf] fra [info] sentinel | 2021-08-19T06:21:45.459Z	ERROR	cmd/sentinel.go:1893	failed to get proxies info	{"error": "unexpected end of JSON input"}

Actually, this problem has persisted since around 5:30 UTC…

FYI, it has been working again since 6:30 or so.

Can I bump this? I have a Phoenix application and am seeing a similar thing. The DB goes down on a regular basis (every 1-2 days) and seems to self-heal, but I have been seeking feedback on an idea and this is biting me hard.

I sent it to a friend and he reported it down at 20:55 (BST). I went to check this morning and the DB was up, but the instances were created 12 hours ago. (The app has been deployed for far longer. I manually killed the instances the first time, so at least there is progress.)

@tjcberman Due to an Etcd bug pertaining to auth token renewals, some PG clusters were getting thrown into a state where Stolon was no longer able to authenticate. Unfortunately, your PG was one of them, and we had to rebuild the VMs yesterday to get them back in order.

The underlying bug is still being worked out here:

In the meantime, we transitioned our Etcd to stateless auth tokens (JWT), which allows us to work around the issue.
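For reference, etcd's stateless JWT tokens are enabled via its --auth-token flag, roughly like this (the key paths and TTL below are illustrative, not our actual configuration):

etcd --auth-token=jwt,pub-key=/etc/etcd/jwt_RS256.pub,priv-key=/etc/etcd/jwt_RS256,sign-method=RS256,ttl=10m

Because the token is self-contained, clients don't depend on the server-side token renewal path that was hitting the bug.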

Sorry for the trouble!

I can confirm the error is related to etcd.

etcdserver: user name is empty

2021-08-18T14:00:42.166767938Z app[8960ea46] lax [info] sentinel | {"level":"warn","ts":"2021-08-18T14:00:42.157Z","logger":"etcd-client","caller":"v3@v3.5.0/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0002bddc0/#initially=[etcd-na.fly-shared.net:443]","attempt":0,"error":"rpc error: code = InvalidArgument desc = etcdserver: user name is empty"}
2021-08-18T14:00:42.170906832Z app[8960ea46] lax [info] sentinel | 2021-08-18T14:00:42.166Z	ERROR	cmd/sentinel.go:1843	error retrieving cluster data	{"error": "etcdserver: user name is empty"}
2021-08-18T14:00:42.992525945Z app[8960ea46] lax [info] proxy    | {"level":"warn","ts":"2021-08-18T14:00:42.981Z","logger":"etcd-client","caller":"v3@v3.5.0/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0002bddc0/#initially=[etcd-na.fly-shared.net:443]","attempt":0,"error":"rpc error: code = InvalidArgument desc = etcdserver: user name is empty"}

I'm having the same issue a few times a day. :frowning:

This is the etcd bug @shaun mentioned in his post. We're tracking it closely; it should improve today.


Hi @shaun,

It is no problem (yet), as I am in early user testing. I guess the best thing is to keep an eye on the PR for a full fix? (I am reluctant to send it out as-is given the reliability issues.)

Thanks for all the good work. I'm impressed with Fly, and once this is sorted I will be even more so.

Hi @shaun,

I just checked and it looks like we have hit the issue again (19:54 BST).

We’re seeing the same issue. Two different PostgreSQL clusters, running on different instance types, both in LHR. They seem to fail every week or so now, starting about 3 weeks ago.

Our backend is implemented in Haskell, so I doubt this has anything to do with the app platform.

These are all related; we're working on it: Postgres fails constantly every 1-2 days for 2-3 min - #8 by shaun


Update

JWT led to more issues that are being tracked here: Bug: Auth revision w/ JWT auth tokens · Issue #13300 · etcd-io/etcd · GitHub

We are seriously considering reverting back to Consul…

Any update on this issue? Our DB has been unreliable for 3 days now, with a mix of different errors. I assume it's all related to the above?

Current error:

  Version  = 0
  Status   = running
  Hostname = chrx-db1.fly.dev

Instances
ID       VERSION REGION DESIRED STATUS                 HEALTH CHECKS       RESTARTS CREATED
96f31500 0       lhr    run     running (context dead) 3 total, 3 critical 4        2021-08-18T22:18:08Z
f45afaef 0       lhr    run     running (context dead) 3 total, 3 critical 6        2021-08-18T14:13:42Z

VM status:

Instance
  ID            = f45afaef
  Version       = 0
  Region        = lhr
  Desired       = run
  Status        = running (rpc error: c)
  Health Checks = 3 total, 3 critical
  Restarts      = 6
  Created       = 2021-08-18T14:13:42Z

Recent Events
TIMESTAMP            TYPE             MESSAGE
2021-08-18T14:13:38Z Received         Task received by client
2021-08-18T14:13:38Z Task Setup       Building Task Directory
2021-08-18T14:13:40Z Started          Task started by client
2021-08-18T22:17:41Z Restart Signaled User requested restart
2021-08-18T22:17:44Z Terminated       Exit Code: 0
2021-08-18T22:17:44Z Restarting       Task restarting in 0s
2021-08-18T22:17:46Z Started          Task started by client
2021-08-19T18:38:39Z Restart Signaled User requested restart
2021-08-19T18:38:42Z Terminated       Exit Code: 0
2021-08-19T18:38:42Z Restarting       Task restarting in 0s
2021-08-19T18:38:44Z Started          Task started by client
2021-08-19T19:33:25Z Restart Signaled User requested restart
2021-08-19T19:38:30Z Terminated       Exit Code: 0
2021-08-19T19:38:30Z Restarting       Task restarting in 0s
2021-08-19T19:38:32Z Started          Task started by client
2021-08-19T19:38:41Z Restart Signaled User requested restart
2021-08-19T19:38:44Z Terminated       Exit Code: 0
2021-08-19T19:38:44Z Restarting       Task restarting in 0s
2021-08-19T19:38:46Z Started          Task started by client
2021-08-20T08:33:06Z Restart Signaled User requested restart
2021-08-20T08:33:12Z Terminated       Exit Code: 0
2021-08-20T08:33:12Z Restarting       Task restarting in 0s
2021-08-20T08:33:13Z Started          Task started by client
2021-08-20T12:39:50Z Restart Signaled User requested restart
2021-08-20T12:39:55Z Terminated       Exit Code: 0
2021-08-20T12:39:56Z Restarting       Task restarting in 0s
2021-08-20T12:39:57Z Started          Task started by client

Checks
ID   SERVICE STATE    OUTPUT
role app     critical rpc error: code = Unknown desc = Post "http://unix/v1/exec": read unix @->control.sock: read: connection reset by peer
vm   app     critical rpc error: code = Unknown desc = Post "http://unix/v1/exec": read unix @->control.sock: read: connection reset by peer
pg   app     critical rpc error: code = Unknown desc = Post "http://unix/v1/exec": read unix @->control.sock: read: connection reset by peer

A longer update is coming, but if you run fly vm stop <id> on each of those instances, they should recover. We think we've corrected most of the issues that get them into that state.
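For the two instances shown above, that would be roughly the following (the app name is assumed from the chrx-db1.fly.dev hostname):

fly vm stop 96f31500 -a chrx-db1
fly vm stop f45afaef -a chrx-db1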


It just happened to us again.

What's your DB name? We're cleaning these up right now (we've got some good stability fixes in the builds).