Postgres fails constantly every 1-2 days for 2-3 min

Hello,

I am running a simple Elixir app with Oban, which periodically checks for new jobs. However, every 1-2 days Postgres seems to be down for 2-3 minutes in FRA. The log says:

Postgrex.Protocol (#PID<0.2810.0>) failed to connect: ** (DBConnection.ConnectionError) tcp recv: closed

It’s not critical, but I am wondering what the reason for that is.

Thank you.

Will you run fly status the next time that happens and see if the instances are passing health checks? It looks like the proxy that runs alongside Postgres either isn't accepting connections or can't connect to the leader.
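For example, run it against the Postgres app (the app name here is a placeholder):

fly status -a <postgres-app-name>

The HEALTH CHECKS column will show how many checks are passing versus critical.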

We just shipped some tweaks to the Consul service it's (probably) using to detect a leader; those should help, so we might've accidentally fixed this for you.

I have been seeing something similar almost every day (once a day) and thought it might be some automated app cycling. It's definitely cluttering up my error reporting.

I will monitor and report back if I see this continue.

One thing we've been seeing is fly status showing a VM with an rpc error. These nodes seem to be wedged, and if your app tries to connect to them, it'll get errors like this. The fix, right now, is to run fly vm stop <id> on the offending node.
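For example, using the VM ID from the fly status output (the app name here is a placeholder):

fly vm stop <id> -a <postgres-app-name>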

This is something we can automate, but it’s going to take a bit.

I finally got this error again while I was at a computer:

Instances
ID       VERSION REGION DESIRED STATUS           HEALTH CHECKS                  RESTARTS CREATED
9f772526 22      fra    run     running (leader) 3 total, 2 passing, 1 critical 5        2021-08-12T00:38:17Z

In logs:

2021-08-19T06:21:16.407433730Z app[68efc1bf] fra [info] proxy    | 2021-08-19T06:21:16.404Z	INFO	cmd/proxy.go:304	check timeout timer fired
2021-08-19T06:21:20.460734231Z app[68efc1bf] fra [info] keeper   | 2021-08-19T06:21:20.459Z	INFO	cmd/keeper.go:1505	our db requested role is master
2021-08-19T06:21:20.462585431Z app[68efc1bf] fra [info] keeper   | 2021-08-19T06:21:20.461Z	INFO	cmd/keeper.go:1543	already master
2021-08-19T06:21:20.479908062Z app[68efc1bf] fra [info] keeper   | 2021-08-19T06:21:20.478Z	INFO	cmd/keeper.go:1676	postgres parameters not changed
2021-08-19T06:21:20.481352544Z app[68efc1bf] fra [info] keeper   | 2021-08-19T06:21:20.480Z	INFO	cmd/keeper.go:1703	postgres hba entries not changed
2021-08-19T06:21:25.061008070Z app[68efc1bf] fra [info] proxy    | 2021-08-19T06:21:25.057Z	INFO	cmd/proxy.go:124	Starting proxying
2021-08-19T06:21:25.066124161Z app[68efc1bf] fra [info] proxy    | 2021-08-19T06:21:25.063Z	INFO	cmd/proxy.go:268	master address	{"address": "[fdaa:0:2d8e:a7b:67:0:2457:2]:5433"}
2021-08-19T06:21:25.705055024Z app[68efc1bf] fra [info] keeper   | 2021-08-19T06:21:25.703Z	INFO	cmd/keeper.go:1505	our db requested role is master
2021-08-19T06:21:25.707542410Z app[68efc1bf] fra [info] keeper   | 2021-08-19T06:21:25.706Z	INFO	cmd/keeper.go:1543	already master
2021-08-19T06:21:25.726919569Z app[68efc1bf] fra [info] keeper   | 2021-08-19T06:21:25.725Z	INFO	cmd/keeper.go:1676	postgres parameters not changed
2021-08-19T06:21:25.728865648Z app[68efc1bf] fra [info] keeper   | 2021-08-19T06:21:25.727Z	INFO	cmd/keeper.go:1703	postgres hba entries not changed
2021-08-19T06:21:31.138967284Z app[68efc1bf] fra [info] keeper   | 2021-08-19T06:21:31.137Z	INFO	cmd/keeper.go:1505	our db requested role is master
2021-08-19T06:21:31.140955713Z app[68efc1bf] fra [info] keeper   | 2021-08-19T06:21:31.139Z	INFO	cmd/keeper.go:1543	already master
2021-08-19T06:21:31.161345707Z app[68efc1bf] fra [info] keeper   | 2021-08-19T06:21:31.159Z	INFO	cmd/keeper.go:1676	postgres parameters not changed
2021-08-19T06:21:31.162889438Z app[68efc1bf] fra [info] keeper   | 2021-08-19T06:21:31.161Z	INFO	cmd/keeper.go:1703	postgres hba entries not changed
2021-08-19T06:21:34.311606644Z app[68efc1bf] fra [info] sentinel | 2021-08-19T06:21:34.308Z	ERROR	cmd/sentinel.go:1893	failed to get proxies info	{"error": "unexpected end of JSON input"}
2021-08-19T06:21:36.315402335Z app[68efc1bf] fra [info] keeper   | 2021-08-19T06:21:36.313Z	INFO	cmd/keeper.go:1505	our db requested role is master
2021-08-19T06:21:36.317502583Z app[68efc1bf] fra [info] keeper   | 2021-08-19T06:21:36.316Z	INFO	cmd/keeper.go:1543	already master
2021-08-19T06:21:36.337821387Z app[68efc1bf] fra [info] keeper   | 2021-08-19T06:21:36.336Z	INFO	cmd/keeper.go:1676	postgres parameters not changed
2021-08-19T06:21:36.339462958Z app[68efc1bf] fra [info] keeper   | 2021-08-19T06:21:36.338Z	INFO	cmd/keeper.go:1703	postgres hba entries not changed
2021-08-19T06:21:39.853021691Z app[68efc1bf] fra [info] sentinel | 2021-08-19T06:21:39.849Z	ERROR	cmd/sentinel.go:1893	failed to get proxies info	{"error": "unexpected end of JSON input"}
2021-08-19T06:21:41.470153280Z app[68efc1bf] fra [info] keeper   | 2021-08-19T06:21:41.468Z	INFO	cmd/keeper.go:1505	our db requested role is master
2021-08-19T06:21:41.472015000Z app[68efc1bf] fra [info] keeper   | 2021-08-19T06:21:41.470Z	INFO	cmd/keeper.go:1543	already master
2021-08-19T06:21:41.488493235Z app[68efc1bf] fra [info] keeper   | 2021-08-19T06:21:41.487Z	INFO	cmd/keeper.go:1676	postgres parameters not changed
2021-08-19T06:21:41.490187186Z app[68efc1bf] fra [info] keeper   | 2021-08-19T06:21:41.488Z	INFO	cmd/keeper.go:1703	postgres hba entries not changed
2021-08-19T06:21:45.462944078Z app[68efc1bf] fra [info] sentinel | 2021-08-19T06:21:45.459Z	ERROR	cmd/sentinel.go:1893	failed to get proxies info	{"error": "unexpected end of JSON input"}

Actually, this problem has persisted since around 5:30 UTC…

FYI, it has been working again since 6:30 or so.

Can I bump this? I have a Phoenix application and am seeing a similar thing. The DB goes down on a regular basis (every 1-2 days) and seems to self-heal, but I have been seeking feedback on an idea and this is biting me hard.

I sent it to a friend and he reported it down at 20:55 (BST). I went to check this morning and the DB was up, but the instances were created 12 hours ago. (The app has been deployed for far longer. I manually killed the instances the first time, so at least there is progress.)

@tjcberman Due to an Etcd bug pertaining to auth token renewals, some PG clusters were getting thrown into a state where Stolon was no longer able to authenticate. Unfortunately, your PG was one of them, and we had to rebuild the VMs yesterday to get them back in order.

The underlying bug is still being worked out here:

In the meantime, we transitioned our Etcd to stateless auth tokens (JWT), which allows us to work around the issue.
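For reference, etcd's stateless JWT tokens are enabled via its --auth-token flag, roughly like this (the key paths and TTL below are illustrative, not our actual configuration):

etcd --auth-token=jwt,pub-key=/etc/etcd/jwt_RS256.pub,priv-key=/etc/etcd/jwt_RS256,sign-method=RS256,ttl=10m

Because the token is self-contained, clients don't depend on the server-side token renewal path that was hitting the bug.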

Sorry for the trouble!

I can confirm the error is related to etcd.

etcdserver: user name is empty

2021-08-18T14:00:42.166767938Z app[8960ea46] lax [info] sentinel | {"level":"warn","ts":"2021-08-18T14:00:42.157Z","logger":"etcd-client","caller":"v3@v3.5.0/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0002bddc0/#initially=[etcd-na.fly-shared.net:443]","attempt":0,"error":"rpc error: code = InvalidArgument desc = etcdserver: user name is empty"}
2021-08-18T14:00:42.170906832Z app[8960ea46] lax [info] sentinel | 2021-08-18T14:00:42.166Z	ERROR	cmd/sentinel.go:1843	error retrieving cluster data	{"error": "etcdserver: user name is empty"}
2021-08-18T14:00:42.992525945Z app[8960ea46] lax [info] proxy    | {"level":"warn","ts":"2021-08-18T14:00:42.981Z","logger":"etcd-client","caller":"v3@v3.5.0/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0002bddc0/#initially=[etcd-na.fly-shared.net:443]","attempt":0,"error":"rpc error: code = InvalidArgument desc = etcdserver: user name is empty"}

I'm having the same issue a few times a day. :frowning:

This is the etcd bug @shaun mentioned in his post. We're tracking it closely; it should improve today.


Hi @shaun,

It is no problem (yet), as I am in early user testing. I guess the best thing is to keep an eye on the PR for a full fix? (I am reluctant to send it out as-is given the reliability issues.)

Thanks for all the good work. I'm impressed with Fly, and once this is sorted I will be even more so.

Hi @shaun,

I just checked and it looks like we have hit the issue again (19:54 BST).

We’re seeing the same issue. Two different PostgreSQL clusters, running on different instance types, both in LHR. They seem to fail every week or so now, starting about 3 weeks ago.

Our backend is implemented in Haskell, so I doubt this has anything to do with the app platform.

These are all related; we're working on it: Postgres fails constantly every 1-2 days for 2-3 min - #8 by shaun


Update

JWT led to more issues that are being tracked here: Bug: Auth revision w/ JWT auth tokens · Issue #13300 · etcd-io/etcd · GitHub

We are seriously considering reverting back to Consul…

Any update on this issue? Our DB has been unreliable for 3 days now, with a mix of different errors. I assume it's all related to the above?

Current error:

  Version  = 0
  Status   = running
  Hostname = chrx-db1.fly.dev

Instances
ID       VERSION REGION DESIRED STATUS                 HEALTH CHECKS       RESTARTS CREATED
96f31500 0       lhr    run     running (context dead) 3 total, 3 critical 4        2021-08-18T22:18:08Z
f45afaef 0       lhr    run     running (context dead) 3 total, 3 critical 6        2021-08-18T14:13:42Z

VM status:

Instance
  ID            = f45afaef
  Version       = 0
  Region        = lhr
  Desired       = run
  Status        = running (rpc error: c)
  Health Checks = 3 total, 3 critical
  Restarts      = 6
  Created       = 2021-08-18T14:13:42Z

Recent Events
TIMESTAMP            TYPE             MESSAGE
2021-08-18T14:13:38Z Received         Task received by client
2021-08-18T14:13:38Z Task Setup       Building Task Directory
2021-08-18T14:13:40Z Started          Task started by client
2021-08-18T22:17:41Z Restart Signaled User requested restart
2021-08-18T22:17:44Z Terminated       Exit Code: 0
2021-08-18T22:17:44Z Restarting       Task restarting in 0s
2021-08-18T22:17:46Z Started          Task started by client
2021-08-19T18:38:39Z Restart Signaled User requested restart
2021-08-19T18:38:42Z Terminated       Exit Code: 0
2021-08-19T18:38:42Z Restarting       Task restarting in 0s
2021-08-19T18:38:44Z Started          Task started by client
2021-08-19T19:33:25Z Restart Signaled User requested restart
2021-08-19T19:38:30Z Terminated       Exit Code: 0
2021-08-19T19:38:30Z Restarting       Task restarting in 0s
2021-08-19T19:38:32Z Started          Task started by client
2021-08-19T19:38:41Z Restart Signaled User requested restart
2021-08-19T19:38:44Z Terminated       Exit Code: 0
2021-08-19T19:38:44Z Restarting       Task restarting in 0s
2021-08-19T19:38:46Z Started          Task started by client
2021-08-20T08:33:06Z Restart Signaled User requested restart
2021-08-20T08:33:12Z Terminated       Exit Code: 0
2021-08-20T08:33:12Z Restarting       Task restarting in 0s
2021-08-20T08:33:13Z Started          Task started by client
2021-08-20T12:39:50Z Restart Signaled User requested restart
2021-08-20T12:39:55Z Terminated       Exit Code: 0
2021-08-20T12:39:56Z Restarting       Task restarting in 0s
2021-08-20T12:39:57Z Started          Task started by client

Checks
ID   SERVICE STATE    OUTPUT
role app     critical rpc error: code = Unknown desc = Post "http://unix/v1/exec": read unix @->control.sock: read: connection reset by peer
vm   app     critical rpc error: code = Unknown desc = Post "http://unix/v1/exec": read unix @->control.sock: read: connection reset by peer
pg   app     critical rpc error: code = Unknown desc = Post "http://unix/v1/exec": read unix @->control.sock: read: connection reset by peer

A longer update is coming, but if you run fly vm stop <id> on each of those instances, they should recover. We think we've corrected most of the issues that get them into that state.
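For the two instances shown above, that would be roughly the following (the app name is assumed from the chrx-db1.fly.dev hostname):

fly vm stop 96f31500 -a chrx-db1
fly vm stop f45afaef -a chrx-db1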


It just happened to us again.

What's your DB name? We're cleaning these up right now (we've got some good stability fixes in the builds).