Possible issue with database

greg · April 1, 2021, 1:33pm

Hello,

I’ve noticed some issues with my postgres databases over the past 24 hours or so.

I got some errors about queries failing.

When I look in the logs for my app (I’m using node, with pg and pg-promise), I see:
Error: Connection terminated unexpectedly
… when I log err.message.

When I look in the database app logs (using flyctl logs) I see a load of this stuff (not sure how much is secret so I put APP and random). Which doesn’t … look good. Since it’s only a tail, not sure how far back these lines go:

2021-03-31T13:51:12.775Z 5a36188f lhr [info] keeper            | 2021-03-31T13:51:23.300Z	ERROR	cmd/keeper.go:839	failed to update keeper info	{"error": "Put \"https://consul-na.fly-shared.net/v1/kv/APP-random/APP/keepers/info/random?flags=3304740253564472344&wait=5000ms\": EOF"}
2021-03-31T13:51:46.791Z 5a36188f lhr [info] sentinel          | 2021-03-31T13:51:57.318Z	ERROR	cmd/sentinel.go:102	election loop error	{"error": "Put \"https://consul-na.fly-shared.net/v1/session/create?wait=5000ms\": EOF"}
2021-03-31T13:51:46.796Z 5a36188f lhr [info] sentinel          | panic: close of closed channel
2021-03-31T13:51:46.797Z 5a36188f lhr [info] sentinel          | goroutine 36 [running]:
2021-03-31T13:51:46.797Z 5a36188f lhr [info] sentinel          |
2021-03-31T13:51:46.798Z 5a36188f lhr [info] sentinel          | github.com/superfly/leadership.(*Candidate).Stop(...)
2021-03-31T13:51:46.799Z 5a36188f lhr [info] sentinel          | 	/go/pkg/mod/github.com/superfly/leadership@v0.2.0/candidate.go:66
2021-03-31T13:51:46.800Z 5a36188f lhr [info] sentinel          | github.com/sorintlab/stolon/internal/store.(*libkvElection).Stop(0xc00000c0c0)
2021-03-31T13:51:46.801Z 5a36188f lhr [info] sentinel          | 	/go/src/app/internal/store/libkv.go:108 +0x33
2021-03-31T13:51:46.802Z 5a36188f lhr [info] sentinel          | github.com/sorintlab/stolon/cmd/sentinel/cmd.(*Sentinel).electionLoop(0xc0001726c0, 0x19d84c0, 0xc0001f5380)
2021-03-31T13:51:46.803Z 5a36188f lhr [info] sentinel          | 	/go/src/app/cmd/sentinel/cmd/sentinel.go:108 +0x482
2021-03-31T13:51:46.804Z 5a36188f lhr [info] sentinel          | created by github.com/sorintlab/stolon/cmd/sentinel/cmd.(*Sentinel).Start
2021-03-31T13:51:46.805Z 5a36188f lhr [info] sentinel          | 	/go/src/app/cmd/sentinel/cmd/sentinel.go:1811 +0xab
2021-03-31T13:51:50.384Z 5a36188f lhr [info] keeper            | 2021-03-31T13:52:00.911Z	ERROR	cmd/keeper.go:1010	error retrieving cluster data	{"error": "Get \"https://consul-na.fly-shared.net/v1/kv/APP-random/APP/clusterdata?consistent=&wait=5000ms\": EOF"}

To further confuse things, I can no longer connect to the database from my local machine, despite having wireguard running and not changing anything like name, password. Same connection details.

All of the flyctl status stuff looks fine, it says healthcheck passed etc.

Yet I haven’t changed anything. I did re-deploy the app since to experiment with, but haven’t restarted the database apps or touched them. Not sure how to debug really. I could restart them, but wanted to see what had gone wrong first ideally.

Any thoughts?

greg · April 1, 2021, 1:50pm

I could run flyctl restart on the database apps presumably. But not sure if that’s a good idea … ?

Also, possible typo, reading your docs Postgres on Fly · Fly, I see it includes

flyctl postgres databases list

… but that command does not seem to exist?

greg · April 1, 2021, 2:10pm

Hmm … and without changing anything or restarting, I am now able to connect to one of my two database apps locally. About ten minutes ago, I couldn’t.

And the app it is attached to is returning data from it too. Same query, same connection details etc.

The other database app is still not responding though.

And now tailing logs isn’t responding. Or they have been deleted. Or the log debug level has been changed at your end. Since now the flyctl logs is not responding with anything. Hmm …

kurt · April 1, 2021, 2:40pm

Those consul errors in the logs are a problem, yes, seems like it’s having difficulty connecting to the shared consul service.

It is safe to restart the Postgres VMs, or flyctl vm stop <id>. When you are troubleshooting this, you should try connecting to both port 5432 and port 5433. 5432 is the proxy to connect you to “primary”, 5433 connects you directly to the postgres instance (which might be readonly/replica).

We’re investigating, that consul service is healthy. It’s weird to get an EOF from it. It might have something to do with consul running in NA and your database being in London.

greg · April 1, 2021, 2:53pm

Ok, so it may be to do with that then.

Interestingly I tried that locally, and yes, if I change my local connection to be 5433, I can connect using that. Just not using 5432.

And that would explain why the app can no longer connect to do its queries: I see the process.env.DATABASE_URL which I’m using to connect with (as the app is attached to the database app) ends 5432.

So the temporary fix would be to use 5433 but it sounds like that’s not a good idea since I want to INSERT too, and so those will fail if it’s read-only.

I can try a restart, but if it’s the proxy that’s the issue, that sounds like a restart won’t actually help since the VM itself is ok (as reported by healthchecks etc)?

kurt · April 1, 2021, 3:04pm

Is it working now? I restarted both VMs and the logs look good. I’m guessing what happened is Stolon gave up trying to detect a leader, and the connection proxy just wasn’t accepting connections. That proxy runs in the VM alongside the postgres process, so restarting probably helped.

Out of curiosity, did you happen to run flyctl status on the postgres when this was happening? And did it show health checks passing or failing? It looked like things were healthy by the time I looked, but we don’t have a connection proxy healthcheck built in (which we should do!)

greg · April 1, 2021, 3:12pm

No (well, one of them yes, one of them no).

I have two database apps. And each of those has two VMs (as that is built into your service).

Yesterday neither database app was working. Well, as in the 5432 wasn’t accepting connections, as it’s turned out. Hence neither me nor the app was able to connect, to either of them.

Without changing anything, one of the two database apps did start allowing connections on 5432 earlier today, as I and the app was able to connect without changing anything. I’m guessing that is the same app, the dev one, that you restarted just now? That app was already working from a little earlier today. Both 5432 and 5433.

It’s the other database app, the com one, that currently won’t let me connect on 5432, but will on 5433. So I’ve just checked flyctl status on that, and it reports all is well. 0 restarts reported. Healthchecks passing (I checked that as well yesterday, same thing then).

Yet the app and I can’t connect on 5432 to it, as of now.

greg · April 1, 2021, 3:21pm

Oh, and both apps had the page of consul errors in their logs yesterday. So it may be the same issue with both.

Now the dev one has auto-resolved itself after 24 hours of un-connectability (definitely a word). Not that it mattered in this case, hence the lack of urgency. The com one has not.

Does sound like a healthcheck on that proxy may be needed. As it sounds like that’s the cause of these issues. Or at least connected to it with the consul stuff. As that DATABASE_URL uses the proxy port and so if that’s not working, the database itself is out of action.

kurt · April 1, 2021, 3:29pm

Ok we’ll get that health check in place. We’ll also do some testing to get stolon to recover from that, it should have self corrected.

I restarted the VMs on your -com database, so it’s likely working again now.

greg · April 1, 2021, 3:39pm

Yep!

I just tried connecting to com on 5432 to it locally, and that worked. And the app attached to it also reports it is able to do queries and is returning results again.

So if I happen to notice this again, the answer is to restart. Turn it off and on again I understand it’s still in beta. I guess it’s not safe for production quite yet.

kurt · April 1, 2021, 3:45pm

Now that we’ve seen this particular problem, we can make it handle it next time. It looks like we hit this bug, I’m fairly sure we can workaround it.

The best bet right now is to setup alert handlers to make sure you’re notified when things go south. In hindsight, a “can we connect to the proxy” check is obvious.

greg · April 1, 2021, 3:54pm

Would those alerts have helped in this case? As from what I saw, the VM healthchecks all reported all was well. Even the pg specific stuff had ticks and ok.

I’d rather just have it work though. And not need restarting at 4am Sounds like you can fix the issues.

This got me thinking that even if there was a proxy service to external databases, it would still have the same issue: you’d have to 100% rely on that proxy working. Sigh. Nothing is ever simple!

The elusive goal I think is SQL-over-HTTP. Like a database you can use serverless-ly. All of the rivals in this space have their own issues.

kurt · April 1, 2021, 8:17pm

If you run flyctl checks list against your databases, you’ll see a spiffy new proxy check in there. We also patched the stolon lib to try and workaround the bug.

The most reliable way to connect to your database is to grab the private VM ips and make a connection string like this:

postgres://<user>:<password>@[fdaa:0:5d9:a7b:a98:0:fa3:2]:5433,[fdaa:0:5d9:a7b:a9a:0:fa2:2]:5433/database?target_session_attrs=read-write

Assuming you’re using a driver that uses libpq under the covers, that will try both IPs and give you a connection to the writable one and not rely on any proxies.

We haven’t quite figured out how to make this easy yet. We’ll probably ship libraries for it someday, it’s relatively simple to do the dns lookup to get the VM ips and make the URL.

greg · April 1, 2021, 8:40pm

Interesting. I wasn’t aware it could pick like that. Are those IPv6 the ones from flyctl info? And do they ever change?

I could hard-code that style of URL in to my own environment variable. Rather than use the DATABASE_URL, which would go via the proxy.

Or … you could change the DATABASE_URL to be instead be structured like that. But I guess that wouldn’t be suited for people with more than 2 VMs.

kurt · April 1, 2021, 8:42pm

Those are the IPs from flyctl ips private, you can use our DNS server to resolve them. What language are you using? I can probably give you a little snippet.

greg · April 1, 2021, 8:58pm

Nodejs.

But do you mean I would have to dynamically get them and build the connection string? I couldn’t just get them once and set that connection string in env?

kurt · April 1, 2021, 9:16pm

Which node module are you using?

You could set an env var/secret, yes, but that’s kind of brittle. And I’d like to have an example for node users anyway.

greg · April 1, 2021, 9:37pm

pg-promise

Which originally was to add promises to node-postgres. But then added more good stuff like named parameters etc.

I’m connecting like this:

const pgp = require('pg-promise')({
    capSQL: true
});

let pgdb = pgp({
       connectionString: process.env.DATABASE_URL
});

pier · March 1, 2022, 11:03pm

@greg were you able to solve this issue? Did it also happen when the app was running in Fly instead of locally?

I’m getting this error seemingly randomly when running the app on my dev machine with Wireguard:

node:events:498
      throw er; // Unhandled 'error' event
      ^

Error: Connection terminated unexpectedly

The weird thing is that I am handling the PG errors, but this error seems to happen out of nowhere.

greg · March 1, 2022, 11:25pm

@pier Alas no. I had it when running on Fly.

Kurt mentioned this bug:

github.com/sorintlab/stolon

Stolon-sentinel fails after Consul cluster partitioning

opened 01:22PM - 22 Dec 20 UTC

iamalryz

bug

**What happened**: I have Stolon installation with Consul cluster of 3 servers …as a storage backend for Stolon. Sometimes there are problems on the network that result in Consul cluster partitioning. As a result, Stolon-sentinel can't retrieve its cluster data and loses its leadership until Consul returns to a consistent state. But after two `election loop error`s Stolon-sentinel fails with status `INVALIDARGUMENT`. Logs example: ``` 2020-12-22T12:39:20.431Z INFO cmd/sentinel.go:94 sentinel leadership lost 2020-12-22T12:39:20.429Z ERROR cmd/sentinel.go:1807 error retrieving cluster data {"error": "Unexpected response code: 500"} 2020-12-22T12:39:30.437Z ERROR cmd/sentinel.go:102 election loop error {"error": "Unexpected response code: 500 (rpc error getting client: failed to get conn: dial tcp <nil>->172.23.0.3:8300: i /o timeout)"} 2020-12-22T12:39:32.494Z ERROR cmd/sentinel.go:1807 error retrieving cluster data {"error": "Unexpected response code: 500"} 2020-12-22T12:39:40.438Z INFO cmd/sentinel.go:82 Trying to acquire sentinels leadership 2020-12-22T12:39:44.818Z ERROR cmd/sentinel.go:1807 error retrieving cluster data {"error": "Unexpected response code: 500"} 2020-12-22T12:39:47.746Z ERROR cmd/sentinel.go:102 election loop error {"error": "Unexpected response code: 500 (rpc error making call: No cluster leader)"} panic: close of closed channel goroutine 39 [running]: github.com/docker/leadership.(*Candidate).Stop(...) /go/pkg/mod/github.com/docker/leadership@v0.1.0/candidate.go:66 github.com/sorintlab/stolon/internal/store.(*libkvElection).Stop(0xc00019c220) /stolon/internal/store/libkv.go:110 +0x33 github.com/sorintlab/stolon/cmd/sentinel/cmd.(*Sentinel).electionLoop(0xc000342000, 0x1a62b60, 0xc0000d29c0) /stolon/cmd/sentinel/cmd/sentinel.go:108 +0x478 created by github.com/sorintlab/stolon/cmd/sentinel/cmd.(*Sentinel).Start /stolon/cmd/sentinel/cmd/sentinel.go:1775 +0xab ``` There is no difference in how many Stolon-sentinels are present in the cluster. They all fail. **What you expected to happen**: It seems like Stolon-sentinel should not fail if the underlying storage is temporarily unavailable. **How to reproduce it (as minimally and precisely as possible)**: You can easily reproduce this problem using the docker-compose manifest and the bootstrap script in this archive: [sentinel-panic-lab.zip](https://github.com/sorintlab/stolon/files/5730034/sentinel-panic-lab.zip). Usage: - Unpack the archive. - Start the script: ``` bash ./bootstrap.sh ``` It performs the next steps: - bootstraps a Consul server cluster with 3 containers and Consul agent in a separate container; - initializes Stolon cluster; - bootstraps one Stolon-sentinel container; - bootstraps 2 containers with Stolon-keeper. When the environment has been bootstrapped, you can stop two Consul server containers and see Stolon-sentinel logs. **Environment**: - Stolon version: tested on v0.15.0 and v0.16.0 - Stolon running environment (if useful to understand the bug): Docker, bare metal - Others: - Consul version: tested on 1.7.2 and 1.9

… I just checked and that is still showing as open. Hmm.

You could perhaps see if Kurt’s suggested fix/workaround of using the IPv6 in the connection URL helps? Other than that, not sure.

Topic		Replies	Views
Postgres database not responding at all Questions / Help	3	930	February 21, 2022
Database down again	11	772	June 23, 2021
Postgres down. Again Questions / Help postgres	2	398	February 22, 2023
Postgres clusters periodically down across many of our organizations Questions / Help postgres	7	1648	October 13, 2022
Postgres database apps are crashing again	22	1203	October 25, 2022

Possible issue with database

Related topics