Possible issue with database

Hello,

I’ve noticed some issues with my postgres databases over the past 24 hours or so.

I got some errors about queries failing.

When I look in the logs for my app (I’m using node, with pg and pg-promise), I see:
Error: Connection terminated unexpectedly
… when I log err.message.

When I look in the database app logs (using flyctl logs) I see a load of this stuff (not sure how much is secret so I put APP and random). Which doesn’t … look good. Since it’s only a tail, not sure how far back these lines go:

2021-03-31T13:51:12.775Z 5a36188f lhr [info] keeper            | 2021-03-31T13:51:23.300Z	ERROR	cmd/keeper.go:839	failed to update keeper info	{"error": "Put \"https://consul-na.fly-shared.net/v1/kv/APP-random/APP/keepers/info/random?flags=3304740253564472344&wait=5000ms\": EOF"}
2021-03-31T13:51:46.791Z 5a36188f lhr [info] sentinel          | 2021-03-31T13:51:57.318Z	ERROR	cmd/sentinel.go:102	election loop error	{"error": "Put \"https://consul-na.fly-shared.net/v1/session/create?wait=5000ms\": EOF"}
2021-03-31T13:51:46.796Z 5a36188f lhr [info] sentinel          | panic: close of closed channel
2021-03-31T13:51:46.797Z 5a36188f lhr [info] sentinel          | goroutine 36 [running]:
2021-03-31T13:51:46.797Z 5a36188f lhr [info] sentinel          |
2021-03-31T13:51:46.798Z 5a36188f lhr [info] sentinel          | github.com/superfly/leadership.(*Candidate).Stop(...)
2021-03-31T13:51:46.799Z 5a36188f lhr [info] sentinel          | 	/go/pkg/mod/github.com/superfly/leadership@v0.2.0/candidate.go:66
2021-03-31T13:51:46.800Z 5a36188f lhr [info] sentinel          | github.com/sorintlab/stolon/internal/store.(*libkvElection).Stop(0xc00000c0c0)
2021-03-31T13:51:46.801Z 5a36188f lhr [info] sentinel          | 	/go/src/app/internal/store/libkv.go:108 +0x33
2021-03-31T13:51:46.802Z 5a36188f lhr [info] sentinel          | github.com/sorintlab/stolon/cmd/sentinel/cmd.(*Sentinel).electionLoop(0xc0001726c0, 0x19d84c0, 0xc0001f5380)
2021-03-31T13:51:46.803Z 5a36188f lhr [info] sentinel          | 	/go/src/app/cmd/sentinel/cmd/sentinel.go:108 +0x482
2021-03-31T13:51:46.804Z 5a36188f lhr [info] sentinel          | created by github.com/sorintlab/stolon/cmd/sentinel/cmd.(*Sentinel).Start
2021-03-31T13:51:46.805Z 5a36188f lhr [info] sentinel          | 	/go/src/app/cmd/sentinel/cmd/sentinel.go:1811 +0xab
2021-03-31T13:51:50.384Z 5a36188f lhr [info] keeper            | 2021-03-31T13:52:00.911Z	ERROR	cmd/keeper.go:1010	error retrieving cluster data	{"error": "Get \"https://consul-na.fly-shared.net/v1/kv/APP-random/APP/clusterdata?consistent=&wait=5000ms\": EOF"}

To further confuse things, I can no longer connect to the database from my local machine, despite having wireguard running and not changing anything like name, password. Same connection details.

All of the flyctl status stuff looks fine, it says healthcheck passed etc.

Yet I haven’t changed anything. I did re-deploy the app since to experiment with, but haven’t restarted the database apps or touched them. Not sure how to debug really. I could restart them, but wanted to see what had gone wrong first ideally.

Any thoughts?

I could run flyctl restart on the database apps presumably. But not sure if that’s a good idea … ?

Also, possible typo, reading your docs Postgres on Fly · Fly, I see it includes

flyctl postgres databases list

… but that command does not seem to exist?

Hmm … and without changing anything or restarting, I am now able to connect to one of my two database apps locally. About ten minutes ago, I couldn’t.

And the app it is attached to is returning data from it too. Same query, same connection details etc.

The other database app is still not responding though.

And now tailing logs isn’t responding. Or they have been deleted. Or the log debug level has been changed at your end. Since now the flyctl logs is not responding with anything. Hmm …

Those consul errors in the logs are a problem, yes, seems like it’s having difficulty connecting to the shared consul service.

It is safe to restart the Postgres VMs, or flyctl vm stop <id>. When you are troubleshooting this, you should try connecting to both port 5432 and port 5433. 5432 is the proxy to connect you to “primary”, 5433 connects you directly to the postgres instance (which might be readonly/replica).

We’re investigating, that consul service is healthy. It’s weird to get an EOF from it. It might have something to do with consul running in NA and your database being in London.

Ok, so it may be to do with that then.

Interestingly I tried that locally, and yes, if I change my local connection to be 5433, I can connect using that. Just not using 5432.

And that would explain why the app can no longer connect to do its queries: I see the process.env.DATABASE_URL which I’m using to connect with (as the app is attached to the database app) ends 5432.

So the temporary fix would be to use 5433 but it sounds like that’s not a good idea since I want to INSERT too, and so those will fail if it’s read-only.

I can try a restart, but if it’s the proxy that’s the issue, that sounds like a restart won’t actually help since the VM itself is ok (as reported by healthchecks etc)?

Is it working now? I restarted both VMs and the logs look good. I’m guessing what happened is Stolon gave up trying to detect a leader, and the connection proxy just wasn’t accepting connections. That proxy runs in the VM alongside the postgres process, so restarting probably helped.

Out of curiosity, did you happen to run flyctl status on the postgres when this was happening? And did it show health checks passing or failing? It looked like things were healthy by the time I looked, but we don’t have a connection proxy healthcheck built in (which we should do!)

No (well, one of them yes, one of them no).

I have two database apps. And each of those has two VMs (as that is built into your service).

Yesterday neither database app was working. Well, as in the 5432 wasn’t accepting connections, as it’s turned out. Hence neither me nor the app was able to connect, to either of them.

Without changing anything, one of the two database apps did start allowing connections on 5432 earlier today, as I and the app was able to connect without changing anything. I’m guessing that is the same app, the dev one, that you restarted just now? That app was already working from a little earlier today. Both 5432 and 5433.

It’s the other database app, the com one, that currently won’t let me connect on 5432, but will on 5433. So I’ve just checked flyctl status on that, and it reports all is well. 0 restarts reported. Healthchecks passing (I checked that as well yesterday, same thing then).

Yet the app and I can’t connect on 5432 to it, as of now.

Oh, and both apps had the page of consul errors in their logs yesterday. So it may be the same issue with both.

Now the dev one has auto-resolved itself after 24 hours of un-connectability (definitely a word). Not that it mattered in this case, hence the lack of urgency. The com one has not.

Does sound like a healthcheck on that proxy may be needed. As it sounds like that’s the cause of these issues. Or at least connected to it with the consul stuff. As that DATABASE_URL uses the proxy port and so if that’s not working, the database itself is out of action.

Ok we’ll get that health check in place. We’ll also do some testing to get stolon to recover from that, it should have self corrected.

I restarted the VMs on your -com database, so it’s likely working again now.

Yep!

I just tried connecting to com on 5432 to it locally, and that worked. And the app attached to it also reports it is able to do queries and is returning results again.

So if I happen to notice this again, the answer is to restart. Turn it off and on again :slight_smile: I understand it’s still in beta. I guess it’s not safe for production quite yet.

Now that we’ve seen this particular problem, we can make it handle it next time. It looks like we hit this bug, I’m fairly sure we can workaround it.

The best bet right now is to setup alert handlers to make sure you’re notified when things go south. In hindsight, a “can we connect to the proxy” check is obvious.

Would those alerts have helped in this case? As from what I saw, the VM healthchecks all reported all was well. Even the pg specific stuff had ticks and ok.

I’d rather just have it work though. And not need restarting at 4am :slight_smile: Sounds like you can fix the issues.

This got me thinking that even if there was a proxy service to external databases, it would still have the same issue: you’d have to 100% rely on that proxy working. Sigh. Nothing is ever simple!

The elusive goal I think is SQL-over-HTTP. Like a database you can use serverless-ly. All of the rivals in this space have their own issues.

If you run flyctl checks list against your databases, you’ll see a spiffy new proxy check in there. We also patched the stolon lib to try and workaround the bug.

The most reliable way to connect to your database is to grab the private VM ips and make a connection string like this:

postgres://<user>:<password>@[fdaa:0:5d9:a7b:a98:0:fa3:2]:5433,[fdaa:0:5d9:a7b:a9a:0:fa2:2]:5433/database?target_session_attrs=read-write

Assuming you’re using a driver that uses libpq under the covers, that will try both IPs and give you a connection to the writable one and not rely on any proxies.

We haven’t quite figured out how to make this easy yet. We’ll probably ship libraries for it someday, it’s relatively simple to do the dns lookup to get the VM ips and make the URL.

Interesting. I wasn’t aware it could pick like that. Are those IPv6 the ones from flyctl info? And do they ever change?

I could hard-code that style of URL in to my own environment variable. Rather than use the DATABASE_URL, which would go via the proxy.

Or … you could change the DATABASE_URL to be instead be structured like that. But I guess that wouldn’t be suited for people with more than 2 VMs.

Those are the IPs from flyctl ips private, you can use our DNS server to resolve them. What language are you using? I can probably give you a little snippet.

Nodejs.

But do you mean I would have to dynamically get them and build the connection string? I couldn’t just get them once and set that connection string in env?

Which node module are you using?

You could set an env var/secret, yes, but that’s kind of brittle. And I’d like to have an example for node users anyway.

pg-promise

Which originally was to add promises to node-postgres. But then added more good stuff like named parameters etc.

I’m connecting like this:

const pgp = require('pg-promise')({
    capSQL: true
});

let pgdb = pgp({
       connectionString: process.env.DATABASE_URL
});