I’ve noticed some issues with my postgres databases over the past 24 hours or so.
I got some errors about queries failing.
When I look in the logs for my app (I’m using node, with pg and pg-promise), I see: Error: Connection terminated unexpectedly
… when I log err.message.
When I look in the database app logs (using flyctl logs) I see a load of this stuff (not sure how much is secret so I put APP and random). Which doesn’t … look good. Since it’s only a tail, not sure how far back these lines go:
To further confuse things, I can no longer connect to the database from my local machine, despite having wireguard running and not changing anything like name, password. Same connection details.
All of the flyctl status stuff looks fine, it says healthcheck passed etc.
Yet I haven’t changed anything. I did re-deploy the app since to experiment with, but haven’t restarted the database apps or touched them. Not sure how to debug really. I could restart them, but wanted to see what had gone wrong first ideally.
Hmm … and without changing anything or restarting, I am now able to connect to one of my two database apps locally. About ten minutes ago, I couldn’t.
And the app it is attached to is returning data from it too. Same query, same connection details etc.
The other database app is still not responding though.
And now tailing logs isn’t responding. Or they have been deleted. Or the log debug level has been changed at your end. Since now the flyctl logs is not responding with anything. Hmm …
Those consul errors in the logs are a problem, yes, seems like it’s having difficulty connecting to the shared consul service.
It is safe to restart the Postgres VMs, or flyctl vm stop <id>. When you are troubleshooting this, you should try connecting to both port 5432 and port 5433. 5432 is the proxy to connect you to “primary”, 5433 connects you directly to the postgres instance (which might be readonly/replica).
We’re investigating, that consul service is healthy. It’s weird to get an EOF from it. It might have something to do with consul running in NA and your database being in London.
Interestingly I tried that locally, and yes, if I change my local connection to be 5433, I can connect using that. Just not using 5432.
And that would explain why the app can no longer connect to do its queries: I see the process.env.DATABASE_URL which I’m using to connect with (as the app is attached to the database app) ends 5432.
So the temporary fix would be to use 5433 but it sounds like that’s not a good idea since I want to INSERT too, and so those will fail if it’s read-only.
I can try a restart, but if it’s the proxy that’s the issue, that sounds like a restart won’t actually help since the VM itself is ok (as reported by healthchecks etc)?
Is it working now? I restarted both VMs and the logs look good. I’m guessing what happened is Stolon gave up trying to detect a leader, and the connection proxy just wasn’t accepting connections. That proxy runs in the VM alongside the postgres process, so restarting probably helped.
Out of curiosity, did you happen to run flyctl status on the postgres when this was happening? And did it show health checks passing or failing? It looked like things were healthy by the time I looked, but we don’t have a connection proxy healthcheck built in (which we should do!)
I have two database apps. And each of those has two VMs (as that is built into your service).
Yesterday neither database app was working. Well, as in the 5432 wasn’t accepting connections, as it’s turned out. Hence neither me nor the app was able to connect, to either of them.
Without changing anything, one of the two database apps did start allowing connections on 5432 earlier today, as I and the app was able to connect without changing anything. I’m guessing that is the same app, the dev one, that you restarted just now? That app was already working from a little earlier today. Both 5432 and 5433.
It’s the other database app, the com one, that currently won’t let me connect on 5432, but will on 5433. So I’ve just checked flyctl status on that, and it reports all is well. 0 restarts reported. Healthchecks passing (I checked that as well yesterday, same thing then).
Yet the app and I can’t connect on 5432 to it, as of now.
Oh, and both apps had the page of consul errors in their logs yesterday. So it may be the same issue with both.
Now the dev one has auto-resolved itself after 24 hours of un-connectability (definitely a word). Not that it mattered in this case, hence the lack of urgency. The com one has not.
Does sound like a healthcheck on that proxy may be needed. As it sounds like that’s the cause of these issues. Or at least connected to it with the consul stuff. As that DATABASE_URL uses the proxy port and so if that’s not working, the database itself is out of action.
I just tried connecting to com on 5432 to it locally, and that worked. And the app attached to it also reports it is able to do queries and is returning results again.
So if I happen to notice this again, the answer is to restart. Turn it off and on again I understand it’s still in beta. I guess it’s not safe for production quite yet.
Now that we’ve seen this particular problem, we can make it handle it next time. It looks like we hit this bug, I’m fairly sure we can workaround it.
The best bet right now is to setup alert handlers to make sure you’re notified when things go south. In hindsight, a “can we connect to the proxy” check is obvious.
Would those alerts have helped in this case? As from what I saw, the VM healthchecks all reported all was well. Even the pg specific stuff had ticks and ok.
I’d rather just have it work though. And not need restarting at 4am Sounds like you can fix the issues.
This got me thinking that even if there was a proxy service to external databases, it would still have the same issue: you’d have to 100% rely on that proxy working. Sigh. Nothing is ever simple!
The elusive goal I think is SQL-over-HTTP. Like a database you can use serverless-ly. All of the rivals in this space have their own issues.
If you run flyctl checks list against your databases, you’ll see a spiffy new proxy check in there. We also patched the stolon lib to try and workaround the bug.
The most reliable way to connect to your database is to grab the private VM ips and make a connection string like this:
Assuming you’re using a driver that uses libpq under the covers, that will try both IPs and give you a connection to the writable one and not rely on any proxies.
We haven’t quite figured out how to make this easy yet. We’ll probably ship libraries for it someday, it’s relatively simple to do the dns lookup to get the VM ips and make the URL.
Those are the IPs from flyctl ips private, you can use our DNS server to resolve them. What language are you using? I can probably give you a little snippet.
But do you mean I would have to dynamically get them and build the connection string? I couldn’t just get them once and set that connection string in env?