Cannot connect to SEA region DB

Paradox · April 1, 2022, 12:50am

I have not been able to connect to my database running in the SEA region all day. Is this related to the connectivity issues earlier? I still cannot connect as of writing this post.

dad · April 1, 2022, 12:52am

I’ve recently started getting database connection errors in my logs:

hkg [info]00:44:19.832 [error] Postgrex.Protocol (#PID<0.2706.0>) failed to connect: ** (DBConnection.ConnectionError) tcp recv (idle): closed

The app in question “crm-backend-prod” is deployed in both sea and hkg, with a primary db in sea and a replica in hkg. It looks like hkg cannot access the primary.

The status page seems to suggest there are no known issues.

alanb · April 1, 2022, 12:57am

Same here

2022-04-01T00:56:18.815 app[96ee012e] sea [info] 00:56:18.814 [error] Postgrex.Protocol (#PID<0.2573.0>) failed to connect: ** (DBConnection.ConnectionError) tcp recv: closed

kurt · April 1, 2022, 12:59am

I’m looking at these. There are no obvious issues, but it does seem like something is interfering.

kurt · April 1, 2022, 1:02am

The crm-backend-prod DB had a node in an unhealthy state (fly status -a <db name> showed it). I stopped that instance with fly vm stop <id>, and it came back up healthy. It looks like the app logs are happy now?

kurt · April 1, 2022, 1:06am

@Paradox I see a recent restart on one of your DBs, did that clear anything up? I don’t see any issues with the prod-db.

How are you trying to connect? Is your app failing to connect or are you attempting to connect from your local machine?

alanb · April 1, 2022, 1:08am

I had restarted both the DB and the app but was still seeing the same problem.
However, I’ve just restarted both again and it seems to be working now.

dad · April 1, 2022, 1:11am

Thanks mate - that app is back.

Do I need to put monitoring in place that tracks the status of unhealthy nodes? I kind of expected fly to handle this but I can do that if required.

kurt · April 1, 2022, 1:12am

I’m still digging, but the problem seemed to be one of the DB instances missing from internal DNS. fly restart wouldn’t have fixed it, in this case, but stopping the VM entirely forced it to re-register.

I believe what happened here is that network issues on one of the hosts (the one hosting your DB) caused your DB to fail over. The replica didn’t have a DNS entry, so once it took over, the other DBs couldn’t connect (nor could your app).

We should have caught this issue, but it’s always worth monitoring your DB. At the very least it gives you a good idea of when it’s an app level problem vs something on our end.
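
As a starting point for that kind of monitoring, here is a minimal sketch of a TCP reachability probe. It only checks that a connection to the Postgres port can be opened, so it would not catch every failure mode in this thread (a connection that opens and is then closed, for example). The function name can_connect is made up for illustration, and 5432 is the standard Postgres port; adjust the host and port to whatever your app actually connects to.

```python
import socket


def can_connect(host, port=5432, timeout=3.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper: it proves the port is reachable, not that
    Postgres is healthy or accepting queries.
    """
    try:
        # create_connection resolves the name and tries each address in turn.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False
```

Run from inside the private network, something like can_connect("<db-name>.internal") on a schedule gives you a signal that is independent of your app's own logs.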

kurt · April 1, 2022, 1:17am

If you come across this thread looking for DB connection errors, here’s what you need to do:

1. Check the status of your DB VMs with fly status -a <db-name>.
2. If you see VMs that aren’t passing health checks, run fly checks list -a <db-name>. This will give you some hints about what might be broken.
3. Check the DB IPs and DNS entries for your database:
   - fly ips private will show the internal IPs for each VM
   - fly dig aaaa <db-name>.internal should show the same IPs
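
The IP and DNS comparison in the last step can be automated. The sketch below is one possible way to do it, not an official tool: it pulls every IPv6 address out of the two command outputs and reports the differences. The function names are made up for illustration, and it assumes that both subcommands accept -a the way fly status does and that addresses appear as whitespace-separated tokens in the output.

```python
import ipaddress
import subprocess


def extract_ipv6(text):
    """Return the set of IPv6 addresses found among whitespace-separated tokens."""
    found = set()
    for token in text.split():
        try:
            addr = ipaddress.ip_address(token.strip(",;"))
        except ValueError:
            continue  # not an IP address (headers, IDs, region names, ...)
        if addr.version == 6:
            found.add(addr)
    return found


def compare_ips(vm_output, dns_output):
    """Compare the addresses listed for the VMs with those returned by DNS."""
    vm_ips = extract_ipv6(vm_output)
    dns_ips = extract_ipv6(dns_output)
    return {
        # A VM with no DNS entry is the failure described in this thread.
        "missing_from_dns": sorted(str(a) for a in vm_ips - dns_ips),
        "stale_in_dns": sorted(str(a) for a in dns_ips - vm_ips),
    }


def check_db_dns(db_name):
    """Run both fly commands for db_name and compare their output.

    Assumption: `fly ips private` and `fly dig` both accept `-a <app>`.
    """
    def run(cmd):
        return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

    vm_out = run(["fly", "ips", "private", "-a", db_name])
    dns_out = run(["fly", "dig", "aaaa", f"{db_name}.internal", "-a", db_name])
    return compare_ips(vm_out, dns_out)
```

Calling check_db_dns("<db-name>") returns a dict; a non-empty missing_from_dns list means a VM has no entry in internal DNS.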

Paradox · April 1, 2022, 1:37am

I’m trying to connect via a Wireguard tunnel, or via the flyctl commands. Both time out eventually. I can connect to our app, or other DBs in other regions, but not the SEA db

We were seeing some 502s earlier today for one of our apps, but they seem to have mostly resolved themselves

kurt · April 1, 2022, 1:39am

We might have just fixed the issues you were having. Wireguard peers and flyctl both connect through gateways. Our gateways were having issues routing to some VMs. This was related to the earlier Seattle outage, but not exactly the same problem we found a few hours ago.

Can you give it another try and see if that helps?

Paradox · April 1, 2022, 1:41am

I can now connect to the DB.

dad · April 29, 2022, 10:37am

Seems like this issue has reared its ugly head again. I’ve followed the instructions listed by @kurt and I can see that my pg node is unhealthy:

role critical b7639787   hkg    HTTP 7h38m ago    failed to connect to local
                                                  node: context deadline
                                                  exceeded
pg   critical b7639787   hkg    HTTP 7h38m ago    HTTP GET
                                                  http://172.19.1.74:5500/flycheck/pg:
                                                  500 Internal Server Error Output:
                                                  "failed to connect to proxy: context
                                                  deadline exceeded"

What do I do from here?

dad · April 29, 2022, 10:43am

fyi flyctl restart -a ... fixed the issue which is “cool” but it would be better if the app handled this itself. On our end it is hard to tell the difference between a blip and a serious outage - so we were down for a bit until I figured this out.
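
One common way to separate a blip from a real outage is to count consecutive failed checks and only escalate once a threshold is crossed. The sketch below is a generic illustration of that idea, not something Fly provides; the class name OutageDetector and the default threshold of 3 are arbitrary choices.

```python
class OutageDetector:
    """Classify health check results by how many failures occur in a row."""

    def __init__(self, threshold=3):
        self.threshold = threshold          # failures in a row that count as an outage
        self.consecutive_failures = 0

    def record(self, ok):
        """Record one check result and return 'ok', 'blip', or 'outage'."""
        if ok:
            self.consecutive_failures = 0   # any success resets the streak
            return "ok"
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold:
            return "outage"
        return "blip"
```

Feed it the result of whatever check you already run (a query, a health endpoint, a connection attempt) and alert or restart only when it returns "outage".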
