Cannot connect to SEA region DB

I have not been able to connect to my database running in the SEA region all day. Is this related to the connectivity issues earlier? I still cannot connect as of writing this post.

I recently started getting database connection errors in my logs:

hkg [info]00:44:19.832 [error] Postgrex.Protocol (#PID<0.2706.0>) failed to connect: ** (DBConnection.ConnectionError) tcp recv (idle): closed

The app in question “crm-backend-prod” is deployed in both sea and hkg, with a primary db in sea and a replica in hkg. It looks like hkg cannot access the primary.

The status page seems to suggest there are no known issues.

Same here

2022-04-01T00:56:18.815 app[96ee012e] sea [info] 00:56:18.814 [error] Postgrex.Protocol (#PID<0.2573.0>) failed to connect: ** (DBConnection.ConnectionError) tcp recv: closed

I’m looking at these now; there are no obvious issues, but it does seem like something is interfering.

The crm-backend-prod DB had a node in an unhealthy state (fly status -a <db name> showed it). I stopped that instance with fly vm stop <id>, and it came back up healthy. It looks like the app logs are happy now?
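The recovery above can be sketched as a small script, with the caveat that everything here is a guess at what your `fly status` output looks like: the heredoc is invented sample data, and in practice you would pipe in `fly status -a <db-name>` instead and run the printed command by hand.

```shell
# Hypothetical helper for the recovery steps above: scan (sample) `fly status`
# output for instances with critical health checks, and print the
# `fly vm stop` command that would recycle them.
sample_status=$(cat <<'EOF'
ID       PROCESS VERSION REGION DESIRED STATUS  HEALTH CHECKS
96ee012e app     42      sea    run     running 3 total, 3 passing
b7639787 app     42      hkg    run     running 3 total, 2 critical
EOF
)

# awk picks out any row mentioning "critical" and prints the instance ID
echo "$sample_status" | awk '/critical/ { print "fly vm stop " $1 }'
# prints: fly vm stop b7639787
```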

@Paradox I see a recent restart on one of your DBs, did that clear anything up? I don’t see any issues with the prod-db.

How are you trying to connect? Is your app failing to connect or are you attempting to connect from your local machine?

I had restarted both the DB and the app but was seeing the same problem.
However, I’ve just restarted both again and it seems to be working again.


Thanks mate - that app is back.

Do I need to put monitoring in place that tracks the status of unhealthy nodes? I kind of expected fly to handle this but I can do that if required.

I’m still digging, but the problem seems to have been one of the DB instances missing from internal DNS. fly restart wouldn’t have fixed it in this case, but stopping the VM entirely forced it to re-register.

I believe what happened here is that the network issues on one of the hosts (which is hosting your DB) caused your DB to fail over. The replica didn’t have a DNS entry, so once it took over, the other DBs couldn’t connect to it (and neither could your app).

We should have caught this issue, but it’s always worth monitoring your DB. At the very least it gives you a good idea of when it’s an app-level problem vs. something on our end.
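A minimal sketch of that monitoring, assuming nothing beyond bash and coreutils: a plain TCP probe against the DB. The hostname `crm-backend-prod-db.internal` and port 5432 are placeholders, not names from this thread; substitute your own DB's internal address.

```shell
# Hedged sketch of basic DB monitoring: probe the DB's TCP port and complain
# if the connection is not accepted. Hostname/port below are placeholders.
check_db() {
  # bash's /dev/tcp redirection succeeds only if the TCP connection is accepted;
  # timeout caps how long we wait for slow DNS or unreachable hosts.
  timeout 3 bash -c ">/dev/tcp/$1/$2" 2>/dev/null
}

if check_db "crm-backend-prod-db.internal" 5432; then
  echo "db reachable"
else
  echo "db unreachable -- run fly status / fly checks list and alert someone"
fi
```

Run this from cron (or any scheduler) and wire the failure branch into whatever alerting you already have.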

If you come across this thread looking for DB connection errors, here’s what you need to do:

  1. Check the status of your DB VMs with fly status -a <db-name>
  2. If you see VMs that aren’t passing health checks, run fly checks list -a <db-name>. This will give you some hints about what might be broken.
  3. Check the DB IPs and DNS entries for your database:
    • fly ips private will show the internal IPs for each VM
    • fly dig aaaa <db-name>.internal should show the same IPs
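Step 3 above can be automated with a line of shell. This is a hedged example, not an official tool: the two IP lists are invented sample data standing in for the output of `fly ips private` and `fly dig aaaa <db-name>.internal`.

```shell
# Compare VM private IPs against internal DNS. Sample data only -- in practice
# these variables would hold the IPs extracted from the two fly commands above.
vm_ips="fdaa:0:1:a7b:1:2:3:4
fdaa:0:1:a7b:5:6:7:8"
dns_ips="fdaa:0:1:a7b:1:2:3:4"

# comm -23 prints lines present in the first (sorted) list but missing from
# the second: a VM with no DNS entry, which is exactly the failure mode in
# this thread (a full `fly vm stop` forced re-registration; fly restart did not).
missing=$(comm -23 <(echo "$vm_ips" | sort) <(echo "$dns_ips" | sort))
echo "$missing"
# prints: fdaa:0:1:a7b:5:6:7:8
```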

I’m trying to connect via a WireGuard tunnel, or via the flyctl commands. Both time out eventually. I can connect to our app, and to DBs in other regions, but not to the SEA DB.

We were seeing some 502s earlier today for one of our apps, but they seem to have mostly resolved themselves.

We might have just fixed the issues you were having. WireGuard peers and flyctl both connect through gateways, and our gateways were having trouble routing to some VMs. This was related to the earlier Seattle outage, but not exactly the same problem as the one we found a few hours ago.

Can you give it another try and see if that helps?


I can now connect to the DB. :+1:

Seems like this issue has reared its ugly head again. I’ve followed the instructions listed by @kurt, and I can see that my pg node is unhealthy:

role critical b7639787   hkg    HTTP 7h38m ago    failed to connect to local node: context deadline
pg   critical b7639787   hkg    HTTP 7h38m ago    HTTP GET 500 Internal Server Error Output: "failed to connect to proxy: context deadline exceeded"

What do I do from here?

FYI, flyctl restart -a ... fixed the issue, which is “cool”, but it would be better if the app handled this itself. On our end it is hard to tell the difference between a blip and a serious outage, so we were down for a while until I figured this out.
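One way to attack the "blip vs. serious outage" problem is to only act after several consecutive failed probes. A rough sketch, where `probe` is a stub standing in for a real check (for example, something built on `fly checks list -a <db-name>`), and the restart command is only echoed rather than run:

```shell
# Hypothetical watchdog: require N consecutive probe failures before deciding
# to restart, so a one-off blip triggers nothing. `probe` is a stub here --
# it always fails, purely to exercise the counting logic.
probe() { return 1; }

failures=0
threshold=3
for attempt in 1 2 3; do
  if probe; then
    failures=0          # any success resets the streak
  else
    failures=$((failures + 1))
  fi
  # a real watchdog would `sleep` between probes here
done

if [ "$failures" -ge "$threshold" ]; then
  echo "would run: fly restart -a <db-name>"
fi
```

Note that per this thread, fly restart may not fix the missing-DNS failure mode; a stuck instance can need a full fly vm stop instead.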