Can't reach database #ams #flycast

Hi,

I’m having an issue with a PG database hosted in ams. My app, which was working, is suddenly unable to connect to the database. I tried to redeploy, but the deployment fails for the same reason.

Error: P1001: Can't reach database server at `fbnb-xxx-db.flycast`:`5432`

Everything seems normal:

➜  ~ fly checks list -a fbnb-xxx-db
Health Checks for fbnb-xxx-db
  NAME | STATUS  | MACHINE        | LAST UPDATED         | OUTPUT
-------*---------*----------------*----------------------*-----------------------------------------------------------------------------
  pg   | passing | 4d891d60a16578 | 2023-08-22T23:57:14Z | [✓] connections: 11 used, 3 reserved, 300 max (3.6ms)
       |         |                |                      | [✓] cluster-locks: No active locks detected (8.88µs)
       |         |                |                      | [✓] disk-capacity: 14.4% - readonly mode will be enabled at 90.0% (9.21µs)
-------*---------*----------------*----------------------*-----------------------------------------------------------------------------
  role | passing | 4d891d60a16578 | 2023-08-22T23:57:17Z | primary
-------*---------*----------------*----------------------*-----------------------------------------------------------------------------
  vm   | passing | 4d891d60a16578 | 2023-08-22T23:57:11Z | [✓] checkDisk: 846 MB (85.6%) free space on /data/ (43.41µs)
       |         |                |                      | [✓] checkLoad: load averages: 0.01 0.04 0.16 (48.63µs)
       |         |                |                      | [✓] memory: system spent 0s of the last 60s waiting on memory (34.47µs)
       |         |                |                      | [✓] cpu: system spent 138ms of the last 60s waiting on cpu (14.64µs)
       |         |                |                      | [✓] io: system spent 0s of the last 60s waiting on io (12.85µs)
-------*---------*----------------*----------------------*-----------------------------------------------------------------------------

And I’m able to connect directly via fly proxy (although it is using .internal instead of .flycast):

➜  ~ fly proxy 15432:5432 -a fbnb-exam-db
Proxying local port 15432 to remote [fbnb-exam-db.internal]:5432

I’m having the exact same issue. Not only can my app no longer reach the database, but the app itself is also very unresponsive. Both are hosted in the ams region. It may be related to this issue: App broken: could not find a good candidate within 90 attempts at load balancing. - #5 by Beaux

The weird thing is, using fly proxy works fine to connect to the database from my dev PC. So it seems the database app itself is fine, and it’s the proxy/networking between the app and the outside world that is broken.

I can also connect via ssh to my database app using fly ssh console --app my_database_app, but sshing to my main app gives error connecting to SSH server: connect tcp ... operation timed out.

Over ssh I ran pg_isready --host=localhost, and it says the database itself is fine and ready to accept connections.
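
Since pg_isready on localhost only proves the database process is healthy, a raw TCP check run from the machine that cannot connect may help separate a network problem from a database problem. Below is a minimal sketch, assuming Python is available on that machine; check_tcp is a hypothetical helper name, not part of any Fly tooling.

```python
import socket


def check_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a single TCP connection attempt to host:port.

    Hypothetical helper for illustration only.
    """
    try:
        # create_connection resolves the name and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "reachable"
    except socket.gaierror:
        # The hostname could not be resolved.
        return "dns-failure"
    except socket.timeout:
        # No answer within the timeout (consistent with dropped packets).
        return "timeout"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except OSError as exc:
        return f"error: {exc}"
```

For example, calling check_tcp("fbnb-xxx-db.flycast", 5432) from the app machine would return "timeout" if packets are being dropped on the way, while "refused" or "dns-failure" would point at a different kind of problem.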

Hi @DAlperin, Hi @jerome,

I noticed this post, and particularly this comment:

Could it be related?

The host where your machine is hosted is having issues.

It should show in your Fly dashboard.

Hi @jerome,

Mmmh I was checking https://status.flyio.net/ and saw (and still see) nothing. Hence my question!

I didn’t notice this one in my dashboard. My bad!

The service interruption began yesterday. Are there any updates on that? What would be an alternative to mitigate this issue?

Thank you.

The unfortunate answer to that is to have at least some redundancy (like a cluster of 2 machines for your postgres). That’s not a satisfying answer, especially after the fact.

Our upstream provider has identified that this particular host suffers from a bad case of packet loss. Somehow we didn’t catch it earlier; we’ll be modifying our alerts to fix that.

The packet loss does not happen from all hosts; it’s a weird one. That is likely why you’ve been able to access it via the internal network.
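
Because packet loss is intermittent and depends on the path, a single successful connection does not say much. One rough way to see it from a given machine is to repeat the connection attempt and look at the success rate. This is only a sketch under assumptions: connect_success_rate is a hypothetical helper, and TCP retransmissions hide part of the loss, so the number is a coarse signal and not a real packet-loss measurement.

```python
import socket


def connect_success_rate(host: str, port: int,
                         attempts: int = 20, timeout: float = 2.0) -> float:
    """Return the fraction of TCP connection attempts that succeed.

    Hypothetical helper for illustration only.
    """
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    ok = 0
    for _ in range(attempts):
        try:
            # Each attempt performs a fresh TCP handshake.
            with socket.create_connection((host, port), timeout=timeout):
                ok += 1
        except OSError:
            # Timeouts, refusals and DNS errors all count as failures.
            pass
    return ok / attempts
```

Running it from two different machines against the same database host and port, and getting clearly different rates, would be consistent with the "not from all hosts" behaviour described above.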

I’m trying to think of a temporary solution until the host is fixed, let me get some help from others.

@binajmen I think things are better now? We had to reboot the server and your database wasn’t starting anymore (we’re still investigating how it got into this state, but we were able to start it).

As far as I can tell, the host is now reachable from every other host. So your app should be able to communicate with the database.

Thank you @jerome, I can confirm it is working again :muscle:

I will do that. Is there a guide on how to do it properly? I suppose I should choose a different region?

It seems to have been fixed on my side as well. Thanks :+1:

@binajmen At first I didn’t understand how scaling would fix this issue, because I tried fly scale count 2 and it would still fail to connect. However, I just found out you can scale your app across different regions like this:

fly scale count 3 --region yyz,ewr

That would’ve probably prevented this issue.

You should choose close-by regions. Or, if you stay in the same region, volumes will be put in different “zones” by default (or manually: fly volumes create --require-unique-zone).
