I’m having an issue with a PG database hosted in ams. My app, which was working, is suddenly unable to connect to the database. I tried to redeploy, but the deployment fails for the same reason.
Error: P1001: Can't reach database server at `fbnb-xxx-db.flycast`:`5432`
Everything seems normal:
➜ ~ fly checks list -a fbnb-xxx-db
Health Checks for fbnb-xxx-db
NAME | STATUS | MACHINE | LAST UPDATED | OUTPUT
-------*---------*----------------*----------------------*-----------------------------------------------------------------------------
pg | passing | 4d891d60a16578 | 2023-08-22T23:57:14Z | [✓] connections: 11 used, 3 reserved, 300 max (3.6ms)
| | | | [✓] cluster-locks: No active locks detected (8.88µs)
| | | | [✓] disk-capacity: 14.4% - readonly mode will be enabled at 90.0% (9.21µs)
-------*---------*----------------*----------------------*-----------------------------------------------------------------------------
role | passing | 4d891d60a16578 | 2023-08-22T23:57:17Z | primary
-------*---------*----------------*----------------------*-----------------------------------------------------------------------------
vm | passing | 4d891d60a16578 | 2023-08-22T23:57:11Z | [✓] checkDisk: 846 MB (85.6%) free space on /data/ (43.41µs)
| | | | [✓] checkLoad: load averages: 0.01 0.04 0.16 (48.63µs)
| | | | [✓] memory: system spent 0s of the last 60s waiting on memory (34.47µs)
| | | | [✓] cpu: system spent 138ms of the last 60s waiting on cpu (14.64µs)
| | | | [✓] io: system spent 0s of the last 60s waiting on io (12.85µs)
-------*---------*----------------*----------------------*-----------------------------------------------------------------------------
And I’m able to connect directly via fly proxy (although it is using .internal instead of .flycast):
➜ ~ fly proxy 15432:5432 -a fbnb-exam-db
Proxying local port 15432 to remote [fbnb-exam-db.internal]:5432
The weird thing is, using fly proxy works fine to connect to the database from my dev PC. So the database app itself seems fine; it’s the proxy/networking between the app and the outside world that seems to be broken.
I can also connect via ssh to my database app using fly ssh console --app my_database_app, but SSHing into my main app fails with: error connecting to SSH server: connect tcp ... operation timed out.
Over ssh I ran pg_isready --host=localhost, and it says the database itself is fine and ready to accept connections.
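pg_isready on localhost only shows that Postgres is up on the machine itself. To see whether the port is reachable over the network from wherever the app runs, a plain TCP connection attempt can help. The following is a minimal sketch, not an official Fly.io tool; the function name check_tcp and the returned labels are made up for illustration.

```python
import socket


def check_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Try one TCP connection to host:port and classify the outcome."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "reachable"
    except socket.gaierror:
        # The hostname could not be resolved at all.
        return "dns-failure"
    except socket.timeout:
        # No answer within the timeout: typical of dropped packets or a dead route.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except OSError as exc:
        # Anything else (network unreachable, etc.).
        return f"error: {exc}"
```

Run from the app's machine, check_tcp("fbnb-xxx-db.flycast", 5432) would help separate a name resolution problem from a network path problem. A "timeout" result while pg_isready passes on the database machine would point at the network between the two, not at Postgres.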
The unfortunate answer to that is to have at least some redundancy (like a cluster of 2 machines for your postgres). That’s not a satisfying answer, especially after the fact.
Our upstream provider has identified that this particular host suffers from a bad case of packet loss. Somehow we didn’t catch it earlier; we’ll be modifying our alerts to fix that.
The packet loss does not happen from all hosts; it’s a weird one. That’s likely why you’ve been able to access it via the internal network.
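Because the loss only shows up from some hosts, a single successful connection does not prove much. A rough way to see it from a given machine is to repeat the connection attempt and count failures. This is only a sketch: probe_loss is a hypothetical helper, and failed TCP connects are a coarse proxy for packet loss, since TCP retries dropped packets on its own.

```python
import socket
import time


def probe_loss(host: str, port: int, attempts: int = 20,
               timeout: float = 2.0, pause: float = 0.5) -> float:
    """Return the fraction of TCP connection attempts that failed."""
    failures = 0
    for i in range(attempts):
        try:
            # Open and immediately close a connection; only the handshake matters.
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError:
            # Timeouts, refusals and DNS errors are all OSError subclasses.
            failures += 1
        if i < attempts - 1:
            time.sleep(pause)  # space the probes out a little
    return failures / attempts
```

A result well above 0.0 from one machine and 0.0 from another would match the "not from all hosts" behaviour described above.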
I’m trying to think of a temporary solution until the host is fixed, let me get some help from others.
@binajmen I think things are better now? We had to reboot the server and your database wasn’t starting anymore (we’re still investigating how it got into this state, but we were able to start it).
As far as I can tell, the host is now reachable from every other host. So your app should be able to communicate with the database.
It seems to have been fixed on my side as well. Thanks
@binajmen At first I didn’t understand how scaling would fix this issue, because I tried fly scale count 2 and it still failed to connect. However, I just found out you can scale your app across different regions like this:
You should choose close-by regions. Or, if you stay in the same region, volumes will be put in different “zones” by default (or manually: fly volumes create --require-unique-zone).