But it fails after timeout (2/3 checks). Here is the list of machines:
fly status -a raczekteam-db
ID              STATE    ROLE     REGION  CHECKS                          IMAGE                               CREATED               UPDATED
7811359a92dde8  started  replica  waw     3 total, 2 passing, 1 critical  flyio/postgres-flex:15.3 (v0.0.46)  2024-02-02T00:02:13Z  2024-02-02T00:27:02Z
4d8979df452587  started  primary  waw     3 total, 3 passing              flyio/postgres-flex:15.3 (v0.0.46)  2023-06-30T13:14:38Z  2024-02-01T23:38:01Z
3287961a027328  started  replica  waw     3 total, 3 passing              flyio/postgres-flex:15.3 (v0.0.46)  2024-02-01T23:03:38Z  2024-02-01T23:36:14Z
and
fly checks list -a raczekteam-db
Health Checks for raczekteam-db
NAME | STATUS | MACHINE | LAST UPDATED | OUTPUT
-------*----------*----------------*--------------*-------------------------------------------------------------------------------
pg | passing | 3287961a027328 | 1h30m ago | [✓] connections: 10 used, 3 reserved, 300 max (3.86ms)
-------*----------*----------------*--------------*-------------------------------------------------------------------------------
role | passing | 3287961a027328 | 1h30m ago | replica
-------*----------*----------------*--------------*-------------------------------------------------------------------------------
vm | passing | 3287961a027328 | 1h30m ago | [✓] checkDisk: 823.34 MB (83.5%) free space on /data/ (91.16µs)
| | | | [✓] checkLoad: load averages: 0.00 0.00 0.00 (81.29µs)
| | | | [✓] memory: system spent 0s of the last 60s waiting on memory (69.13µs)
| | | | [✓] cpu: system spent 612ms of the last 60s waiting on cpu (48.91µs)
| | | | [✓] io: system spent 630ms of the last 60s waiting on io (23.52µs)
-------*----------*----------------*--------------*-------------------------------------------------------------------------------
pg | passing | 4d8979df452587 | 5h9m ago | [✓] connections: 18 used, 3 reserved, 300 max (16.61ms)
| | | | [✓] cluster-locks: No active locks detected (21.05µs)
| | | | [✓] disk-capacity: 16.5% - readonly mode will be enabled at 90.0% (158.67µs)
-------*----------*----------------*--------------*-------------------------------------------------------------------------------
role | passing | 4d8979df452587 | 56m33s ago | primary
-------*----------*----------------*--------------*-------------------------------------------------------------------------------
vm | passing | 4d8979df452587 | 5h4m ago | [✓] checkDisk: 823.27 MB (83.5%) free space on /data/ (97.39µs)
| | | | [✓] checkLoad: load averages: 1.02 1.22 0.23 (89.53µs)
| | | | [✓] memory: system spent 276ms of the last 60s waiting on memory (68.8µs)
| | | | [✓] cpu: system spent 1.86s of the last 60s waiting on cpu (62.69µs)
| | | | [✓] io: system spent 150ms of the last 60s waiting on io (37.67µs)
-------*----------*----------------*--------------*-------------------------------------------------------------------------------
pg | passing | 7811359a92dde8 | 7m25s ago | [✓] connections: 10 used, 3 reserved, 300 max (55.92ms)
-------*----------*----------------*--------------*-------------------------------------------------------------------------------
role | passing | 7811359a92dde8 | 14m10s ago | replica
-------*----------*----------------*--------------*-------------------------------------------------------------------------------
vm | critical | 7811359a92dde8 | 31m56s ago | connect: connection refused
-------*----------*----------------*--------------*-------------------------------------------------------------------------------
I’m really tired of all the problems with Fly.io. Every time I have to change something, there is something that doesn’t work…
These health checks only update when the underlying state changes. Your connections are working just fine, but since the underlying VM check is still failing, the text never updates. It’s kind of a bummer, but there are reasons for this…
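To make the point about stale check text concrete, here is a minimal Python sketch that reads the pasted fly checks list table and pulls out only the rows that are not passing. The function name failing_checks and the SAMPLE text are my own (the rows are copied from the output above); this is just text parsing, not a Fly.io API. Keep in mind that LAST UPDATED on such a row is the time since the check last changed state, not the time it last ran.

```python
# Three rows copied from the `fly checks list` output in this thread.
SAMPLE = """\
pg | passing | 7811359a92dde8 | 7m25s ago | [✓] connections: 10 used, 3 reserved, 300 max (55.92ms)
-------*----------*----------------*--------------*------------------
role | passing | 7811359a92dde8 | 14m10s ago | replica
-------*----------*----------------*--------------*------------------
vm | critical | 7811359a92dde8 | 31m56s ago | connect: connection refused
"""


def failing_checks(table_text):
    """Return (name, status, machine, last_updated, output) for non-passing rows."""
    failing = []
    for line in table_text.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        # Separator lines have no pipes; continuation rows have an empty NAME cell.
        if len(cells) < 5 or not cells[0] or cells[0] == "NAME":
            continue
        name, status, machine, updated = cells[:4]
        output = " | ".join(cells[4:])
        if status != "passing":
            failing.append((name, status, machine, updated, output))
    return failing


for row in failing_checks(SAMPLE):
    print(row)
```

For the table above this leaves only the vm check on 7811359a92dde8, which is the one worth looking at on the machine itself.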
Anyways, you can find the actual issue by ssh’ing into the machine that has the failing check:
fly ssh console -s 7811359a92dde8 -a raczekteam-db
? Select VM: waw: 7811359a92dde8 fdaa:2:6ec2:a7b:18e:23d2:789b:2 cold-hill-9640 (replica) (app)
Error: host unavailable at 7811359a92dde8: host was not found in DNS
I can’t even ssh into the machine. I’ve also checked the forum and tried to see whether IPs are assigned, as suggested in one of the threads:
flyctl ips list --app raczekteam-db
VERSION IP TYPE REGION CREATED AT
v6 fdaa:2:6ec2:0:1::6 private global Jun 30 2023 13:14
Learn more about Fly.io public, private, shared and dedicated IP addresses in our docs: https://fly.io/docs/reference/services/#ip-addresses
flyctl ips private --app raczekteam-db
ID REGION IP
4d8979df452587 waw fdaa:2:6ec2:a7b:c8:a887:fec5:2
3287961a027328 waw fdaa:2:6ec2:a7b:8c:ae2e:4ab0:2
7811359a92dde8 waw fdaa:2:6ec2:a7b:18e:23d2:789b:2
For some reason it suddenly allowed me to ssh into the machine. I haven’t changed anything… Here is the output of curl:
curl http://[fdaa:2:6ec2:a7b:18e:23d2:789b:2]:5500/flycheck/vm
[✓] checkDisk: 813.38 MB (83.5%) free space on /data/ (62.67µs)
[✓] checkLoad: load averages: 0.73 0.75 0.70 (83.65µs)
[✓] memory: system spent 0s of the last 60s waiting on memory (35.31µs)
[✗] cpu: system spent 1.88s of the last 10 seconds waiting on cpu (42.56µs)
[✓] io: system spent 2.02s of the last 60s waiting on io (23.93µs)
root@7811359a92dde8:/#
I don’t know what happened with Fly.io in recent days, but database queries became super slow. Just fetching all 3000 records from a super simple table takes 700-1200 ms. Before, it was much faster. I’ve already scaled from 256 MB to 512 MB and from 1 shared CPU to 2 shared CPUs.
Please don’t suggest scaling the machines, as that’s ridiculous. The database size is just 25 MB (data only). Even the simplest machine should handle that without any problems. Also why is it possible that 2 other machines with the same spec are fine now but just this one can’t start? It doesn’t make sense.
The client is mad at me because the app is slow. There are problems with Fly.io every single month. I’m not an expert with servers, but I’m thinking about switching to a dedicated server or just changing service provider. I’m also not going to spend much more money on a server that handles 30 users with very little traffic.
Also, please tell me what this means: [✗] cpu: system spent 1.88s of the last 10 seconds waiting on cpu (42.56µs)
You’re using shared CPUs, so you’re going to be susceptible to noisy-neighbor problems. However, given this is your replica, it shouldn’t be causing any issues unless you’re pushing all of your reads to that node specifically.
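As for what the cpu line itself says: as I read it, the check reports how long tasks on the machine sat waiting for CPU time within a recent window, so 1.88s out of 10 seconds is roughly 19% of that window. Here is a rough Python sketch of that arithmetic. The regex and the stall_fraction helper are my own, written against the lines pasted in this thread; the threshold the check actually uses to decide pass or fail is not shown here, and I’m not assuming one.

```python
import re

# Matches check lines pasted in this thread, for example:
#   [✗] cpu: system spent 1.88s of the last 10 seconds waiting on cpu (42.56µs)
#   [✓] io: system spent 630ms of the last 60s waiting on io (23.52µs)
LINE_RE = re.compile(
    r"\[.\]\s+(?P<resource>\w+): system spent "
    r"(?P<stalled>[\d.]+)(?P<unit>ms|s) of the last "
    r"(?P<window>\d+)\s*(?:s|seconds) waiting on"
)


def stall_fraction(line):
    """Return (resource, fraction of the window spent waiting)."""
    match = LINE_RE.search(line)
    if match is None:
        raise ValueError(f"unrecognised check line: {line!r}")
    stalled_seconds = float(match["stalled"])
    if match["unit"] == "ms":
        stalled_seconds /= 1000.0
    return match["resource"], stalled_seconds / float(match["window"])


failing = "[✗] cpu: system spent 1.88s of the last 10 seconds waiting on cpu (42.56µs)"
resource, fraction = stall_fraction(failing)
print(f"{resource}: {fraction:.1%} of the window spent waiting")
```

For comparison, the passing cpu line on the primary (1.86s of the last 60s) works out to about 3% of its window, so the failing line reflects a short, recent burst of contention rather than a machine that is down.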
Also why is it possible that 2 other machines with the same spec are fine now but just this one can’t start? It doesn’t make sense.
I don’t see any issues with your machine not being able to start?
The client is mad at me because the app is slow.
With regards to slow queries, there are numerous reasons why that could be happening. I would recommend configuring log-min-duration-statement via fly pg config to see if you can track down which queries specifically are having issues, if any. I would also recommend paginating your queries and not reading all 3000 records in at once.
Also make sure to monitor the metrics within the Fly dashboard to make sure there’s nothing going on there. This goes for both your App and Database.
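On the pagination suggestion, here is a minimal sketch of keyset pagination in Python. It uses an in-memory SQLite database purely so that it runs anywhere; the table name records, its columns, and the page size are made up for illustration. The same query shape (WHERE id > last seen id, ORDER BY id, LIMIT n) should carry over to Postgres.

```python
import sqlite3


def fetch_pages(conn, page_size=500):
    """Yield lists of rows, one page at a time, using keyset pagination."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, name FROM records WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, page_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        # Resume after the last id seen instead of using OFFSET.
        last_id = rows[-1][0]


# Hypothetical demo data: 3000 rows in a throwaway in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO records (name) VALUES (?)",
    [(f"row-{i}",) for i in range(3000)],
)

pages = list(fetch_pages(conn))
print(len(pages), "pages,", sum(len(page) for page in pages), "rows")
```

Each query only touches one page of rows via the primary key, so response time stays roughly flat regardless of how far into the table you are.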
I’m not sure how the shared CPU works, but I would assume it’s not that you’re directing traffic to a node that is busy; it’s just that one CPU is sometimes used by one machine and sometimes by another.
I don’t see any issues with your machine not being able to start?
I’m not sure what you mean by that. If it’s not starting because of some CPU issue, then why don’t you see an issue there? What should I do in this situation? I’m paying for a CPU that I can’t use? It’s ridiculous.
With regards to slow queries, there are numerous reasons why that could be happening.
The example of 3000 records is just “an example”. I’m an experienced developer and I’m not doing such things. I have proper indexes set and everything should be snappy.
But anyway, everything started working smoothly 1 hour ago. I did nothing. Also, the machine that didn’t want to start was stopped by me this morning, and someone (I guess someone from Fly.io) started it for me. So I guess you had some issue with the WAW server or something similar and you’re not talking about it. I did nothing to fix it, and the issue went away.
The other reason could be that the Postgres machine recently crashed because it ran out of memory. I’ve increased the memory, but it continued being slow. Maybe the server was in an “incorrect state”. But from what I remember, I restarted the server. Maybe it just went back into the correct state? But the fact that someone from Fly.io was messing with my machines’ state makes me think that it was a Fly.io server issue.
If it’s not starting because of some CPU issue, then why don’t you see an issue there? What should I do in this situation? I’m paying for a CPU that I can’t use? It’s ridiculous.
There’s no indication that CPU load was preventing your Machine from booting. Fwiw, that specific failed health check doesn’t mean the Machine isn’t running, but it can be an indicator that CPU could be impacting performance on that member. Basically, if the check is failing on your replica, you don’t have synchronous replication configured, the replica isn’t falling out of sync, etc., it should be safe to ignore it.
If self-serve is causing you issues and you think a managed solution would offer fewer headaches, I would consider checking out Supabase:
Hmmm I might actually try Supabase. I’ve tried it in the past but wasn’t happy with the performance, from what I remember, but that was mostly because the db and app server were in different regions. I see that the closest region to WAW is Frankfurt, but I’m not sure if the performance will be great. I will have to do some tests.
Also, I switched to a performance CPU for the database, and it’s a little bit faster but still not great. I’ve already had two slowdowns with the database in two days, and they’re totally unrelated to usage or the app. It just randomly starts being slow, even without any users. Something is definitely going on with the server. And I guess the metrics don’t tell the full story. Can you check with someone whether the WAW server has some problems? I don’t see the same problem with apps in other locations.
EDIT:
Also, the funny thing is that the servers suddenly started working super fast. It’s 9-10 pm though, so maybe not the highest usage. What is weird is that even the performance CPU (which I have on production) is not as fast as the shared CPU on the staging server. To me it looks like you have some issues with the server in Warsaw. I will do more tests by moving the staging server to another country and comparing performance in rush hours. But again, the staging server has 0 users and its performance also suffers.
Shaun here is referring to our new partnership with Supabase: databases are deployed on Fly.io infrastructure, right next to your apps. If you’re interested, we can add you to the private beta to try it out.
Done! All your orgs can now provision Supabase databases. Check out our docs here. The org hosting your existing database app also got some credits to use for testing.