Can't clone Postgres machine

lukejagodzinski · February 2, 2024, 12:35am

I’m trying to clone the Postgres machine:

fly machine clone 3287961a027328 --region waw --app raczekteam-db

But it fails after a timeout, with only 2 of 3 checks passing. Here is the list of machines:

fly status -a raczekteam-db
ID            	STATE  	ROLE   	REGION	CHECKS                        	IMAGE                             	CREATED             	UPDATED
7811359a92dde8	started	replica	waw   	3 total, 2 passing, 1 critical	flyio/postgres-flex:15.3 (v0.0.46)	2024-02-02T00:02:13Z	2024-02-02T00:27:02Z	
4d8979df452587	started	primary	waw   	3 total, 3 passing            	flyio/postgres-flex:15.3 (v0.0.46)	2023-06-30T13:14:38Z	2024-02-01T23:38:01Z	
3287961a027328	started	replica	waw   	3 total, 3 passing            	flyio/postgres-flex:15.3 (v0.0.46)	2024-02-01T23:03:38Z	2024-02-01T23:36:14Z

and

fly checks list -a raczekteam-db
Health Checks for raczekteam-db
  NAME | STATUS   | MACHINE        | LAST UPDATED | OUTPUT
-------*----------*----------------*--------------*-------------------------------------------------------------------------------
  pg   | passing  | 3287961a027328 | 1h30m ago    | [✓] connections: 10 used, 3 reserved, 300 max (3.86ms)
-------*----------*----------------*--------------*-------------------------------------------------------------------------------
  role | passing  | 3287961a027328 | 1h30m ago    | replica
-------*----------*----------------*--------------*-------------------------------------------------------------------------------
  vm   | passing  | 3287961a027328 | 1h30m ago    | [✓] checkDisk: 823.34 MB (83.5%) free space on /data/ (91.16µs)
       |          |                |              | [✓] checkLoad: load averages: 0.00 0.00 0.00 (81.29µs)
       |          |                |              | [✓] memory: system spent 0s of the last 60s waiting on memory (69.13µs)
       |          |                |              | [✓] cpu: system spent 612ms of the last 60s waiting on cpu (48.91µs)
       |          |                |              | [✓] io: system spent 630ms of the last 60s waiting on io (23.52µs)
-------*----------*----------------*--------------*-------------------------------------------------------------------------------
  pg   | passing  | 4d8979df452587 | 5h9m ago     | [✓] connections: 18 used, 3 reserved, 300 max (16.61ms)
       |          |                |              | [✓] cluster-locks: No active locks detected (21.05µs)
       |          |                |              | [✓] disk-capacity: 16.5% - readonly mode will be enabled at 90.0% (158.67µs)
-------*----------*----------------*--------------*-------------------------------------------------------------------------------
  role | passing  | 4d8979df452587 | 56m33s ago   | primary
-------*----------*----------------*--------------*-------------------------------------------------------------------------------
  vm   | passing  | 4d8979df452587 | 5h4m ago     | [✓] checkDisk: 823.27 MB (83.5%) free space on /data/ (97.39µs)
       |          |                |              | [✓] checkLoad: load averages: 1.02 1.22 0.23 (89.53µs)
       |          |                |              | [✓] memory: system spent 276ms of the last 60s waiting on memory (68.8µs)
       |          |                |              | [✓] cpu: system spent 1.86s of the last 60s waiting on cpu (62.69µs)
       |          |                |              | [✓] io: system spent 150ms of the last 60s waiting on io (37.67µs)
-------*----------*----------------*--------------*-------------------------------------------------------------------------------
  pg   | passing  | 7811359a92dde8 | 7m25s ago    | [✓] connections: 10 used, 3 reserved, 300 max (55.92ms)
-------*----------*----------------*--------------*-------------------------------------------------------------------------------
  role | passing  | 7811359a92dde8 | 14m10s ago   | replica
-------*----------*----------------*--------------*-------------------------------------------------------------------------------
  vm   | critical | 7811359a92dde8 | 31m56s ago   | connect: connection refused
-------*----------*----------------*--------------*-------------------------------------------------------------------------------
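With more machines, the failing check can be picked out of a table like the one above mechanically. This is a minimal sketch, assuming the pipe-separated layout that `fly checks list` printed here; `critical_checks` is a hypothetical helper name, not part of flyctl.

```shell
# Print "<machine-id> <check-name>" for every row whose STATUS column is critical.
# Assumption: columns are separated by "|" in the order NAME | STATUS | MACHINE,
# as in the output pasted in this thread.
critical_checks() {
  awk -F'|' '$2 ~ /critical/ {
    gsub(/[[:space:]]/, "", $1)   # strip padding around the check name
    gsub(/[[:space:]]/, "", $3)   # strip padding around the machine ID
    print $3, $1
  }'
}

# Usage (not run here):
# fly checks list -a raczekteam-db | critical_checks
```

For the output above, this would leave only the `vm` check on machine 7811359a92dde8.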

I’m really tired of all the problems with Fly.io. Every time I have to change something, there is something that doesn’t work…

Can you help me?

shaun · February 2, 2024, 3:12am

These health checks only update when the underlying state changes. Your connections are working just fine, but since the underlying VM check is still failing, the text never updates. It’s kind of a bummer, but there are reasons for this…

Anyways, you can find the actual issue by ssh’ing into the machine that has the failing check:

fly ssh console -s 7811359a92dde8

Once you’re in the console, run the following:

curl http://[<machines-private-ip>]:5500/flycheck/vm

Note: The machine’s private IP should be displayed when you SSH into the machine.

The result of that curl command should tell you what’s going on. Hint: Your CPU load is high.
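If the output is long, the failing entries can be filtered out. This is a rough sketch, assuming failing checks are prefixed with `[✗]` in the same way passing ones are prefixed with `[✓]`; `failing_checks` is a hypothetical helper name.

```shell
# Keep only the failing lines from /flycheck/vm output read on stdin.
# Assumption: a failing check starts with "[✗]".
failing_checks() {
  # -F treats the pattern as a fixed string, so the brackets are literal.
  # "|| true" keeps the exit status at 0 when nothing is failing.
  grep -F '[✗]' || true
}

# Usage inside the machine (placeholder IP as above, not run here):
# curl -s "http://[<machines-private-ip>]:5500/flycheck/vm" | failing_checks
```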

Hope that helps.

lukejagodzinski · February 2, 2024, 10:48am

Hey, this is what I get:

fly ssh console -s 7811359a92dde8 -a raczekteam-db
? Select VM: waw: 7811359a92dde8 fdaa:2:6ec2:a7b:18e:23d2:789b:2 cold-hill-9640 (replica) (app)
Error: host unavailable at 7811359a92dde8: host was not found in DNS

I can’t even SSH into the machine. Also, I’ve checked the forum and tried to see if IPs are assigned, as suggested in one of the threads:

flyctl ips list --app raczekteam-db
VERSION	IP                	TYPE   	REGION	CREATED AT
v6     	fdaa:2:6ec2:0:1::6	private	global	Jun 30 2023 13:14	

Learn more about Fly.io public, private, shared and dedicated IP addresses in our docs: https://fly.io/docs/reference/services/#ip-addresses

flyctl ips private --app raczekteam-db
ID            	REGION	IP
4d8979df452587	waw   	fdaa:2:6ec2:a7b:c8:a887:fec5:2 	
3287961a027328	waw   	fdaa:2:6ec2:a7b:8c:ae2e:4ab0:2 	
7811359a92dde8	waw   	fdaa:2:6ec2:a7b:18e:23d2:789b:2

Everything seems fine on the IP front?

lukejagodzinski · February 2, 2024, 11:15am

EDIT:

For some reason suddenly it allowed me to ssh into the machine. I haven’t changed anything… Here is the output of curl:

curl http://[fdaa:2:6ec2:a7b:18e:23d2:789b:2]:5500/flycheck/vm
[✓] checkDisk: 813.38 MB (83.5%) free space on /data/ (62.67µs)
[✓] checkLoad: load averages: 0.73 0.75 0.70 (83.65µs)
[✓] memory: system spent 0s of the last 60s waiting on memory (35.31µs)
[✗] cpu: system spent 1.88s of the last 10 seconds waiting on cpu (42.56µs)
[✓] io: system spent 2.02s of the last 60s waiting on io (23.93µs)
root@7811359a92dde8:/#

shaun · February 2, 2024, 3:31pm

If this is something you’re worried about, you should consider scaling.

lukejagodzinski · February 2, 2024, 3:41pm

I don’t know what happened with Fly.io in recent days, but database queries became super slow. Just fetching all 3000 records from a super simple table takes 700-1200 ms. Before, it was much faster. I’ve already scaled from 256 MB to 512 MB and from 1 shared CPU to 2 shared CPUs.

Please don’t suggest scaling the machines; it’s ridiculous. The database size is just 25 MB (data only). Even the simplest machine should handle that without any problems. Also why is it possible that 2 other machines with the same spec are fine now but just this one can’t start? It doesn’t make sense.

The client is mad at me because the app is slow. There are some problems with Fly.io every single month. I’m not an expert with servers, but I’m thinking about switching to a dedicated server or just changing service provider. I’m also not going to spend much more money on the server to handle 30 users with very, very little traffic.

Also, please tell me what does it mean? [✗] cpu: system spent 1.88s of the last 10 seconds waiting on cpu (42.56µs)
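Taken at face value, that line reports 1.88 s of waiting within a 10 s window, which is about 18.8% of the window. A small sketch of the arithmetic follows, assuming the "spent X of the last N" wording seen in this thread; `wait_share` is a hypothetical helper, and the threshold at which the check flips to failing is not stated here.

```shell
# Turn "spent <time> of the last <window>" into a percentage of the window.
# Assumptions: <time> is in s or ms, and <window> is in seconds
# (written either as "60s" or as "10 seconds").
wait_share() {
  awk '{
    spent = ""; window = ""
    for (i = 1; i < NF; i++) {
      if ($i == "spent") spent = $(i + 1)
      if ($i == "last")  window = $(i + 1)
    }
    if (spent == "" || window == "") next   # not a "spent ... of the last ..." line
    # Normalise the waited time to seconds.
    if (spent ~ /ms$/) { sub(/ms$/, "", spent); spent = spent / 1000 }
    else               { sub(/s$/, "", spent) }
    sub(/s$/, "", window)                    # "60s" -> "60"; "10" stays as is
    printf "%.1f%%\n", 100 * spent / window
  }'
}

echo '[✗] cpu: system spent 1.88s of the last 10 seconds waiting on cpu (42.56µs)' | wait_share
# prints 18.8%
```

The same helper applied to the earlier passing line ("612ms of the last 60s") gives about 1%, which shows how far the two readings are apart.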

shaun · February 2, 2024, 4:24pm

You’re using shared CPUs, so you’re going to be susceptible to noisy neighbor problems. However, given this is your replica, it shouldn’t be causing any issues unless you’re pushing all of your reads to that node specifically.

Also why is it possible that 2 other machines with the same spec are fine now but just this one can’t start? It doesn’t make sense.

I don’t see any issues with your machine not being able to start?

The client is mad at me because the app is slow.

With regards to slow queries, there are numerous reasons why that could be happening. I would recommend configuring log-min-duration-statement via fly pg config to see if you can track down which queries specifically are having issues, if any. I would also recommend paginating your queries and not reading all 3000 records in at once.

Also make sure to monitor the metrics within the Fly dashboard to make sure there’s nothing going on there. This goes for both your App and Database.

lukejagodzinski · February 2, 2024, 8:20pm

I’m not sure how the shared CPU works, but I would assume that it’s not like you’re trying to direct traffic to a node that is busy; it’s just that one CPU might sometimes be used by one machine and sometimes by another.

I don’t see any issues with your machine not being able to start?

I’m not sure what you mean by that. If it’s not starting because of some CPU issue, then why don’t you see an issue there? What should I do in this situation? I’m paying for a CPU that I can’t use? It’s ridiculous.

With regards to slow queries, there are numerous reasons why that could be happening.

The example of 3000 records is just “an example”. I’m an experienced developer and I’m not doing such things. I have proper indexes set and everything should be snappy.

But anyway, everything started working smoothly 1 hour ago. I did nothing. Also, the machine that didn’t want to start was stopped by me this morning, and someone (I guess someone from Fly.io) started it for me. So I guess you had some issue with the WAW server or something similar and you’re not talking about it. I did nothing to fix it, and the issue went away.

The other reason could be that the Postgres machine recently crashed because of a lack of memory. I increased the memory, but it continued being slow. Maybe the server was in an “incorrect state”. But from what I remember, I restarted the server. Maybe it just went into the correct state? But the fact that someone from Fly.io was messing with my machines’ state makes me think that it was an issue with Fly.io’s server.

shaun · February 2, 2024, 9:09pm

If it’s not starting because of some CPU issue, then why don’t you see an issue there? What should I do in this situation? I’m paying for a CPU that I can’t use? It’s ridiculous.

There’s no indication that CPU load was preventing your Machine from booting. Fwiw, that specific failed health check doesn’t mean the Machine isn’t running, but it can be an indicator that CPU could be impacting performance on that member. Basically, if the check is failing on your replica, you don’t have synchronous replication configured, and the replica isn’t falling out of sync, it should be safe to ignore it.

If self-serve is causing you issues and you think a managed solution would offer fewer headaches, I would consider checking out Supabase.

lukejagodzinski · February 3, 2024, 8:00pm

Hmmm, I might actually try Supabase. I’ve tried it in the past and, from what I remember, wasn’t happy with the performance, but that was mostly because the db and app server were in different regions. I see that the closest region to WAW is Frankfurt, but I’m not sure if the performance will be great. I’ll have to do some tests.

Also, I switched to a performance CPU for the database and it’s a little bit faster but still not great. I’ve already had two slowdowns with the database in two days, and they’re totally unrelated to usage or the app. It just randomly starts being slow even without any users. Something is definitely going on with the server. And I guess the metrics don’t tell the full story. Can you check with someone whether the WAW server has some problems? I don’t see the same problem with the app in other locations.

EDIT:
Also, the funny thing is that the servers suddenly started working super fast. It’s 9-10 pm though, so maybe not the highest usage. What is weird is that even the performance CPU (which I have on production) is not as fast as the shared CPU on the staging server. To me it looks like you have some issues with the server in Warsaw. I will do more tests by moving the staging server to another country and comparing performance during rush hours. But again, the staging server has 0 users and its performance also suffers.

joshua-fly · February 5, 2024, 9:32am

Shaun here is referring to our new partnership with Supabase: databases are deployed on Fly.io infrastructure, right next to your apps. If you’re interested, we can add you to the private beta to try it out.

lukejagodzinski · February 5, 2024, 10:57am

Oh ok. Then yes, please add me to the private beta. Thanks

joshua-fly · February 5, 2024, 11:22am

Done! All your orgs can now provision Supabase databases. Check out our docs here. The org hosting your existing database app also got some credits to use for testing.

system · February 12, 2024, 11:23am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.
