postgres slowness is happening again

A couple of months ago I noticed slowness in my app. After lots of debugging, @jsierles pointed to top2.nearest.of as the root cause; most likely I should’ve been using top1. After I updated it, the slowness was gone.

This morning I upgraded the fly-ruby gem to 0.4.1 and noticed that the slowness is back. This is one of the logs I got from my app:

2022-08-23T13:14:03.222 app[5e53d40d] fra [info] [4b874cd4-d9b4-4176-aa2d-37b77aa987dd] {"method":"POST","path":"/m","format":"*/*","controller":"MessagesController","action":"create","status":200,"duration":10844.91,"view":4.36,"db":7785.25,"host":"sumiu.link","ip":"46.161.11.227","time":"2022-08-23T15:14:03+02:00"}

Notice the "db":7785.25: over 7s of db processing. I tried rolling back the gem, thinking it could be the issue, and tried scaling the machine up to 512MB (it currently runs on 256MB), also without success.
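
For anyone following along, here is a minimal sketch of pulling the db timing out of a log line like the one above. It assumes the JSON payload starts at the first "{" and runs to the end of the line. The helper names (`parse_request_log`, `slow_db?`) and the 1000 ms threshold are illustrative choices, not something taken from the app.

```ruby
require "json"

# Extract the JSON payload from a request log line.
# Assumption: the payload starts at the first "{" and runs to the end of the line.
def parse_request_log(line)
  start = line.index("{")
  return nil if start.nil?

  JSON.parse(line[start..])
rescue JSON::ParserError
  nil
end

# Hypothetical helper: true when the "db" time (in milliseconds) exceeds the threshold.
def slow_db?(entry, threshold_ms: 1000.0)
  !entry.nil? && entry["db"].to_f > threshold_ms
end

line = '2022-08-23T13:14:03.222 app[5e53d40d] fra [info] [4b874cd4-d9b4-4176-aa2d-37b77aa987dd] {"method":"POST","path":"/m","format":"*/*","controller":"MessagesController","action":"create","status":200,"duration":10844.91,"view":4.36,"db":7785.25,"host":"sumiu.link","ip":"46.161.11.227","time":"2022-08-23T15:14:03+02:00"}'

entry = parse_request_log(line)
puts "db=#{entry["db"]}ms slow=#{slow_db?(entry)}"
```

Running the same check over a batch of log lines makes it easier to see whether the slow requests are occasional or constant.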

I haven’t changed anything, didn’t introduce any new feature nor do I have a spike of users coming.

Are there any known issues with postgres at this time? What else could cause this?

So rolling back the gem didn’t help here? This being a POST request, it should have been replayed to the primary instance. Do you have the PRIMARY_REGION env var set?

Nope, it doesn’t look like the gem is at fault here.

This being a POST request, it should have been replayed to the primary instance.

I’m always hitting the primary instance, for both reads and writes, since I’m sitting in Berlin and the primary instance is in fra.

Do you have the PRIMARY_REGION env var set?

yup, I’ve never changed that

And your DATABASE_URL is set to top1.nearest.of...? Can you try fly dig top1.nearest.of.pg-app-name.internal -a appname? Does the IP there match the IP of fly dig fra.pg-app-name.internal -a appname?

hmm, no, they match…sometimes…
Running it multiple times, they match about 90% of the time. That probably explains why some requests go through quickly while others hang for 7+ seconds.
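
A rough way to put a number on "match about 90% of the time" is to resolve both names repeatedly and count how often the answers agree. The sketch below assumes it runs somewhere the .internal names actually resolve (for example inside the app VM). `match_rate` is a hypothetical helper, and the stubbed resolver and its fdaa:: addresses are made up so the example runs without a network.

```ruby
require "resolv"

# Default resolver: all addresses for a hostname, sorted for stable comparison.
# Assumption: the .internal names only resolve from inside the private network.
DEFAULT_RESOLVER = ->(host) { Resolv.getaddresses(host).sort }

# Resolve both hostnames `attempts` times and return the fraction of attempts
# in which both lookups returned the same set of addresses.
def match_rate(host_a, host_b, attempts: 20, resolver: DEFAULT_RESOLVER)
  matches = attempts.times.count { resolver.call(host_a) == resolver.call(host_b) }
  matches.to_f / attempts
end

# Stubbed example (no network): the "nearest" name returns a different,
# made-up address on every 10th lookup; the fra name is always stable.
calls = 0
stub = lambda do |host|
  next ["fdaa::1"] if host.start_with?("fra.")

  calls += 1
  (calls % 10).zero? ? ["fdaa::2"] : ["fdaa::1"]
end

rate = match_rate("top1.nearest.of.pg-app-name.internal",
                  "fra.pg-app-name.internal",
                  attempts: 20, resolver: stub)
puts "match rate: #{rate}"
```

Dropping the `resolver:` argument uses real DNS lookups instead of the stub.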

Huh, that’s not good. For now, what you can do is set your URL hostname to postgres-app-name.internal. fly-ruby should automatically adjust the URL for the secondary region to point to the regional replica.

I didn’t see any difference tbh, time is still inconsistent, here are two requests:

db:1166.66

and

db:7788.06

no noticeable time difference here

Where is your secondary located? A last attempt could be to change the host to fra.pg-app-name.internal and only run in the primary region, to just test the primary.

secondary is located in gru

setting the db URL to fra.pg-app.internal worked

"db":5.15

seems like something is wrong with the replicas/routing

Could you share the org name where your pg app is located, or the name of that app? It would be helpful to see the health of the pg app.

If it’s flapping for some reason - particularly the primary - it would be normal for the IPs to change like that. You can get some status on the pg app with fly status -a pg-app-name and fly logs -a pg-app-name.

UPDATE: We found your app and are looking into what might cause this. For now, I’d recommend keeping the app running only in fra with the current hostname so visitors in gru don’t get slow reads.

the app name is “fatia-pizza”, apps connected to it are “sumiu-web” and “sumiu-worker”

Ok this was an issue with our app mis-detecting ping times for your database instances, making the results of top1.nearest.of fail. It should be fixed now. We added a health check so we’ll get alerted if this ever happens again.

holy shit that’s why I love fly

all is working now! thanks a lot, you both

update: it is back again :(

@kurt it is happening again

fly dig top1.nearest.of and fly dig fra.pg-app from sumiu-web are returning different IPs, same as before

Sorry about this. Is this still happening right now? I checked just now and it looks like the IPs are now resolving correctly.

Yup, still happening: the IPs are resolving correctly but the slowness is still here. Changing from top1.nearest.of to fra.pg-app didn’t help either, which is weird. Scaling postgres down to 1 (forcing it to have one and only one instance, in fra) actually solved it… somehow it seems like the requests are still going to gru.

Weird! Which version of fly-ruby are you using now?

from github’s main branch, actually

OK - and you still have PRIMARY_REGION set? I’d suggest perhaps disabling fly-ruby by unsetting PRIMARY_REGION, and switching back to fra.pgapp. Then scaling up the cluster. If requests in fra are fast, then we might suspect it’s fly-ruby related, though that seems unlikely here.

It’s worth also double checking that the correct URL is set on the FRA VM via fly ssh console.

I triple-checked PRIMARY_REGION and it is set to fra on both instances (fra and gru). Disabling the middleware by removing it works, but then it kills the multi-region feature: everything is routed to fra (I’m testing through a VPN set to Brazil and Argentina).

It’s worth also double checking that the correct URL is set on the FRA VM via fly ssh console.

I did that a hundred times already, because I thought I might have screwed something up, but the database URL is correct. I tried a number of URLs:

  • postgres://fatia-pizza.internal:5432/sumiu_web
  • postgres://fra.fatia-pizza.internal:5432/sumiu_web
  • postgres://top1.nearest.of.fatia-pizza.internal:5432/sumiu_web