postgres slowness is happening again

jsierles · August 24, 2022, 11:40am

OK - for now I’d suggest keeping it single region until we figure out the issue. It’s possible the latest version of fly-ruby is somehow using the wrong URL. We’ll take a look at this and think about what else might cause it.

luizkowalski · August 24, 2022, 12:05pm

sure, will do that

thanks again!

luizkowalski · August 24, 2022, 3:53pm

I’m doing some debugging, trying to pinpoint the issue. So far, it doesn’t seem like fly-ruby is the culprit here.

this is what I’ve done so far:

Set PRIMARY_REGION back (fra)
scaled out postgres (fra and gru) and the web app (same regions)
set the DATABASE_URL to fatia-pizza.internal:5432 (no fra.fatia-pizza or top1.nearest.of prefix)

with that set up I’m logging the value of ActiveRecord::Base.connection_db_config.configuration_hash on GET and POST requests
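The per-request logging described above could be done with a small Rack middleware. This is only a sketch, not the code used in the thread: the class name `DbConfigLogger` and the `config_source` parameter are hypothetical, and in a Rails app the callable would presumably wrap `ActiveRecord::Base.connection_db_config.configuration_hash`.

```ruby
require "logger"

# Rack middleware that logs the database configuration seen by each request.
# `config_source` is any callable returning a Hash. In a Rails app it could be
#   -> { ActiveRecord::Base.connection_db_config.configuration_hash }
# (hypothetical wiring, not taken from the thread).
class DbConfigLogger
  REDACTED_KEYS = %i[username password].freeze

  def initialize(app, config_source:, logger: Logger.new($stdout))
    @app = app
    @config_source = config_source
    @logger = logger
  end

  def call(env)
    config = redact(@config_source.call)
    @logger.info(
      "#{env['REQUEST_METHOD']} #{env['PATH_INFO']} Configuration: #{config.inspect}"
    )
    @app.call(env)
  end

  private

  # Replace credentials with a placeholder so they never reach the logs.
  def redact(config)
    config.to_h { |key, value| [key, REDACTED_KEYS.include?(key) ? "xx" : value] }
  end
end
```

Taking the configuration from a callable keeps the middleware independent of ActiveRecord, so it can be exercised with a plain lambda outside Rails.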

this is a GET request on gru

Configuration: {:pool=>50, :adapter=>"postgresql", :username=>"xx", :password=>"xx", :port=>5433, :database=>"sumiu_web", :host=>"gru.fatia-pizza.internal"}

this is a GET request on fra

Configuration: {:pool=>50, :adapter=>"postgresql", :username=>"xx", :password=>"xx", :port=>5432, :database=>"sumiu_web", :host=>"fatia-pizza.internal"}

this is a POST request on fra

Configuration: {:pool=>50, :adapter=>"postgresql", :username=>"xx", :password=>"xx", :port=>5432, :database=>"sumiu_web", :host=>"fatia-pizza.internal"}

and the POST request on gru gets routed to fra and it is the same output as above.

so far, with this config, I saw one or two slow requests routed from gru to fra but I haven’t been able to reproduce it anymore.

one thing that is not clear to me: why (or when) should I use top1.nearest.of when fly-ruby can do pretty much the same thing? could the issue be due to top1.nearest.of and fly-ruby kinda fighting over the database URL, like fly-ruby says “I want this db” and the resolution says “no, have this one”? does that make sense?

I’ll let the app run for a bit with this setup and see what happens

kurt · August 24, 2022, 3:55pm

This is probably the issue. It needs to use top1.nearest.of.fatia-pizza.internal (or fra.fatia-pizza.internal). fatia-pizza.internal returns all IPs for an app, regardless of region.

I think fly-ruby might be setting the wrong hostname for the primary region.

luizkowalski · August 24, 2022, 4:08pm

let me update the URL and see what happens

luizkowalski · August 24, 2022, 4:17pm

on gru it sets gru.top1.nearest.of.fatia-pizza.internal and on fra it uses top1.nearest.of.fatia-pizza.internal, but here is the weird thing: fly dig for the gru and fra names often resolves to the same IP:

❯ fly dig gru.top1.nearest.of.fatia-pizza.internal -a sumiu-web
;; opcode: QUERY, status: NOERROR, id: 65091
;; flags: qr rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;gru.top1.nearest.of.fatia-pizza.internal.	IN	 AAAA

;; ANSWER SECTION:
gru.top1.nearest.of.fatia-pizza.internal.	5	IN	AAAA	fdaa:0:6c8d:a7b:1f63:1:480b:2

;; opcode: QUERY, status: NOERROR, id: 61620
;; flags: qr rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;fra.top1.nearest.of.fatia-pizza.internal.	IN	 AAAA

;; ANSWER SECTION:
fra.top1.nearest.of.fatia-pizza.internal.	5	IN	AAAA	fdaa:0:6c8d:a7b:1f63:1:480b:2

then after a couple of minutes, gru resolves to a different IP

❯ fly dig gru.top1.nearest.of.fatia-pizza.internal -a sumiu-web
;; opcode: QUERY, status: NOERROR, id: 18956
;; flags: qr rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;gru.top1.nearest.of.fatia-pizza.internal.	IN	 AAAA

;; ANSWER SECTION:
gru.top1.nearest.of.fatia-pizza.internal.	5	IN	AAAA	fdaa:0:6c8d:a7b:23c4:1:3fb6:2
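The same comparison can be scripted instead of eyeballing `fly dig` output. The sketch below uses Ruby's standard `resolv` library to collect AAAA records for several names and report which names share an address set. The helper names `aaaa_records` and `shared_targets` are hypothetical, and it assumes the script runs somewhere the `.internal` names actually resolve (for example inside the app's private network).

```ruby
require "resolv"

# Returns { hostname => [sorted IPv6 strings] } using AAAA lookups.
# `resolver` must respond to #getresources the way Resolv::DNS does.
def aaaa_records(hostnames, resolver: Resolv::DNS.new)
  hostnames.to_h do |name|
    records = resolver.getresources(name, Resolv::DNS::Resource::IN::AAAA)
    [name, records.map { |record| record.address.to_s }.sort]
  end
end

# Returns { [addresses] => [hostnames] } for every non-empty address set
# that more than one hostname resolved to.
def shared_targets(records_by_name)
  records_by_name
    .group_by { |_name, addresses| addresses }
    .select { |addresses, pairs| !addresses.empty? && pairs.size > 1 }
    .transform_values { |pairs| pairs.map(&:first) }
end
```

Calling `shared_targets(aaaa_records([...]))` with the gru and fra names from the output above would flag them whenever both point at the same instance.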

jsierles · August 24, 2022, 4:32pm

Right, so gru.top1... is not a valid DNS entry. fly-ruby doesn’t know about nearest hostnames. That said, with fly-ruby activated, using fatia-pizza.internal should be working. If not, fly-ruby is not hijacking correctly.

What was confusing me was that in FRA, you mentioned you had slow requests when using fra.fatia-pizza.internal as the URL. That should not happen, as fly-ruby does not activate at all in the primary region.

I’m going to set up an app to test as well. One thing you might try is commenting out workers and preload_app! in config/puma.rb. There may be an issue with forking servers interfering with fly-ruby hijacking.

luizkowalski · August 24, 2022, 4:33pm

hmm, so does that mean that fly-ruby and top1.nearest.of are incompatible and if you are using fly-ruby you should use the regular <app>.internal instead?

jsierles · August 24, 2022, 4:34pm

Yes - initially I thought this case was covered, but I was wrong. Again, though, the primary region should never be slow regardless of how fly-ruby is set up. If fly-ruby is activating in the primary region, there’s something really busted.

To be clear: using the raw hostname <app>.internal will give you random IPs. Using top1.nearest should work when we aren’t having issues like the first one that came up here. <region>.<app>.internal should always work in the primary region. fly-ruby should probably, then, make sure to cover all of these cases. I’ll take a look at this and report back.

luizkowalski · August 24, 2022, 5:02pm

No, when using fra.fatia-pizza I don’t have any slowness, but then fly-ruby doesn’t really work. This is the hostname that it generates:

:port=>5433, :database=>"sumiu_web", :host=>"gru.fra.fatia-pizza.internal"

I don’t think gru.fra.fatia-pizza is a valid DNS name, since they are resolving to the same IP address right now. For now I’ve set fra.fatia-pizza: no slowness, but I’m also not using the GRU replica.

when I use top1.nearest the db interactions jump from 20ms to 1166 ms (funny that it is always 1166.6 ms, not sure why), so I’m not really sure what to do (other than not using multi-region deploys)

jsierles · August 24, 2022, 5:03pm

For now, I recommend turning off multi-region deploys until we can fix fly-ruby to do the right thing here. I’m looking at it now. We’re thinking it should work like:

Primary region always uses <region>.<app>.internal
Secondary uses top1.nearest.of.<app>.internal

The latter will ensure that a deployment to a region without a replica can still work, even if it’s slower.

luizkowalski · August 24, 2022, 5:06pm

cool, will do that
I’m gonna be off for a couple of weeks but I will try to help fly-ruby if I can

your proposal makes sense, I think it should work like that

jsierles · August 24, 2022, 5:46pm

Going to merge this now, and test it on my end. Feel free to test as well at your convenience: “Fix database connection hijacking” (Pull Request #22, superfly/fly-ruby on GitHub).

luizkowalski · August 24, 2022, 6:12pm

amazing! that was fast hahahah

this is what I get in GRU

:port=>5433, :database=>"sumiu_web", :host=>"top1.nearest.of.fatia-pizza.internal"

and this is what I get on FRA

:port=>5432, :database=>"sumiu_web", :host=>"fra.fatia-pizza.internal"

slowness is gone, GRU is slightly slower (~20ms), as expected since it has to replay some requests but the 7s delay is gone. gonna leave it up for a couple of hours but apparently, that was it

jsierles · August 24, 2022, 6:15pm

Nice! Yay for test suites. Are you using Redis or ActionCable in this app?

luizkowalski · August 24, 2022, 6:17pm

using Redis, but I don’t have a multi-region setup for it, just a single instance (and I deployed my own Redis setup with a custom Dockerfile)

jsierles · August 24, 2022, 6:20pm

What are you using it for? cache, sidekiq, both?

luizkowalski · August 24, 2022, 6:36pm

kinda both. I have a sidekiq instance running but I don’t have any jobs, I set it up just in case. I’m using Redis as storage for rack-attack (which uses Rails.cache behind the scenes)

jsierles · August 24, 2022, 6:43pm

Interesting. You could try the new managed redis option. See fly redis create. You can set it up with a replica in gru, and with a single URL, reads and writes should ‘just work’ as expected without intervention from your app. If rack-attack reads from Redis on every request, it might help with overall request latency.

luizkowalski · August 24, 2022, 6:46pm

I tried but sidekiq exhausted the 10k commands in a blink of an eye lol

will disable sidekiq and try again, maybe have a managed redis just for cache and drop my own and set up sidekiq when/if I need

good idea, @jsierles
