postgres slowness is happening again

OK - for now I’d suggest keeping it single region until we figure out the issue. It’s possible the latest version of fly-ruby is somehow using the wrong URL. We’ll take a look at this and think about what else might cause it.

sure, will do that

thanks again!

I’m doing some debugging, trying to pinpoint the issue. So far, it doesn’t seem like fly-ruby is the culprit here.

this is what I’ve done so far:

  • Set PRIMARY_REGION back (fra)
  • scaled out postgres (fra and gru) and the web app (same regions)
  • set the DATABASE_URL to fatia-pizza.internal:5432, with neither the fra.fatia-pizza nor the top1.nearest.of prefix

with that set up I’m logging the value of ActiveRecord::Base.connection_db_config.configuration_hash on GET and POST requests

this is a GET request on gru

Configuration: {:pool=>50, :adapter=>"postgresql", :username=>"xx", :password=>"xx", :port=>5433, :database=>"sumiu_web", :host=>"gru.fatia-pizza.internal"}

this is a GET request on fra

Configuration: {:pool=>50, :adapter=>"postgresql", :username=>"xx", :password=>"xx", :port=>5432, :database=>"sumiu_web", :host=>"fatia-pizza.internal"}

this is a POST request on fra

Configuration: {:pool=>50, :adapter=>"postgresql", :username=>"xx", :password=>"xx", :port=>5432, :database=>"sumiu_web", :host=>"fatia-pizza.internal"}

and the POST request on gru gets routed to fra and it is the same output as above.
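For anyone who wants to reproduce this kind of logging, here is a minimal sketch of how it could be done. The helper names `redact_db_config` and `db_config_log_line` are made up for illustration, and the Rails wiring in the trailing comment is an assumption about how you would hook it in, not something taken from the app in this thread.

```ruby
# Keys whose values should never end up in logs.
SENSITIVE_KEYS = %i[username password].freeze

# Returns a copy of the config hash with sensitive values masked as "xx".
# The original hash is left untouched.
def redact_db_config(config)
  config.to_h do |key, value|
    [key, SENSITIVE_KEYS.include?(key) ? "xx" : value]
  end
end

# Builds one log line tagged with the region and the HTTP method,
# so GET and POST requests from each region can be compared.
def db_config_log_line(region:, http_method:, config:)
  "[#{region}] #{http_method} Configuration: #{redact_db_config(config).inspect}"
end

# Assumed Rails usage (for example inside a before_action):
#   config = ActiveRecord::Base.connection_db_config.configuration_hash
#   Rails.logger.info(
#     db_config_log_line(region: ENV["FLY_REGION"], http_method: request.method, config: config)
#   )
```

Tagging each line with the region makes it easier to spot a request that was served in gru but talked to a host in fra.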

so far, with this config, I saw one or two slow requests routed from gru to fra but I haven’t been able to reproduce it anymore.

one thing that is not clear to me: why (or when) should I use top1.nearest.of when fly-ruby can do pretty much the same thing? could the issue be due to top1.nearest.of and fly-ruby kinda fighting over the database URL, like fly-ruby says “I want this db” and the resolution says “no, have this one”? does that make sense?

I’ll let the app run for a bit with this setup and see what happens

This is probably the issue. It needs to use top1.nearest.of.fatia-pizza.internal (or fra.fatia-pizza.internal). fatia-pizza.internal returns all IPs for an app, regardless of region.
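If you want to check this from inside the app, a rough sketch along these lines lists the AAAA records behind a hostname using Ruby's standard `resolv` library. `aaaa_addresses` is a hypothetical helper name, and the lookups only return anything meaningful when run inside the app's private network, where the `.internal` names resolve.

```ruby
require "resolv"

# Returns every AAAA address the resolver reports for a hostname.
# `resolver` only needs to respond to #getresources the way Resolv::DNS does,
# which also makes it easy to swap in a fake for testing.
def aaaa_addresses(hostname, resolver: Resolv::DNS.new)
  resolver
    .getresources(hostname, Resolv::DNS::Resource::IN::AAAA)
    .map { |record| record.address.to_s }
end

# Assumed usage from a console on a running instance:
#   aaaa_addresses("fatia-pizza.internal")                  # all instances, any region
#   aaaa_addresses("fra.fatia-pizza.internal")              # instances in fra only
#   aaaa_addresses("top1.nearest.of.fatia-pizza.internal")  # the single closest instance
```

Comparing the three results side by side should show whether the bare hostname is handing back addresses from more than one region.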

I think fly-ruby might be setting the wrong hostname for the primary region.

let me update the URL and see what happens

on gru it sets gru.top1.nearest.of.fatia-pizza.internal and on fra it uses top1.nearest.of.fatia-pizza.internal, but here is the weird thing: fly dig gru and fly dig fra often resolve to the same IP:

❯ fly dig gru.top1.nearest.of.fatia-pizza.internal -a sumiu-web
;; opcode: QUERY, status: NOERROR, id: 65091
;; flags: qr rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;gru.top1.nearest.of.fatia-pizza.internal.	IN	 AAAA

;; ANSWER SECTION:
gru.top1.nearest.of.fatia-pizza.internal.	5	IN	AAAA	fdaa:0:6c8d:a7b:1f63:1:480b:2
;; opcode: QUERY, status: NOERROR, id: 61620
;; flags: qr rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;fra.top1.nearest.of.fatia-pizza.internal.	IN	 AAAA

;; ANSWER SECTION:
fra.top1.nearest.of.fatia-pizza.internal.	5	IN	AAAA	fdaa:0:6c8d:a7b:1f63:1:480b:2

then after a couple of minutes, gru resolves to a different IP

❯ fly dig gru.top1.nearest.of.fatia-pizza.internal -a sumiu-web
;; opcode: QUERY, status: NOERROR, id: 18956
;; flags: qr rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;gru.top1.nearest.of.fatia-pizza.internal.	IN	 AAAA

;; ANSWER SECTION:
gru.top1.nearest.of.fatia-pizza.internal.	5	IN	AAAA	fdaa:0:6c8d:a7b:23c4:1:3fb6:2

Right, so gru.top1... is not a valid DNS entry. fly-ruby doesn’t know about nearest hostnames. That said, with fly-ruby activated, using fatia-pizza.internal should be working. If not, fly-ruby is not hijacking correctly.

What was confusing me was that in FRA, you mentioned you had slow requests when using fra.fatia-pizza.internal as the URL. That should not happen, as fly-ruby does not activate at all in the primary region.

I’m going to set up an app to test as well. One thing you might try is commenting out workers and preload_app! in config/puma.rb. There may be an issue with forking servers interfering with fly-ruby hijacking.

hmm, so does that mean that fly-ruby and top1.nearest.of are incompatible and if you are using fly-ruby you should use the regular <app>.internal instead?

Yes - initially I thought this case was covered, but I was wrong. Again, though, the primary region should never be slow regardless of how fly-ruby is set up. If fly-ruby is activating in the primary region, there’s something really busted :slight_smile:

To be clear: using the raw hostname <app>.internal will give you random IPs. Using top1.nearest should work when we aren’t having issues like the first one that came up here. <region>.<app>.internal should always work in the primary region. fly-ruby should probably, then, make sure to cover all of these cases. I’ll take a look at this and report back.

No, when using fra.fatia-pizza I don’t have any slowness, but then fly-ruby doesn’t really work. This is the hostname that it generates:

:port=>5433, :database=>"sumiu_web", :host=>"gru.fra.fatia-pizza.internal"

I don’t think gru.fra.fatia-pizza is a valid DNS name, since they are resolving to the same IP address right now. For now I’ve set fra.fatia-pizza: no slowness, but I’m also not using the GRU replica.

when I use top1.nearest the db interactions jump from 20 ms to 1166 ms (funny that it is always 1166.6 ms, not sure why) so I’m not really sure what to do (other than not using multi-region deploys)

For now, I recommend turning off multi-region deploys until we can fix fly-ruby to do the right thing here. I’m looking at it now. We’re thinking it should work like:

  • Primary region always uses <region>.<app>.internal
  • Secondary uses top1.nearest.of.<app>.internal

The latter will ensure that a deployment to a region without a replica can still work, even if it’s slower.
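The two rules above can be sketched in plain Ruby like this. It is only an illustration of the proposal, not the code that ended up in fly-ruby; `database_host` and `database_port` are hypothetical names, and the port numbers are simply the ones that show up in the configurations logged earlier in this thread (5432 in the primary region, 5433 elsewhere).

```ruby
# Picks the Postgres hostname for the region this instance runs in.
# Primary region: pin to <region>.<app>.internal.
# Any other region: let DNS pick the closest instance via top1.nearest.of.
def database_host(app:, primary_region:, current_region:)
  if current_region == primary_region
    "#{primary_region}.#{app}.internal"
  else
    "top1.nearest.of.#{app}.internal"
  end
end

# Port choice as observed in this thread's logs (an assumption, not a documented rule):
# 5432 in the primary region, 5433 in the other regions.
def database_port(primary_region:, current_region:)
  current_region == primary_region ? 5432 : 5433
end

# Assumed usage, with the regions read from the environment:
#   database_host(app: "fatia-pizza",
#                 primary_region: ENV["PRIMARY_REGION"],
#                 current_region: ENV["FLY_REGION"])
```

With `primary_region: "fra"`, an instance in fra gets `fra.fatia-pizza.internal` and an instance in gru gets `top1.nearest.of.fatia-pizza.internal`, which still resolves to something even if gru has no replica of its own.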

cool, will do that
I’m gonna be off for a couple of weeks but I will try to help fly-ruby if I can :slight_smile:

your proposal makes sense, I think it should work like that

Going to merge this now, and test it on my end. Feel free to test as well at your convenience: “Fix database connection hijacking” by jsierles (Pull Request #22, superfly/fly-ruby on GitHub).

amazing! that was fast hahahah

this is what I get in GRU

:port=>5433, :database=>"sumiu_web", :host=>"top1.nearest.of.fatia-pizza.internal"

and this is what I get on FRA

:port=>5432, :database=>"sumiu_web", :host=>"fra.fatia-pizza.internal"

slowness is gone, GRU is slightly slower (~20ms), as expected since it has to replay some requests but the 7s delay is gone. gonna leave it up for a couple of hours but apparently, that was it
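For context on the "replay" part, this is roughly the idea as a Rack-style middleware. It is a simplified sketch, not fly-ruby's real implementation: the class name `ReplayWritesInPrimary` is made up, and the 409 status combined with a `fly-replay` response header is my assumption about how the replay is signalled to the proxy.

```ruby
# Hypothetical middleware: requests that may write to the database are
# handed back to the proxy so they can be replayed in the primary region.
class ReplayWritesInPrimary
  # Methods treated as read-only and therefore safe to serve from a replica.
  SAFE_METHODS = %w[GET HEAD OPTIONS].freeze

  def initialize(app, primary_region:, current_region:)
    @app = app
    @primary_region = primary_region
    @current_region = current_region
  end

  def call(env)
    if replay?(env)
      # Assumption: this status/header pair asks the proxy to replay the
      # request in the primary region instead of handling it here.
      [409, { "fly-replay" => "region=#{@primary_region}" }, []]
    else
      @app.call(env)
    end
  end

  private

  # Replay only when we are outside the primary region and the request
  # is not one of the read-only methods.
  def replay?(env)
    @current_region != @primary_region &&
      !SAFE_METHODS.include?(env["REQUEST_METHOD"])
  end
end
```

That extra hop is why a POST handled in GRU costs a bit more than a GET, which matches the small slowdown described above.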

Nice! Yay for test suites. Are you using Redis or ActionCable in this app?

using Redis, but I don’t have a multi-region setup for it, just a single instance (and I deployed my own Redis setup with a custom Dockerfile)

What are you using it for? cache, sidekiq, both?

kinda both. I have a sidekiq instance running but I don’t have any jobs, I set up just in case. I’m using Redis as storage for rack-attack (that uses Rails.cache behind the scene)

Interesting. You could try the new managed redis option. See fly redis create. You can set it up with a replica in gru, and with a single URL, reads and writes should ‘just work’ as expected without intervention from your app. If rack-attack reads from Redis on every request, it might help with overall request latency.

I tried but sidekiq exhausted the 10k commands in a blink of an eye lol

will disable sidekiq and try again, maybe have a managed redis just for cache and drop my own and set up sidekiq when/if I need

good idea, @jsierles