OK - for now I’d suggest keeping it single region until we figure out the issue. It’s possible the latest version of fly-ruby
is somehow using the wrong URL. We’ll take a look at this and think about what else might cause it.
sure, will do that
thanks again!
I’m doing some debugging, trying to pinpoint the issue, so far, doesn’t seem like the fly-ruby is the culprit here.
this is what I’ve done so far:
- Set PRIMARY_REGION back (fra)
- scaled out postgres (fra and gru) and the web app (same regions)
- set the DATABASE_URL to fatia-pizza.internal:5432. no fra.fatia-pizza nor top1.nearest.of
with that set up I’m logging the value of ActiveRecord::Base.connection_db_config.configuration_hash on GET and POST requests
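For anyone wanting to reproduce this kind of logging, here is a minimal sketch of a Rack-style middleware that prints the connection config for each request. The class name DbConfigLogger and its keyword arguments are made up for illustration; the only pieces taken from the thread are the configuration_hash call and the masked "xx" credentials. It assumes FLY_REGION is set in the VM environment.

```ruby
# Hypothetical middleware (not part of fly-ruby or Rails): logs the database
# config seen by each request, with credentials masked as in the output above.
class DbConfigLogger
  SECRET_KEYS = %i[username password].freeze

  # config_source is injectable so the class can be exercised outside Rails.
  def initialize(app, config_source: nil, out: $stdout)
    @app = app
    @out = out
    @config_source = config_source ||
      -> { ActiveRecord::Base.connection_db_config.configuration_hash }
  end

  def call(env)
    # Build a masked copy; the original hash is left untouched.
    config = @config_source.call.to_h do |key, value|
      [key, SECRET_KEYS.include?(key) ? "xx" : value]
    end
    region = ENV.fetch("FLY_REGION", "unknown")
    @out.puts "#{env['REQUEST_METHOD']} on #{region} Configuration: #{config}"
    @app.call(env)
  end
end
```

In a Rails app this could be registered with config.middleware.use DbConfigLogger in config/application.rb.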
this is a GET request on gru
Configuration: {:pool=>50, :adapter=>"postgresql", :username=>"xx", :password=>"xx", :port=>5433, :database=>"sumiu_web", :host=>"gru.fatia-pizza.internal"}
this is a GET request on fra
Configuration: {:pool=>50, :adapter=>"postgresql", :username=>"xx", :password=>"xx", :port=>5432, :database=>"sumiu_web", :host=>"fatia-pizza.internal"}
this is a POST request on fra
Configuration: {:pool=>50, :adapter=>"postgresql", :username=>"xx", :password=>"xx", :port=>5432, :database=>"sumiu_web", :host=>"fatia-pizza.internal"}
and the POST request on gru gets routed to fra and it is the same output as above.
so far, with this config, I saw one or two slow requests routed from gru to fra but I haven’t been able to reproduce it anymore.
one thing is not clear to me: why (or when) should I use top1.nearest.of when fly-ruby can do pretty much the same thing? could the issue be due to having top1.nearest.of and fly-ruby kinda fighting over the database URL, like fly-ruby says “I want this db” and the resolution says “no, have this one”? does that make sense?
I’ll let the app run for a bit with this setup and see what happens
This is probably the issue. It needs to use top1.nearest.of.fatia-pizza.internal (or fra.fatia-pizza.internal). fatia-pizza.internal returns all IPs for an app, regardless of region.
I think fly-ruby might be setting the wrong hostname for the primary region.
let me update the URL and see what happens
on gru it sets gru.top1.nearest.of.fatia-pizza.internal and on fra it uses top1.nearest.of.fatia-pizza.internal
but here is the weird thing: fly dig gru and fly dig fra often resolve to the same ip:
❯ fly dig gru.top1.nearest.of.fatia-pizza.internal -a sumiu-web
;; opcode: QUERY, status: NOERROR, id: 65091
;; flags: qr rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;gru.top1.nearest.of.fatia-pizza.internal. IN AAAA
;; ANSWER SECTION:
gru.top1.nearest.of.fatia-pizza.internal. 5 IN AAAA fdaa:0:6c8d:a7b:1f63:1:480b:2
;; opcode: QUERY, status: NOERROR, id: 61620
;; flags: qr rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;fra.top1.nearest.of.fatia-pizza.internal. IN AAAA
;; ANSWER SECTION:
fra.top1.nearest.of.fatia-pizza.internal. 5 IN AAAA fdaa:0:6c8d:a7b:1f63:1:480b:2
then after a couple minutes, gru resolves to a different IP
❯ fly dig gru.top1.nearest.of.fatia-pizza.internal -a sumiu-web
;; opcode: QUERY, status: NOERROR, id: 18956
;; flags: qr rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;gru.top1.nearest.of.fatia-pizza.internal. IN AAAA
;; ANSWER SECTION:
gru.top1.nearest.of.fatia-pizza.internal. 5 IN AAAA fdaa:0:6c8d:a7b:23c4:1:3fb6:2
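The same comparison can be scripted from inside the app. This is a rough sketch using Ruby's standard Resolv library; the helper names aaaa_addresses and same_answer? are invented here, and it assumes the code runs in a VM on the app's private network, where .internal names resolve.

```ruby
require "resolv"

# Returns the sorted AAAA answers for a name as strings.
# The resolver is injectable so the helper can be tried without network access.
def aaaa_addresses(name, resolver: Resolv::DNS.new)
  resolver
    .getresources(name, Resolv::DNS::Resource::IN::AAAA)
    .map { |record| record.address.to_s }
    .sort
end

# True when both names currently resolve to the same set of addresses.
def same_answer?(name_a, name_b, resolver: Resolv::DNS.new)
  aaaa_addresses(name_a, resolver: resolver) ==
    aaaa_addresses(name_b, resolver: resolver)
end
```

For example, same_answer?("gru.top1.nearest.of.fatia-pizza.internal", "fra.top1.nearest.of.fatia-pizza.internal") would report whether the two lookups above agree at that moment. The records have a TTL of 5 seconds, so repeated calls can differ.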
Right, so gru.top1... is not a valid DNS entry. fly-ruby doesn’t know about nearest hostnames. That said, with fly-ruby activated, using fatia-pizza.internal should be working. If not, fly-ruby is not hijacking correctly.
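To illustrate why those generated names break, here is a small sketch. It is not fly-ruby's actual code: it only assumes, based on the hostnames logged in this thread, that the replica host is built by prepending the current region to whatever host DATABASE_URL already has. Both function names are hypothetical, and the validity check covers only the three name forms mentioned in this thread, with three-letter region codes assumed.

```ruby
# Assumption for illustration: the region is simply prepended to the host.
def naive_replica_host(configured_host, region)
  "#{region}.#{configured_host}"
end

# Accepts only the name forms discussed in this thread:
#   <app>.internal, <region>.<app>.internal, top1.nearest.of.<app>.internal
def known_internal_form?(host, app)
  return true if host == "#{app}.internal"
  return true if host == "top1.nearest.of.#{app}.internal"

  host.match?(/\A[a-z]{3}\.#{Regexp.escape(app)}\.internal\z/)
end

app = "fatia-pizza"
[
  "fatia-pizza.internal",
  "fra.fatia-pizza.internal",
  "top1.nearest.of.fatia-pizza.internal"
].each do |configured|
  generated = naive_replica_host(configured, "gru")
  puts "#{configured} -> #{generated} (known form: #{known_internal_form?(generated, app)})"
end
```

Under that assumption, only the bare fatia-pizza.internal setting yields a name in one of the known forms once the region is prepended.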
What was confusing me was that in FRA, you mentioned you had slow requests when using fra.fatia-pizza.internal as the URL. That should not happen, as fly-ruby does not activate at all in the primary region.
I’m going to set up an app to test as well. One thing you might try is commenting out workers and preload_app! in config/puma.rb. There may be an issue with forking servers interfering with fly-ruby hijacking.
hmm, so does that mean that fly-ruby and top1.nearest.of are incompatible and if you are using fly-ruby you should use the regular <app>.internal instead?
Yes - initially I thought this case was covered, but I was wrong. Again, though, the primary region should never be slow regardless of how fly-ruby is set up. If fly-ruby is activating in the primary region, there’s something really busted.
To be clear: using the raw hostname <app>.internal will give you random IPs. Using top1.nearest should work when we aren’t having issues like the first one that came up here. <region>.<app>.internal should always work in the primary region. fly-ruby should probably, then, make sure to cover all of these cases. I’ll take a look at this and report back.
No, when using fra.fatia-pizza I don’t have any slowness, but then fly-ruby doesn’t really work, this is the hostname that it generates
:port=>5433, :database=>"sumiu_web", :host=>"gru.fra.fatia-pizza.internal"
I don’t think gru.fra.fatia-pizza is a valid DNS name, since they are right now resolving to the same IP address. Right now, I set fra.fatia-pizza. no slowness but also I’m not using the GRU replica.
when I use top1.nearest the db interactions jump from 20ms to 1166 ms (funny that it is always 1166.6 ms, not sure why) so I’m not really sure what to do (other than not using multi region deploys)
For now, I recommend turning off multi-region deploys until we can fix fly-ruby to do the right thing here. I’m looking at it now. We’re thinking it should work like:
- Primary region always uses <region>.<app>.internal
- Secondary uses top1.nearest.of.<app>.internal
The latter will ensure that a deployment to a region without a replica can still work, even if it’s slower.
cool, will do that
I’m gonna be off for a couple of weeks but I will try to help fly-ruby if I can
your proposal makes sense, I think it should work like that
Going to merge this now, and test it on my end. Feel free to test as well at your convenience. Fix database connection hijacking by jsierles · Pull Request #22 · superfly/fly-ruby · GitHub
amazing! that was fast hahahah
this is what I get in GRU
:port=>5433, :database=>"sumiu_web", :host=>"top1.nearest.of.fatia-pizza.internal"
and this is what I get on FRA
:port=>5432, :database=>"sumiu_web", :host=>"fra.fatia-pizza.internal"}
slowness is gone, GRU is slightly slower (~20ms), as expected since it has to replay some requests but the 7s delay is gone. gonna leave it up for a couple of hours but apparently, that was it
Nice! Yay for test suites. Are you using Redis or ActionCable in this app?
using Redis, but I don’t have a multi-region setup for it, just a single instance (and I deployed my own Redis setup with a custom Dockerfile)
What are you using it for? cache, sidekiq, both?
kinda both. I have a sidekiq instance running but I don’t have any jobs, I set it up just in case. I’m using Redis as storage for rack-attack (that uses Rails.cache behind the scenes)
Interesting. You could try the new managed redis option. See fly redis create. You can set it up with a replica in gru, and with a single URL, reads and writes should ‘just work’ as expected without intervention from your app. If rack-attack reads from Redis on every request, it might help with overall request latency.
I tried but sidekiq exhausted the 10k commands in a blink of an eye lol
will disable sidekiq and try again, maybe have a managed redis just for cache and drop my own and set up sidekiq when/if I need
good idea, @jsierles