I'm seeing slow ActiveRecord database interactions

Hey :wave:

I’m seeing a craaazy slow AR interactions on my app for a couple of hours. Here is what the app is doing:

signed_content = crypt.encrypt_and_sign(data[:content])
message = Message.create!(content: signed_content, password:, mid:, burn_after_read:)
Success(message:)

That’s pretty much it. but the request is consistently slow, taking 7, 10 seconds

...INFO -- : [92be833b-5d57-4f5f-bd86-b89362bdffcc] Completed 200 OK in 10765ms (Views: 4.3ms | ActiveRecord: 7749.6ms | Allocations: 21196)

I instrumented the app with Skylight like this

instrument("saving") do
  message = Message.create!(content: signed_content, password:, mid:, burn_after_read:)
end

and this is what the response looks like

the replication lag of my postgres app is weird too:

Could this be the cause of the slow AR queries?

Update: I scaled in the Postgres from 2 to 1, disabling the replication and the issue is gone…

Completed 200 OK in 256ms (Views: 0.6ms | ActiveRecord: 4.5ms | Allocations: 1960)

It is the replication lagging behind for sure.

Any idea how to fix it?

Hi @luizkowalski

Can you send me the name of this app so we can take a look on those issues?

hey lubien

the app name is fatia-pizza-app and the pg app is called fatia-pizza.

just to update that I don’t see this issue anymore, but I do see a big repl. lag in the last 6 hours:

Hey @lubien, any news?

I’m still seeing problems now and then. up until now, it was all good, now the issue is back…

Also, I deployed with names, the app is now called sumiu-web while postgres hasn’t changed

Hi @luizkowalski, it seems you’re running this app in both Frankfurt and São Paulo. Is it possible you have app instances trying to reach pg across the world? That would be really slow.

Unlikely for three reasons:

  1. I’m currently in Berlin and I can see the app routing to fra
  2. I have fly-ruby gem and I inspected the config on prod, looks right
  3. the issue goes away when I scaled down the postgres to 1 replica

I’m not entirely sure but I would blame the replication lagging. does that make sense? could a lag be causing it?

It’s unlikely this would be caused by replication lag. That would lead to stale data, but not slow requests.

@luizkowalski Are you using puma or unicorn? If so, are you running more than one process per VM? On puma, this is usually set via WEB_CONCURRENCY which sets the workers in config/puma.rb. By default, I believe you’ll get 2 workers.

If that’s the case, you may be running into this known issue: Connection hijacking in secondary regions must take forking servers into account · Issue #12 · superfly/fly-ruby · GitHub

A quick fix is to set concurrency to 1. I have a fix slated for the core problem, but haven’t gotten around to it.

I will give it a try for sure, thanks a lot @jsierles!

I still wonder why is it that the problem goes away when I scale down the postgres app…

my problem here is that I can’t easily reproduce, sometimes I open the app and it is just slow, I then scale down the postgres app, and boom, the issue is gone, later I scale up and I don’t see the problem for hours until the cycle begins again.

anyway, I will tweak my puma config, maybe disabling workers for good until it is fixed

also, I think it is worth notice that I’m having this issue on my primary region (fra), not on a secondary region (gru)

What does the DATABASE_URL in your environment look like? Is it something like top1.nearest.of.<db>.internal? Or is it just <db>.internal?

The second form will return any postgres instance. So your app in Frankfurt could be connecting to your DB in São Paulo, and then getting routed back to Frankfurt from there.

irb(main):001:0> ENV['DATABASE_URL']
=> "postgres://...@top2.nearest.of.fatia-pizza.internal:5432/sumiu_web"

this is what it looks like actually, in both instances but on gru I see that the fly-ruby does the job and routes to 5433

Do you have two database nodes running in frankfurt or just one? If you only have one, that hostname is returning IPs from both regions.

ohh ok looks like we are getting somewhere…

this is what I have:

postgres:

VM Resources for fatia-pizza
        VM Size: shared-cpu-1x
      VM Memory: 256 MB
          Count: 2
 Max Per Region: Not set

these are the allocations

and two volumes

this morning I recreate the web app and I connected the postgres to it running fly pg attach --postgres-app fatia-pizza -a sumiu-web

this is where the URL came from

should it be top1?

Yeah, top1 will work better for you. We deploy 2 Postgres instances in your primary region for write redundancy, so we do top2 by default. It’s actually best to run 2x fra DBs if you can. Changing to top1 should fix the problem right now though.

If I understand what you said, I should add a new “fra” volume?

anyway, you said “we do top2 by default”. why is that exactly? I’m asking because the docs don’t mention “top2”, it is always “top1” and I got this URL by attaching postgres to my app, so fly actually set “top2”.

when you say “top1 will work better for you” I wonder if there are cases where top2 is better? if so, when?

sorry, lots of questions hahaha trying to get to the bottom of it