Strange behavior on syd region

fedeotaran · March 1, 2022, 7:50pm

Hi there!

I've noticed strange behavior in our application, but only in the syd region.

We have a Phoenix application with a Postgres cluster running on 3 regions:

iad (primary)
gru
syd

The application works as expected on iad and gru, but we've noticed that it is sometimes really slow on syd.

We have configured AppSignal.
All of these request times are for the same request:

Settings don’t seem to change much between those deployments, so we don’t understand the problem.

If we connect to the interactive terminals in the different regions and execute the same query we can also see the difference in execution time:

We suspect it is a DNS issue in the syd region and this is why the app is sometimes slower.

How can we check which IP our application connects to when it performs the query?
We are using the fly_postgres_elixir package.

Thanks!

chrismccord · March 1, 2022, 8:04pm

Is your DATABASE_URL host configured to top1.nearest.of.yourapp-db.internal? It sounds like your reads are being routed back to the primary instead of hitting the replica. You can see your full (secret) DATABASE_URL by running fly ssh console and then export | grep DATABASE_URL
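
A rough way to run that check from inside fly ssh console is to pull the host out of DATABASE_URL and look at its prefix. This is only a sketch: db_host and nearest_kind are made-up helper names, not part of flyctl or fly_postgres. It assumes the usual scheme://user:pass@host:port/db shape with a hostname rather than a bracketed IPv6 literal.

```shell
# db_host: print the host part of a scheme://user:pass@host:port/db URL.
# Hypothetical helper for illustration only.
db_host() {
  rest=${1#*://}        # drop the scheme
  rest=${rest%%/*}      # drop /database and anything after it
  rest=${rest%%\?*}     # drop a query string that follows the host directly
  rest=${rest##*@}      # drop user:password@ if present
  printf '%s\n' "${rest%%:*}"   # drop :port
}

# nearest_kind: classify a host by its nearest.of prefix.
nearest_kind() {
  case $1 in
    top1.nearest.of.*) echo top1 ;;
    top2.nearest.of.*) echo top2 ;;
    *)                 echo other ;;
  esac
}
```

Usage: nearest_kind "$(db_host "$DATABASE_URL")". A result of other means the host is not a nearest.of name at all.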

fedeotaran · March 1, 2022, 10:56pm

Hi @chrismccord! It’s a pleasure talking to you.

The DATABASE_URL configured is:

But we are using:

This library configures the URL for each node here

Executing that function in the Elixir terminal, we get:

Also note that we have the same configuration for the node in gru and we don’t have this problem there.

That’s why we think it’s a DNS problem. Does this make sense?

kurt · March 1, 2022, 11:14pm

I think you’re right that it’s connecting to the wrong postgres, but I don’t think DNS is the problem. You can test DNS after you’ve SSHed in by running:

apk add bind-tools
dig aaaa top2.nearest.of.brandkitdb-dev.internal

One potential issue is that top2 returns 2 IP addresses. The second one is not very close to the app server. I don’t think the underlying Erlang bits use the second IP address, but it’s possible!

Our Elixir package should probably use top1, I can’t remember why I encouraged top2. I’m pretty sure this isn’t the problem, though.
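
To compare what the two names return, a small wrapper around dig can count the AAAA answers for each. This is a sketch only: count_aaaa and compare_nearest are made-up helpers. It assumes dig is installed (bind-tools, as above) and that it runs from inside the app's private network.

```shell
# count_aaaa: print how many IPv6 addresses a name resolves to.
# Hypothetical helper; `dig +short` prints one answer per line, and
# counting lines that contain ':' keeps only the IPv6 addresses.
count_aaaa() {
  dig +short aaaa "$1" | grep -c ':'
}

# compare_nearest: show the answer counts for top1 and top2 of one
# database app (pass the app name without the .internal suffix).
compare_nearest() {
  for n in top1 top2; do
    name="$n.nearest.of.$1.internal"
    printf '%s %s\n' "$name" "$(count_aaaa "$name")"
  done
}
```

Usage: compare_nearest brandkitdb-dev. If a name returns more than one address, the client library might pick any of them when it connects.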

We’ll see if we can think of anything else. You could get this behavior if the Fly.Postgres configuration isn’t applied properly. Can you post your Ecto config here?

Mark · March 1, 2022, 11:21pm

Hi @fedeotaran!

When I was first starting with the fly_postgres library, I would sometimes get different DB connection times. It was stupid frustrating and didn’t make sense. This was back before we had the top2.nearest.of. DNS to help direct it. It was also before I added the specific region to connect to in the DNS… which we’ve since moved away from.

My app was randomly connecting to a PG instance that wasn’t necessarily close. Sometimes it was the close one, sometimes it wasn’t. This reminds me of that. So I am wondering about the top2 part and wondering if top1 is a more reliable option.

fedeotaran · March 1, 2022, 11:29pm

Hi @kurt!

Great! This helps a lot!

This is the output:

So the first option is the iad instance? Right?

And my configuration is:

config :brandkit, Brandkit.Repo.Local,
    url: System.fetch_env!("DATABASE_URL"),
    socket_options: [:inet6],
    pool_size: String.to_integer(System.get_env("POOL_SIZE", "10")),
    priv: "priv/repo",
    migration_lock: nil,
    queue_target: 5000

kurt · March 1, 2022, 11:31pm

The first is Sydney, the next nearest is IAD.

That config looks right! I guess if Erlang ever selects the second IP, it would cause this problem, so the easiest thing to try here is top1.

fedeotaran · March 1, 2022, 11:35pm

Yes! Sorry! Syd!

The top1 is Sydney too.

Hi @Mark!

Maybe it’s a bug!
What surprises me is that this only happened to me with syd and never with another region.

Mark · March 1, 2022, 11:40pm

My guess is that it may happen with GRU sometimes, but the delay introduced is much less obvious.

I’ll update fly_postgres to use top1 and you can test it out.

Mark · March 2, 2022, 12:58am

I just pushed version 0.2.3 that uses top1 instead. Please give it a try and let me know how it goes!

fedeotaran · March 2, 2022, 10:27pm

Updated and tested the changes! The app works very fast!

Thanks guys! You are amazing! @Mark @kurt @chrismccord

Mark · March 2, 2022, 10:28pm

Great! Love it!

fedeotaran · March 2, 2022, 10:31pm

I changed the image! All the requests are the same (read requests, ~6 queries each).

Mark · March 2, 2022, 10:31pm

Thanks for the follow up!
