Strange behavior on syd region

Hi there!

I'm seeing strange behavior in our application, but only in the syd region.

We have a Phoenix application with a Postgres cluster running in 3 regions:

  • iad (primary)
  • gru
  • syd

The application works as expected in iad and gru, but it is sometimes really slow in syd.

We have configured AppSignal.
All of these request times are for the same request:

Settings don’t seem to change much between those deployments, so we don’t understand the problem.

If we connect to the interactive terminals in the different regions and execute the same query we can also see the difference in execution time:

We suspect it is a DNS issue in the syd region and this is why the app is sometimes slower.

How can we check which IP our application connects to when it performs the query?
We are using the fly_postgres_elixir package.
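One way to check this from an IEx session on the app instance is to resolve the host in DATABASE_URL with the Erlang resolver. This is a minimal sketch: `DbHostCheck` is a hypothetical module name and not part of fly_postgres, and it only shows what DNS returns for the host, not which address the database connection actually picked.

```elixir
defmodule DbHostCheck do
  @moduledoc "Hypothetical helper for inspecting a database host. Not part of fly_postgres."

  # Extract the hostname from a database URL.
  def host(url) when is_binary(url) do
    URI.parse(url).host
  end

  # Resolve a hostname to its IPv6 addresses, returned as printable strings.
  def resolve(host) when is_binary(host) do
    case :inet.getaddrs(String.to_charlist(host), :inet6) do
      {:ok, addrs} ->
        {:ok, Enum.map(addrs, fn addr -> addr |> :inet.ntoa() |> to_string() end)}

      {:error, reason} ->
        {:error, reason}
    end
  end
end
```

In IEx this could be run as `System.fetch_env!("DATABASE_URL") |> DbHostCheck.host() |> DbHostCheck.resolve()`. If more than one address comes back, the later ones may belong to a more distant region.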

Thanks!

Is your DATABASE_URL host configured to top1.nearest.of.yourapp-db.internal? It sounds like your reads are being routed back to the primary instead of hitting the replica. You can see your full (secret) DATABASE_URL by running fly ssh console and then export | grep DATABASE_URL


Hi @chrismccord! It’s a pleasure talking to you. :slight_smile:

The DATABASE_URL configured is:

But we are using:

This library configures the URL for each node here

Executing that function in the Elixir terminal gives:

Also note that we have the same configuration for the node in gru and we don’t have this problem.

That’s why we think it’s a DNS problem. Does this make sense?

I think you’re right that it’s connecting to the wrong postgres, but I don’t think DNS is the problem. You can test DNS after you’ve SSHed in by running:

apk add bind-tools
dig aaaa top2.nearest.of.brandkitdb-dev.internal

One potential issue is that top2 returns 2 IP addresses. The second one is not very close to the app server. I don’t think the underlying Erlang bits use the second IP address, but it’s possible!

Our Elixir package should probably use top1, I can’t remember why I encouraged top2. I’m pretty sure this isn’t the problem, though.

We’ll see if we can think of anything else. You could get this behavior if the Fly.Postgres configuration isn’t applied properly. Can you post your Ecto config here?


Hi @fedeotaran!

When I was first starting with the fly_postgres library, I would sometimes get different DB connection times. It was stupid frustrating and didn’t make sense. This was back before we had the top2.nearest.of. DNS to help direct it. It was also before I added the specific region to connect to in the DNS… which we’ve since moved away from.

My app was randomly connecting to a PG instance that wasn’t necessarily close. Sometimes it was the close one, sometimes it wasn’t. This reminds me of that. So I am wondering about the top2 part and wondering if top1 is a more reliable option.


Hi @kurt! :slight_smile:

Great! This helps a lot!

This is the output:

So the first option is the iad instance? Right? :thinking:

And my configuration is:

config :brandkit, Brandkit.Repo.Local,
    url: System.fetch_env!("DATABASE_URL"),
    socket_options: [:inet6],
    pool_size: String.to_integer(System.get_env("POOL_SIZE", "10")),
    priv: "priv/repo",
    migration_lock: nil,
    queue_target: 5000

The first is Sydney, the next nearest is IAD.

That config looks right! I guess if Erlang ever selects the second IP, it would cause this problem, so the easiest thing to try here is top1.
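To try top1 without waiting for a library change, the host in the URL could be rewritten before it reaches the Repo config. This is only a sketch under assumptions: `NearestHost.use_top1/1` is a hypothetical helper, not a fly_postgres function, and it assumes the host starts with a `topN.nearest.of.` prefix like the one discussed above.

```elixir
defmodule NearestHost do
  @moduledoc "Hypothetical helper. Not part of fly_postgres."

  # Replace a leading "topN.nearest.of." in the URL's host with "top1.nearest.of.".
  # URLs without a host, or without that prefix, are returned unchanged.
  def use_top1(url) when is_binary(url) do
    case URI.parse(url) do
      %URI{host: host} = uri when is_binary(host) ->
        new_host = String.replace(host, ~r/^top\d+\.nearest\.of\./, "top1.nearest.of.")
        URI.to_string(%{uri | host: new_host})

      _ ->
        url
    end
  end
end
```

The rest of the URL (credentials, port, database name) is left as it was, so only the DNS name that gets resolved changes.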

Yes! Sorry! Syd!

The top1 is Sydney too.

image

Hi @Mark! :slight_smile:

Maybe it is a bug!
What surprises me is that this only happened to me with syd and never with another region :upside_down_face:

My guess is that it may happen with GRU sometimes, but the delay introduced is much less obvious.

I’ll update fly_postgres to use top1 and you can test it out.


I just pushed version 0.2.3 that uses top1 instead. Please give it a try and let me know how it goes!


Updated and tested the changes! The app works very fast! :smiley:

image

Thanks guys! You are amazing! @Mark @kurt @chrismccord


Great! Love it!

I changed the image! All the requests are the same (read requests, ~6 queries each)

Thanks for the follow-up!
