PG Cluster - Replication lag

fedeotaran · September 23, 2021, 10:57pm

Hello!

We want to use fly to migrate our Elixir application.
We were able to configure our application as a cluster but we want to do the same for the Postgres application.

I created Postgres Cluster on two regions but the application are really slow.

❯ fly status -a brandkit-pg-st
App
  Name     = brandkit-pg-st
  Owner    = brandkit
  Version  = 4
  Status   = running
  Hostname = brandkit-pg-st.fly.dev

Deployment Status
  ID          = e20569c0-f788-c5c1-ddcb-26aea1dfe55c
  Version     = v4
  Status      = successful
  Description = Deployment completed successfully
  Instances   = 3 desired, 3 placed, 3 healthy, 0 unhealthy

Instances
ID       TASK VERSION REGION DESIRED STATUS            HEALTH CHECKS      RESTARTS CREATED
ede4b9e2 app  4       gru    run     running (replica) 3 total, 3 passing 0        2h29m ago
cc6ceb2b app  4       iad    run     running (replica) 3 total, 3 passing 0        2h30m ago
a192ee2e app  4       iad    run     running (leader)  3 total, 3 passing 0        2h31m ago

On the board we see that the replication delay increases all the time.

I scale the vm-size and have the same behaviour.

Could you help us understand what happens?

Thanks!

kurt · September 24, 2021, 2:00pm

When you run a database multi region, you need to do some work to make sure your app connects to the nearest. Your app instance in São Paulo is probably connecting to the database in Virginia.

This will get easier, but here are docs for connecting to postgres in the same region: Multi-region PostgreSQL

Your best bet is probably to run app and database in a single region at first, and then add regions when you’re comfortable with performance. We’ll have a hex package to automate what you need in Elixir soon.

fedeotaran · September 24, 2021, 2:48pm

Thanks for the answer!

The strategy we used was to create 2 ecto repo, one for the leader (read / write) and one for the replicas (read only). If the node is in the same region as the leader, both repositories will have the same database url.

The configuration looks:

fly ssh console -s -a shy-rain-1653
? Select instance: gru.shy-rain-1653.internal
Connecting to [fdaa:0:332b:a7b:1f60:5836:bb9a:2]... complete
/ # /app/bin/brandkit remote
Erlang/OTP 24 [erts-12.0.2] [source] [64-bit] [smp:1:1] [ds:1:1:10] [async-threads:1] [jit:no-native-stack]

Interactive Elixir (1.12.1) - press Ctrl+C to exit (type h() ENTER for help)
iex(shy-rain-1653@fdaa:0:332b:a7b:1f60:5836:bb9a:2)1> Application.get_env(:brandkit, Brandkit.Repo)
[
  url: "postgres://xxx:xxx@brandkit-pg-st.internal:5432/shy_rain_1653?sslmode=disable",
  socket_options: [:inet6],
  pool_size: 10
]
iex(shy-rain-1653@fdaa:0:332b:a7b:1f60:5836:bb9a:2)2> Application.get_env(:brandkit, Brandkit.RepoReadOnly)
[
  url: "postgres://xxx:xxx@gru.brandkit-pg-st.internal:5432/shy_rain_1653?sslmode=disable",
  socket_options: [:inet6],
  pool_size: 10
]
iex(shy-rain-1653@fdaa:0:332b:a7b:1f60:5836:bb9a:2)3>

We explicitly use this RepoReadOnly in some queries. We do not all migrate for now.

I followed that post High Availability & Global Replication · Fly Docs and took some ideas from this Multi region database guide discussion.

The difference in the time of the requests is big.

GRU (using replica and leader)
app[33dcfef1] gru [info] 19:37:29.017 request_id=FqWzLO6dfiNa7ZoAAAXh [info] Sent 200 in 3462ms

IAD (using leader)
app[2e575cad] iad [info] 19:51:03.302 request_id=FqWz60zLOpGGZIIAAAbh [info] Sent 200 in 122ms

These requests only have read operations, but they use both repo.

We’re sure that our actual configuration is not the best, so it’s going to be awesome!

We will continue testing, we just want to share our experience and try to better understand how is the correct way to use our PG cluster.

Thanks!

kurt · September 24, 2021, 2:54pm

For the writer, you’ll be better off with a regional URL. brandkit-pg-st.internal could actually be connecting you to any region.

postgres://xxx:xxx@iad.brandkit-pg-st.internal:5432

More generally, querying a Postgres in Virginia from São Paulo will always be slow. If you’re doing async writes for something like LiveView that might be ok. If you’re doing it within an HTTP request, though, it will break. You are much, much better off using the fly-replay header when you need your code to work with the primary database.

kurt · September 24, 2021, 6:18pm

Oh, also, that lag metric is deceptive. It is “time since last write”. If your DB isn’t getting writes, it will continue to increase.

fedeotaran · September 24, 2021, 7:26pm

Oh! I see. To be sure.

You say it is always better for the application to connect to the database from the same region.

With this scenario:

Is option 1 better over option 2?

Option 1 is implemented with the fly-replay header and option 2 is our actual configuration.

–

Ohhh! That explains a lot of things!

Thanks!

kurt · September 24, 2021, 7:51pm

Option one is probably better. It’s better to have the code that does a write close to the database instance. For HTTP requests, I would do all reads from Brazil and replay all write requests to Virginia.

We have some details on what we’ve seen here: Globally Distributed Postgres · Fly

@mark-1 is working on a way to RPC code to the primary region, as well, so that’ll be an option soon!

fedeotaran · September 27, 2021, 1:11pm

Awesome! We will be waiting for that code

Thanks for the explanation, patience and kindness.

Mark · September 27, 2021, 9:59pm

Yes! I’m hoping to share the code as a library soon-ish here in the community. It’s an approach that helps move all the writes to the primary and let the reads happen on the replica. Many typical apps are read-heavy, so for those it makes a lot of sense. If that model fits well with your app then I think it will be a really great solution. I’m also presenting on it at Elixir Conf in October. So I’m really working to finish it up!!

fedeotaran · September 29, 2021, 1:07am

We look forward to that talk!

Topic		Replies	Views
fly_postgres questions Phoenix	44	1996	November 17, 2022
Fly Postgres Cluster on V1 Questions / Help elixir , postgres	12	532	November 19, 2022
Multi-region postgres + elixir using fly_postgres and fly_rpc Phoenix	11	1212	November 17, 2022
Strange behavior on syd region Questions / Help elixir , postgres	13	471	March 2, 2022
DB migration for Postgres cluster (in a Phoenix app, using `fly_postgres`)	2	393	July 8, 2023

PG Cluster - Replication lag

Related topics