Postgres replication seems to fail

wowjesus · September 28, 2021, 6:02pm

Hello,
We run a globally scaled image hosting service and we recently noticed that if someone from Sydney uploads something, it appears in the lead database but not across the replicas (so the user won’t see it on their dashboard). Internally our backend uses 2 Prisma (ORM) instances, one for read and one for write. Write always connects to the leading cluster in Amsterdam and read connects to the current region’s replica (this eliminates the need for the fly-replay header, if I understood the docs correctly).
Has anyone experienced an issue similar to this and if so how did you solve it?
Thanks in advance

kurt · September 28, 2021, 6:25pm

How large are these uploads? And how exactly are you writing to the primary database?

If it’s a relatively large file, there’s probably just some replication lag between when it gets written and when the replica is up to date. Our Ruby library handles this by sending requests to the primary region for 5s after a write, but there are other techniques that could work too.
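As a rough illustration of the "read from the primary for a few seconds after a write" idea, here is a minimal TypeScript sketch. It is not the Ruby library's implementation; `createStickyRouter`, the session key, and the in-memory map are all hypothetical names and choices made for this example, and the two targets would map to the separate read and write clients described above.

```typescript
// Hypothetical helper: routes reads to the primary for a short window
// after a write, so a user does not read stale data from a lagging replica.
type Target = "primary" | "replica";

function createStickyRouter(
  windowMs: number = 5000,            // 5s mirrors the window mentioned above
  now: () => number = Date.now        // injectable clock, useful for testing
) {
  // Last write time per user/session key (in memory, per process).
  const lastWriteAt = new Map<string, number>();

  return {
    // Call this after a successful write for the given user/session.
    recordWrite(key: string): void {
      lastWriteAt.set(key, now());
    },

    // Decide which connection the next read for this key should use.
    readTarget(key: string): Target {
      const last = lastWriteAt.get(key);
      if (last === undefined) return "replica";
      if (now() - last < windowMs) return "primary";
      lastWriteAt.delete(key); // window expired, stop tracking this key
      return "replica";
    },
  };
}
```

Because the map lives in one process, this only helps when the same app instance serves the follow-up read; a cookie or similar per-user marker would be one way to carry the timestamp across instances.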

If you run fly checks list -a <postgres-name> you can see replication lag check status, too.

wowjesus · September 28, 2021, 6:29pm

-How large are these uploads?
The files are uploaded to AWS before the write request to the DB even begins (most of them are a few kilobytes, mostly screenshots).
-And how exactly are you writing to the primary database?
Not sure what you mean here, here is a link to Prisma for reference https://www.prisma.io/
-If you run…
Didn’t know about that, thank you.

kurt · September 28, 2021, 6:34pm

Oh I missed the note about read/write connections on Prisma. This is a hard problem, but the typical trick is to read from the write connection after an upload.

Does this clear up after a few moments? Also, silly question, but is the Sydney instance of your database healthy? fly status should show you, at least.

Assuming it’s healthy and the data shows up after a refresh, you’ll need to build some logic to work around replication lag.

The fly-replay approach solves this for most people. It would be worth experimenting with unless you have a reason to do writes directly to postgres over a long distance.
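A minimal sketch of the fly-replay idea, assuming the app knows its own region and the primary region (for example from environment variables; which variables to read is an assumption left to the caller). `decideReplay` is a hypothetical helper, not part of any Fly library: for a write request arriving outside the primary region, it returns the `fly-replay` response header that asks Fly's proxy to replay the request in the primary region.

```typescript
// Hypothetical helper: decide whether an incoming request should be
// replayed in the primary region instead of being handled locally.
interface ReplayDecision {
  replay: boolean;
  headers: Record<string, string>;
}

// Treating these HTTP methods as "writes" is an assumption for this sketch.
const WRITE_METHODS = new Set(["POST", "PUT", "PATCH", "DELETE"]);

function decideReplay(
  method: string,
  currentRegion: string,  // region this app instance runs in, e.g. "syd"
  primaryRegion: string   // region of the writable database, e.g. "ams"
): ReplayDecision {
  const isWrite = WRITE_METHODS.has(method.toUpperCase());
  if (isWrite && currentRegion !== primaryRegion) {
    // The app responds with this header and does not handle the request itself.
    return { replay: true, headers: { "fly-replay": `region=${primaryRegion}` } };
  }
  // Reads, and writes already in the primary region, are handled locally.
  return { replay: false, headers: {} };
}
```

In a Fastify app this check could run in a request hook before any database work; with this approach each instance only ever talks to the database in its own region.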

wowjesus · September 28, 2021, 6:40pm

-Does this clear up after a few moments?
I just checked by switching to a VPN in Melbourne (I’m in Hungary, so I’m connected to the leading cluster), and no it doesn’t, no data seems to replicate.

-Is the Sydney instance of your database healthy
Yes
-The fly-replay approach solves this for most people.
The main problem is that Prisma abstracts the database logic, and Node doesn’t even connect to the DB, the underlying Rust engine does. But if there is no other solution, I might try to figure something out.

kurt · September 28, 2021, 6:43pm

We have the beginnings of a node library that makes handling fly-replay almost seamless. Are you using Express or Fastify? See: Fly PG Read Replica Multi-Region Clusters with Prisma / Node - #4 by joshua

I’m going to look at your database to see what’s up. Replication should not be failing unless the DB instance is failing somehow (and you’d see health checks for that).

wowjesus · September 28, 2021, 6:48pm

-Are you using Express or Fastify
We are using Nestjs with the Fastify adapter.
-I’m going to look at your database to see what’s up.
Thank you
-We have the beginnings of a node library that makes handling fly-replay almost seamless.
Looking forward to using it and contributing.

shaun · September 28, 2021, 7:55pm

@wowjesus I noticed that your app is running an older image that may be contributing to the health check failures I’m seeing. Do you mind if I push through an update to get your app on our latest image? The latest image also contains improved replication lag checks, which would be useful to reference in this case.

wowjesus · September 29, 2021, 5:07am

Sorry for the delay in replying, I took a quick nap. And no, I absolutely don’t mind, thank you!

shaun · September 29, 2021, 1:51pm

No problem at all! I went ahead and pushed through the image update, and it looks like it cleared the failing health checks. You should now see the new replication lag health checks by running fly checks list.

Also, you can get a more detailed view of the state of replication by running select * from pg_stat_replication; against master.
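For anyone who wants to check that view from application code, here is a small sketch. The query uses real columns of `pg_stat_replication` (one row per connected replica); `findReplicationProblems` and the row type are hypothetical names for this example, and running the query with a database client is left out.

```typescript
// Query to run against the primary; columns are a subset of pg_stat_replication.
const REPLICATION_QUERY =
  "SELECT application_name, client_addr, state FROM pg_stat_replication";

// Shape of one result row (hypothetical type for this sketch).
interface ReplicationRow {
  application_name: string;
  client_addr: string | null; // null when the replica connects via a Unix socket
  state: string;              // e.g. "streaming", "catchup", "startup"
}

// Returns human-readable problems; an empty array means every replica is streaming.
function findReplicationProblems(rows: ReplicationRow[]): string[] {
  if (rows.length === 0) {
    // No rows means no replica is currently connected to the primary.
    return ["no replicas are connected to the primary"];
  }
  return rows
    .filter((row) => row.state !== "streaming")
    .map(
      (row) =>
        `${row.application_name} (${row.client_addr ?? "local"}) is in state "${row.state}"`
    );
}
```

An empty result on the primary would match the symptom described earlier in the thread, where no data reached the replicas at all.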

Hope that helps!

wowjesus · September 29, 2021, 2:51pm

Thank you so much, the issue seems to be solved. I appreciate the fast replies and help!

wowjesus · September 29, 2021, 2:51pm

Also one more question, where can I see if a new version of the pg image gets pushed?
Github?

shaun · September 29, 2021, 3:13pm

Also one more question, where can I see if a new version of the pg image gets pushed?
Github?

That’s a great question. While you could monitor our GitHub repo for new releases, there’s currently no way for you to know which version you’re on. That being said, this is a problem that I am actively working on and I’m hoping to have it addressed within the next week or two. So be on the lookout.

wowjesus · September 29, 2021, 3:26pm

I will, thank you.
