Clarification about regional deployment and Postgres databases

I’m trying out regional deployment with a Rails app hitting Postgres. Starting with a single region near me (ams) running both the database and the web app, everything was working fine. Then I added a second region (lax). Naturally, requests hitting LAX responded much more slowly, since they had to send database traffic back to AMS. So, a few questions here.

After a second deployment of the web app, it ended up in SJC and LAX, with SJC showing up as a backup of LAX. Shouldn’t the second instance be in AMS, or a backup of it?

Instances
ID       VERSION REGION DESIRED STATUS  HEALTH CHECKS      RESTARTS CREATED
be1f688e 17      lax    run     running 1 total, 1 passing 0        19m34s ago
6059bef1 17      sjc(B) run     running 1 total, 1 passing 0        20m13s ago

Lebrija:ensayo joshua$ fly regions list
Region Pool:
ams
lax
Backup Region:
fra
lhr
sea
sjc

Also, what would be the winning strategy for the LAX instance accessing Postgres? If I set up an LAX replica, will the default behavior be to read from the replica and write to the master in AMS? Forgive my laziness in not looking up how Stolon operates.

It’s probably best to disable backup regions for your app. Just run flyctl regions backup lax ams.

By default, you’ll connect to port 5432 on the nearest VM. This is actually a Stolon proxy that forwards you to the current cluster leader. So you’re effectively connecting to AMS from everywhere.

You’ll need to use port 5433 to query the replicas. You can also change the hostname in the connection string to <region>.<pg-app-name>.internal to connect to a specific region.
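
For example, assuming a Postgres app named ensayo-db and the FLY_REGION environment variable Fly sets on each VM (a Ruby sketch; the credentials and database name are placeholders):

# Writes: port 5432 on the nearest VM is the Stolon proxy to the current leader.
primary_url = "postgres://user:pass@ensayo-db.internal:5432/app_db"

# Reads: port 5433 on a region-local hostname connects directly to that region's member.
replica_url = "postgres://user:pass@#{ENV['FLY_REGION']}.ensayo-db.internal:5433/app_db"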


OK, thanks. Does this generally mean that any app using a database like Postgres will need some application-level configuration to stay performant? For example, sending reads to the local replicas while writes still go to the cluster leader.

This made me wonder about your Turboku feature, and if the same issue would arise with regional instances connecting back to Postgres.


That’s correct!

You’ll probably want to configure two pools / connection configurations: one that’s read-only (and fast) and one for writes (slower). It does require a bit more configuration, but it should improve performance a lot.
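
In Rails, for example, this could look roughly like the following (a sketch using Rails 6’s multiple-databases support; the primary_replica pool name and the REPLICA_DATABASE_URL variable are illustrative, not something Fly sets for you):

# config/database.yml would define the two pools, e.g.:
#
#   production:
#     primary:
#       url: <%= ENV["DATABASE_URL"] %>          # port 5432, proxied to the leader
#     primary_replica:
#       url: <%= ENV["REPLICA_DATABASE_URL"] %>  # <region>.<pg-app-name>.internal:5433
#       replica: true

# app/models/application_record.rb
class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true

  # Send writes to the leader pool and reads to the region-local replica pool.
  connects_to database: { writing: :primary, reading: :primary_replica }
end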


OK, sounds like fun!

Just to revisit one lingering question: the docs say:

If there are three regions in the pool and the count is set to six, there will be two app instances in each region.

Why then, with two regions set, did I see an instance in LAX and SJC instead of LAX and AMS (or one of its backups)?

OK, I got this all working and it’s pretty impressive! Rails makes it trivial to use read replicas, and benchmarks suggest that both regions are now equally performant on reads.
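
For anyone following along, the Rails side is roughly this (a sketch of the built-in database selector middleware from Rails 6, applied on top of the two pools above; the two-second delay is the Rails default):

# config/environments/production.rb
# GET/HEAD requests read from the :reading role; everything else uses
# :writing, and reads stick to the writer for 2s after a write.
config.active_record.database_selector = { delay: 2.seconds }
config.active_record.database_resolver = ActiveRecord::Middleware::DatabaseSelector::Resolver
config.active_record.database_resolver_context = ActiveRecord::Middleware::DatabaseSelector::Resolver::Session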

I did, however, run into this issue while trying to bring up a second replica in AMS:

2021-04-25T22:52:17.169Z 9f19dfe9 ams [info] keeper            | 2021-04-25 22:52:17.160 UTC [796] FATAL:  hot standby is not possible because max_worker_processes = 1 is a lower setting than on the master server (its value was 8)

It looked like some recent commits might address this, but deploying with the latest image did not resolve it.

Oh that’s a definite bug with the most recent PR we merged. We’ll look at that today.

This is a quirk of our backup regions. The scheduler doesn’t know how to differentiate between them; it just picks one at random. Since you’re running two regions, you’re way better off setting the backups to match the primary regions: fly regions backup lax ams

@jsierles did you scale the VM size down at any point?

Nope!

We located this bug; it should be fixed today.


@jsierles this should be fixed now. I see you’re running a custom image so I don’t want to change it, but this will get you updated: flyctl deploy -i flyio/postgres-ha:latest

edit: I forgot that flyctl deploy -i needs a fly.toml file… if you don’t already have one, run flyctl config save, then deploy.

Great, works!


Sometime today, I started seeing errors indicating that the DB reachable on port 5432 had gone read-only. I’m not sure exactly how to debug this, so I am now in the process of rolling my app back to using only the primary for reads. For now I’ve also removed the second region and have only one primary/replica pair in AMS. There, however, the replica seems to be running but is not up to date.

Can you help debug? App name is ensayo-db.

Whoops, I misread the port number! Disregard this one.

–before edit–
Were you connecting to ensayo-db.internal:5432? That hostname gives back all member IP addresses, and it is normal for port 5432 on replicas to respond read-only. If this started abruptly, it could just mean our DNS returned IPs in a different order.

You can check fly status to see which DB is currently primary. You can also run fly checks list to get a list of current check statuses; this will show you whether the replicas are up to date. You’re looking for something like this:

[✓] replication lag: 0s

Which address should I use to ensure I always get a leader?

Here’s what I see:

Instances
ID       VERSION REGION DESIRED STATUS            HEALTH CHECKS      RESTARTS CREATED
0448f13c 13      ams    run     running (replica) 3 total, 3 passing 0        22m20s ago
b7896061 13      ams    run     running (leader)  3 total, 3 passing 0        32m51s ago

Lebrija:postgres-ha joshua$ fly checks list
Health Checks for ensayo-db
NAME STATUS  ALLOCATION REGION TYPE   LAST UPDATED OUTPUT
vm   passing 0448f13c   ams    SCRIPT 2m53s ago    [✓] 9.17 GB (93.7%) free space
                                                   on /data/ [✓] load averages:
                                                   0.03 0.05 0.04 [✓] memory:
                                                   0.0s waiting over the last 60s
                                                   [✓] cpu: 1.7s waiting over the
                                                   last 60s [✓] io: 0.0s waiting
                                                   over the last 60s
pg   passing 0448f13c   ams    SCRIPT 13m11s ago   [✓] leader check:
                                                   [fdaa:0:22b7:a7b:aa0:0:1985:2]:5433
                                                   connected [✓] replication
                                                   lag: 0s [✓] proxy check:
                                                   [fdaa:0:22b7:a7b:aa3:0:1984:2]:5432
                                                   connected [✓] connections: 7 used,
                                                   3 reserved, 300 max
role passing 0448f13c   ams    SCRIPT 17m2s ago    replica
vm   passing b7896061   ams    SCRIPT 5m35s ago    [✓] 8.95 GB (91.5%) free space
                                                   on /data/ [✓] load averages:
                                                   0.02 0.07 0.06 [✓] memory:
                                                   0.0s waiting over the last 60s
                                                   [✓] cpu: 2.0s waiting over the
                                                   last 60s [✓] io: 0.0s waiting
                                                   over the last 60s
pg   passing b7896061   ams    SCRIPT 8m10s ago    [✓] replication: currently
                                                   leader [✓] proxy check:
                                                   [fdaa:0:22b7:a7b:aa0:0:1985:2]:5432
                                                   connected [✓] connections: 13 used,
                                                   3 reserved, 300 max
role passing b7896061   ams    SCRIPT 32m38s ago   leader

Ahhh! I misread that port number. Port 5432 should always proxy you to the leader, which is currently b7896061. Do you know what region you were connecting to when you got read-only errors from it?

All your health checks look good but it sounds like something went wonky before you removed your other replica. If you were connecting to port 5432 from another region, it might have just been a temporary network blip keeping it from talking to the primary.

Port 5433 is direct to Postgres, my bad!

I was previously connecting to a specific region’s port 5433 to get the replica for reads only, like this:

ENV['DATABASE_URL'].gsub("ensayo-db.internal:5432", "#{ENV['FLY_REGION']}.ensayo-db.internal:5433")

Right now, even in AMS, I’m seeing this issue, so I’m reverting the app to read/write from the master. Let’s see how that goes.

OK - now things are back in business. I’m not sure what happened, but here was the order of events:

AMS ran the primary, LAX a replica. App instances in each region connected to their local replica for reads and to port 5432 for writes. This was working fine.

Sometime today, something caused a permanent failure leading to write errors on the master, but I’m guessing this was happening at the application level, as I was able to use a console to make DB updates. I’ll try setting up a staging environment to test this behavior again.

That’s super weird, those health checks are showing that they can connect to 5432 and write just fine! Can you show me the exact error you got? I wonder if you’re somehow reaching a stale IP address.

ActiveRecord::ReadOnlyError (Write query attempted while in readonly mode: UPDATE "people" SET "updated_at" = $1...

So this is a Rails app using the built-in Rails support for splitting reads and writes. It’s possible only certain code paths were affected and I just didn’t notice until now, when they were triggered. So I’ll investigate in a staging env and report back!

Just out of curiosity, what does this log entry mean?

2021-04-30T22:21:48.354Z b7896061 ams [info] sentinel | 2021-04-30T22:21:48.348Z WARN cmd/sentinel.go:276 no keeper info available {"db": "873b4bd0", "keeper": "fdaa022b7a7b8501a202"}