Clarification about regional deployment and Postgres databases

I’m trying out regional deployment with a Rails app hitting Postgres. Starting with a single region near me (ams) running both the database and the web app, everything was working fine. Then I added a second region (lax). Naturally, requests hitting LAX responded much more slowly, since they had to send database traffic back to AMS. So, a few questions here.

After a second deployment of the web app, it ended up in SJC and LAX, with SJC showing up as a backup of LAX. Shouldn’t the second instance be in AMS, or a backup of it?

Instances
ID       VERSION REGION DESIRED STATUS  HEALTH CHECKS      RESTARTS CREATED
be1f688e 17      lax    run     running 1 total, 1 passing 0        19m34s ago
6059bef1 17      sjc(B) run     running 1 total, 1 passing 0        20m13s ago

Lebrija:ensayo joshua$ fly regions list
Region Pool:
ams
lax
Backup Region:
fra
lhr
sea
sjc

Also, what would be the winning strategy for the LAX instance accessing Postgres? If I set up an LAX replica, will the default behavior be to read from the replica and write to the master in AMS? Forgive my laziness in not looking up how Stolon operates.

It’s probably best to disable backup regions for your app. Just run flyctl regions backup lax ams.

By default, you’ll connect to port 5432 on the nearest VM. This is actually a Stolon proxy that forwards you to the current cluster leader. So you’re effectively connecting to AMS from everywhere.

You’ll need to use port 5433 to query the replicas. You can also change the hostname in the connection string to <region>.<pg-app-name>.internal to connect to a specific region.
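
For example, assuming a Postgres app named ensayo-db and the FLY_REGION environment variable Fly sets on each VM (a Ruby sketch; the credentials and database name are placeholders):

# Writes: port 5432 on the nearest VM is the Stolon proxy to the current leader.
primary_url = "postgres://user:pass@ensayo-db.internal:5432/app_db"

# Reads: port 5433 on a region-local hostname connects directly to that region's member.
replica_url = "postgres://user:pass@#{ENV['FLY_REGION']}.ensayo-db.internal:5433/app_db"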


OK, thanks. Does this generally mean that any app using a database like Postgres will need some application-level configuration to stay performant? For example, sending reads to the local replicas while writes still go to the cluster leader.

This made me wonder about your Turboku feature, and if the same issue would arise with regional instances connecting back to Postgres.


That’s correct!

You’ll probably want to configure two pools / connection configurations: one that’s read-only (and fast) and one for writes (slower). It does require a bit more configuration, but it should improve performance a lot.
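
In Rails, for example, this could look roughly like the following (a sketch using Rails 6’s multiple-databases support; the primary_replica pool name and the REPLICA_DATABASE_URL variable are illustrative, not something Fly sets for you):

# config/database.yml would define the two pools, e.g.:
#
#   production:
#     primary:
#       url: <%= ENV["DATABASE_URL"] %>          # port 5432, proxied to the leader
#     primary_replica:
#       url: <%= ENV["REPLICA_DATABASE_URL"] %>  # <region>.<pg-app-name>.internal:5433
#       replica: true

# app/models/application_record.rb
class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true

  # Send writes to the leader pool and reads to the region-local replica pool.
  connects_to database: { writing: :primary, reading: :primary_replica }
end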


OK, sounds like fun!

Just to revisit one lingering question: the docs say:

If there are three regions in the pool and the count is set to six, there will be two app instances in each region.

Why then, with two regions set, did I see an instance in LAX and SJC instead of LAX and AMS (or one of its backups)?

OK, I got this all working and it’s pretty impressive! Rails makes it trivial to use read replicas, and benchmarks suggest that both regions are now equally performant on reads.
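
For anyone following along, the Rails side is roughly this (a sketch of the built-in database selector middleware from Rails 6, applied on top of the two pools above; the two-second delay is the Rails default):

# config/environments/production.rb
# GET/HEAD requests read from the :reading role; everything else uses
# :writing, and reads stick to the writer for 2s after a write.
config.active_record.database_selector = { delay: 2.seconds }
config.active_record.database_resolver = ActiveRecord::Middleware::DatabaseSelector::Resolver
config.active_record.database_resolver_context = ActiveRecord::Middleware::DatabaseSelector::Resolver::Session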

I did, however, run into this issue while trying to bring up a second replica in AMS:

2021-04-25T22:52:17.169Z 9f19dfe9 ams [info] keeper            | 2021-04-25 22:52:17.160 UTC [796] FATAL:  hot standby is not possible because max_worker_processes = 1 is a lower setting than on the master server (its value was 8)

It looked like some recent commits might address this, but deploying with the latest image did not resolve it.

Oh that’s a definite bug with the most recent PR we merged. We’ll look at that today.

This is a quirk of our backup regions. The scheduler doesn’t know how to differentiate between them; it just picks one at random. Since you’re running two regions, you’re way better off setting the backups to match the primary regions: fly regions backup lax ams

@jsierles did you scale the VM size down at any point?

Nope!

We located this bug; it should be fixed today.


@jsierles this should be fixed now. I see you’re running a custom image so I don’t want to change it, but this will get you updated: flyctl deploy -i flyio/postgres-ha:latest

edit: I forgot that flyctl deploy -i needs a fly.toml file… if you don’t already have one, run flyctl config save, then deploy.

Great, works!


Sometime today, I started seeing errors indicating that the DB reachable on port 5432 had gone read-only. I’m not sure exactly how to debug this, so I am now in the process of rolling my app back to using only the primary for reads. For now I’ve also removed the second region and have only one primary/replica pair in AMS. There, however, the replica seems to be running but is not up to date.

Can you help debug? App name is ensayo-db.

Whoops, I misread the port number! Disregard this one.

–before edit–
Were you connecting to ensayo-db.internal:5432? That hostname gives back all member IP addresses, and it is normal for port 5432 on replicas to respond read-only. If this started abruptly, it could just mean our DNS returned IPs in a different order.

You can check fly status to see which DB is currently primary. You can also run fly checks list to get a list of current check statuses; this will show you whether the replicas are up to date. You’re looking for something like this:

[✓] replication lag: 0s

Which address should I use to ensure I always get a leader?

Here’s what I see:

Instances
ID       VERSION REGION DESIRED STATUS            HEALTH CHECKS      RESTARTS CREATED
0448f13c 13      ams    run     running (replica) 3 total, 3 passing 0        22m20s ago
b7896061 13      ams    run     running (leader)  3 total, 3 passing 0        32m51s ago

Lebrija:postgres-ha joshua$ fly checks list
Health Checks for ensayo-db
NAME STATUS  ALLOCATION REGION TYPE   LAST UPDATED OUTPUT
vm   passing 0448f13c   ams    SCRIPT 2m53s ago    [✓] 9.17 GB (93.7%) free space
                                                   on /data/ [✓] load averages:
                                                   0.03 0.05 0.04 [✓] memory:
                                                   0.0s waiting over the last 60s
                                                   [✓] cpu: 1.7s waiting over the
                                                   last 60s [✓] io: 0.0s waiting
                                                   over the last 60s
pg   passing 0448f13c   ams    SCRIPT 13m11s ago   [✓] leader check:
                                                   [fdaa:0:22b7:a7b:aa0:0:1985:2]:5433
                                                   connected [✓] replication
                                                   lag: 0s [✓] proxy check:
                                                   [fdaa:0:22b7:a7b:aa3:0:1984:2]:5432
                                                   connected [✓] connections: 7 used,
                                                   3 reserved, 300 max
role passing 0448f13c   ams    SCRIPT 17m2s ago    replica
vm   passing b7896061   ams    SCRIPT 5m35s ago    [✓] 8.95 GB (91.5%) free space
                                                   on /data/ [✓] load averages:
                                                   0.02 0.07 0.06 [✓] memory:
                                                   0.0s waiting over the last 60s
                                                   [✓] cpu: 2.0s waiting over the
                                                   last 60s [✓] io: 0.0s waiting
                                                   over the last 60s
pg   passing b7896061   ams    SCRIPT 8m10s ago    [✓] replication: currently
                                                   leader [✓] proxy check:
                                                   [fdaa:0:22b7:a7b:aa0:0:1985:2]:5432
                                                   connected [✓] connections: 13 used,
                                                   3 reserved, 300 max
role passing b7896061   ams    SCRIPT 32m38s ago   leader

Ahhh! I misread that port number. Port 5432 should always proxy you to the leader, which is currently b7896061. Do you know what region you were connecting to when you got read-only errors from it?

All your health checks look good but it sounds like something went wonky before you removed your other replica. If you were connecting to port 5432 from another region, it might have just been a temporary network blip keeping it from talking to the primary.

Port 5433 is direct to Postgres, my bad!

I was previously connecting to a specific region’s port 5433 to get the replica for reads only, like this:

ENV['DATABASE_URL'].gsub("ensayo-db.internal:5432", "#{ENV['FLY_REGION']}.ensayo-db.internal:5433")

Right now, even in AMS, I’m seeing this issue, so I’m reverting the app to read/write from the master. Let’s see how that goes.

OK - now things are back in business. I’m not sure what happened, but here was the order of events:

AMS ran the primary, LAX a replica. App instances in each region connected to their local replica for reads and to port 5432 for writes. This was working fine.

Sometime today, something caused a permanent failure leading to write errors on the master, but I’m guessing this was happening at the application level, as I was able to use a console to make DB updates. I’ll try setting up a staging environment to test this behavior again.

That’s super weird, those health checks are showing that they can connect to 5432 and write just fine! Can you show me the exact error you got? I wonder if you’re somehow reaching a stale IP address.

ActiveRecord::ReadOnlyError (Write query attempted while in readonly mode: UPDATE "people" SET "updated_at" = $1...

So this is a Rails app using the built-in Rails support for splitting reads and writes. It’s possible only certain code paths were affected and I just didn’t notice until now, when they were triggered. So I’ll investigate in a staging env and report back!

Just out of curiosity, what does this log entry mean?

2021-04-30T22:21:48.354Z b7896061 ams [info] sentinel | 2021-04-30T22:21:48.348Z WARN cmd/sentinel.go:276 no keeper info available {"db": "873b4bd0", "keeper": "fdaa022b7a7b8501a202"}