fly_postgres questions

fedeotaran · October 26, 2021, 4:43pm

Hello! I was playing a little with the fly_postgres library (now fly_postgres_elixir). We are very excited to be able to use this feature.

I’m not sure I’m asking in the right place, so if I should move the question somewhere else, just let me know please!

I configured the application following the documentation.

And deployed my apps on Fly:

-- Elixir application
❯ fly status -a bk-app-cluster-test
App
  Name     = bk-app-cluster-test
  Owner    = brandkit
  Version  = 11
  Status   = running
  Hostname = bk-app-cluster-test.fly.dev

Instances
ID       PROCESS VERSION REGION DESIRED STATUS  HEALTH CHECKS      RESTARTS CREATED
1677433e app     11      scl    run     running 1 total, 1 passing 0        20h27m ago
74a4064c app     11      iad    run     running 1 total, 1 passing 0        20h28m ago
cca31c56 app     11      syd    run     running 1 total, 1 passing 0        20h29m ago

-- Postgres Cluster
❯ fly status -a bk-db-cluster-test
App
  Name     = bk-db-cluster-test
  Owner    = brandkit
  Version  = 4
  Status   = running
  Hostname = bk-db-cluster-test.fly.dev

Instances
ID       PROCESS VERSION REGION DESIRED STATUS            HEALTH CHECKS      RESTARTS CREATED
caa35cbb app     4       syd    run     running (replica) 3 total, 3 passing 0        20h57m ago
83f09243 app     4       scl    run     running (replica) 3 total, 3 passing 0        20h58m ago
25b768de app     4       iad    run     running (leader)  3 total, 3 passing 0        21h0m ago
888e45c1 app     4       iad    run     running (replica) 3 total, 3 passing 0        21h1m ago

The application appears to be slower than the single instance option. I suspect I have something misconfigured.

When I perform write operations, the logs appear duplicated across both instances. These are the logs:

app[1677433e] scl [info] 16:28:47.778 request_id=FrGhg7KVcTMgUd8AAYiR [debug] QUERY OK db=131.9ms idle=1365.0ms
app[1677433e] scl [info] begin []
app[74a4064c] iad [info] 16:28:48.712 [debug] QUERY OK db=0.3ms idle=1743.5ms
app[74a4064c] iad [info] begin []
app[1677433e] scl [info] 16:28:48.712 [debug] QUERY OK db=0.3ms idle=1743.5ms
app[1677433e] scl [info] begin []
app[74a4064c] iad [info] 16:28:48.715 [debug] QUERY OK db=2.1ms
app[74a4064c] iad [info] INSERT INTO "collections" ...
app[74a4064c] iad [info] 16:28:48.716 [debug] QUERY OK db=0.7ms
app[74a4064c] iad [info] INSERT INTO "asset_collections" ...
app[1677433e] scl [info] 16:28:48.715 [debug] QUERY OK db=2.1ms
app[74a4064c] iad [info] 16:28:48.718 [debug] QUERY OK db=0.9ms
app[74a4064c] iad [info] commit []
app[1677433e] scl [info] INSERT INTO "collections" ...
app[1677433e] scl [info] 16:28:48.716 [debug] QUERY OK db=0.7ms
app[1677433e] scl [info] INSERT INTO "asset_collections" ....
app[74a4064c] iad [info] 16:28:48.719 [debug] QUERY OK db=0.3ms queue=0.5ms idle=1749.3ms
app[74a4064c] iad [info] select CAST(pg_current_wal_insert_lsn() AS TEXT) []
app[1677433e] scl [info] 16:28:48.718 [debug] QUERY OK db=0.9ms
app[1677433e] scl [info] commit []
app[1677433e] scl [info] 16:28:48.719 [debug] QUERY OK db=0.3ms queue=0.5ms idle=1749.3ms
app[1677433e] scl [info] select CAST(pg_current_wal_insert_lsn() AS TEXT) []
app[1677433e] scl [info] 16:28:48.182 request_id=FrGhg7KVcTMgUd8AAYiR [debug] QUERY OK source="assets" db=134.9ms

Maybe @Mark can help us?
Do you have any tips for checking that everything is set up properly?

sudhir.j · October 26, 2021, 5:14pm

This is the right place to ask the question. The setup seems ok, but regarding the duplication on the logs, do you see the data in the database duplicated as well? Are you actually seeing double the number of insertions into the collections and asset_collections tables than you expect?

fedeotaran · October 26, 2021, 5:23pm

Thanks for the quick answer!

No, the data is not duplicated.

The only strange thing I notice (in addition to the logs) is that queries, both reads and writes, take much longer than when the cluster is not configured.

sudhir.j · October 26, 2021, 5:54pm

Yeah, that tends to happen if your request lands in a replica region and the write then has to be forwarded to the primary region. The tradeoff is that reads are much faster because they happen in a region close to you, but writes can be slower because they’re attempted in a region close to you, fail, and are then re-attempted in the primary region.

sudhir.j · October 26, 2021, 6:03pm

Ah, actually I’m talking about the Rails gem. The Elixir library does a direct RPC call to the primary region to do writes. Which makes the duplicated logs all the more confusing. Let me get a second opinion from @Mark on this.

fedeotaran · October 26, 2021, 6:38pm

But in this case reads are slower too.

Let me show you.
We also have the same application deployed as a single app without a Postgres cluster.

I ran the same requests against both applications from Argentina with k6:

❯ HOSTNAME=main.brandkitapp.com k6 run test_list_assets.js
...
http_req_duration..............: avg=2.95s    min=174.71ms med=358.9ms  max=27.13s p(90)=11.91s   p(95)=17.66s
...

❯ HOSTNAME=bk-app-cluster-test.fly.dev k6 run test_list_assets.js
...
http_req_duration..............: avg=10.94s   min=172.71ms med=3.28s   max=38.94s   p(90)=29.39s   p(95)=31.41s
...

Thanks for the help @sudhir.j!

Mark · October 26, 2021, 6:41pm

@fedeotaran I realized there’s a bug in the fly_postgres library with building the DB connection URL. It isn’t specific enough about the region, so, due to internal DNS, it would sometimes resolve to the local/fast one and sometimes to a distant one. I started seeing a similar issue. I’ll have an update soon.

fedeotaran · October 26, 2021, 6:59pm

Thanks @Mark, let’s wait for the updates then!

Mark · October 26, 2021, 7:46pm

@fedeotaran and @sudhir.j

fly_postgres 1.8 was released. No code changes required (for your app).

This fixed a bug where the database URL used for the primary connection wasn’t being explicit about which region to connect to. This could result in slow connections to the primary DB.

Let me know if this helps!

Also, thanks for trying out the library and the great bug report!
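
To illustrate the kind of fix described above: Fly's internal DNS accepts a region code prefixed to an app's .internal hostname, which is one way to pin a connection URL to a specific region. The following is only a minimal sketch of that idea and not the library's actual code; the helper name region_db_url and the sample URL in the comment are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of fly_postgres): pin a Postgres URL to one
# region by prefixing the host with "<region>.".
# Assumes the URL has the form scheme://user:pass@host:port/db and that any
# "@" inside the credentials is percent-encoded.
region_db_url() {
  url=$1
  region=$2
  # Insert "<region>." right after the first "@", which precedes the host.
  printf '%s\n' "$url" | sed "s|@|@${region}.|"
}

# Example with a made-up URL:
#   region_db_url 'postgres://user:secret@my-db.internal:5432/app' iad
```

With an unqualified host, a lookup may return an instance in any region, which matches the "sometimes fast, sometimes distant" behavior described earlier in the thread.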

fedeotaran · October 27, 2021, 5:09pm

Thanks @Mark!

I tested the new fly_postgres release but I have the same issue.

We still see duplicate logs for write operations in the local and primary regions:

app[6454ec21] scl [info] QUERY OK db=129.1ms queue=0.1ms idle=1061.1ms
app[6454ec21] scl [info] begin []
app[498ccc25] iad [info] QUERY OK db=1.6ms queue=0.9ms idle=1585.2ms
app[498ccc25] iad [info] INSERT INTO "uploads" ...
app[498ccc25] iad [info] QUERY OK db=0.4ms queue=0.7ms idle=1000.9ms
app[498ccc25] iad [info] select CAST(pg_current_wal_insert_lsn() AS TEXT) []
app[6454ec21] scl [info] QUERY OK db=1.6ms queue=0.9ms idle=1585.2ms
app[6454ec21] scl [info] INSERT INTO "uploads" ...
app[6454ec21] scl [info] QUERY OK db=0.4ms queue=0.7ms idle=1000.9ms
app[6454ec21] scl [info] select CAST(pg_current_wal_insert_lsn() AS TEXT) []

These requests are made from Argentina.

Is this behavior correct?

We are also doing some load tests. When we finish and analyze the results, we will share them with you.
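
In the excerpt above, the mirrored lines carry identical timings on both instances (for example db=1.6ms queue=0.9ms idle=1585.2ms), which would be consistent with one query whose log line is printed by two instances rather than a query executed twice; the thread does not confirm the cause. A rough way to list such mirrored lines from a saved log file is sketched below. The function name find_mirrored_lines and the file name in the usage comment are hypothetical, and the sketch assumes every line starts with "app[id] region [level] " as in the excerpts here.

```shell
#!/bin/sh
# Hypothetical helper: print each log payload that appears under more than
# one instance ID. Reads log lines on stdin.
find_mirrored_lines() {
  awk '
    {
      instance = $1                  # first field, e.g. app[498ccc25]
      payload = $0
      # Drop the "app[id] region [level] " prefix to compare payloads only.
      sub(/^[^ ]+ [^ ]+ \[[a-z]+\] /, "", payload)
      # Count each payload once per instance.
      if (!((payload, instance) in seen)) {
        seen[payload, instance] = 1
        count[payload]++
      }
    }
    END {
      for (p in count) if (count[p] > 1) print p
    }
  ' | sort
}

# Usage with a saved log file (file name is an example):
#   find_mirrored_lines < app.log
```

Lines printed by only one instance, such as the initial begin [] on scl in the excerpt above, are not reported.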

Mark · October 27, 2021, 7:43pm

Separate from the duplicate logs question, are the insert/update times resolved?

fedeotaran · October 27, 2021, 7:52pm

Hi, I think I found the problem here.
We are using releases, and don’t have MIX_ENV set.
I tried to add it via the fly CLI, but it won’t let me.

I think you would have to modify the fly_postgres library so that it does not check MIX_ENV.
Can you confirm?

fedeotaran · October 27, 2021, 7:55pm

I added the env var to my Dockerfile for testing.
Now I can’t run migrations because it’s trying to use the replica URL.

Mark · October 27, 2021, 9:18pm

There are a few different issues we’re talking about. For now I’m going to focus on the migrations problem.

The migrations are only run in the primary region. Here are a few things to check for:

You set PRIMARY_REGION as an ENV
Your primary or leader database is running in that same region
Disable other backup regions for your app.

For the last one, this is what I mean. Run this command:

fly regions backup list

It lists the regions that your application will use as a backup region if there is a problem getting into your top picks. The problem here is that you probably don’t have your database running in that backup region! So it will think it needs to connect to the replica DB.

The migrations are run on a copy of your app that probably came up in a backup region. This happens sometimes as part of a normal deploy… it wouldn’t stay running in the backup region, but it might try to deploy and run the migrations from there!

Fortunately, you can turn off the backup regions. So for me, if my app is deployed to lax and syd, I’ll set the backup regions like this…

fly regions backup lax syd

Then running your migrations should work correctly! Let me know if that fixes your migrations issue.

nicanorperera · October 27, 2021, 9:24pm

Hi, @Mark
Thanks for your response.
I’m @fedeotaran’s coworker.
When running the command I get:

fly regions backup list -a bk-app-cluster-test

Region Pool:
iad
scl
syd
Backup Region:

fedeotaran · October 27, 2021, 9:34pm

. PRIMARY_REGION is set in the Dockerfile, and I can see it on the nodes.
. App and database regions are the same:

❯ fly status -a bk-app-cluster-test
App
  Name     = bk-app-cluster-test
  Owner    = brandkit
  Version  = 18
  Status   = running
  Hostname = bk-app-cluster-test.fly.dev

Deployment Status
  ID          = 441aa04f-e4f3-649e-8002-4278667d2d76
  Version     = v18
  Status      = failed
  Description = Failed due to unhealthy allocations - not rolling back to stable job version 18 as current job has same specification
  Instances   = 3 desired, 1 placed, 0 healthy, 1 unhealthy

Instances
ID       PROCESS VERSION REGION DESIRED STATUS  HEALTH CHECKS      RESTARTS CREATED
e8f84de8 app     15      scl    run     running 1 total, 1 passing 0        2h48m ago
3bda52cd app     15      iad    run     running 1 total, 1 passing 0        2h48m ago

❯ fly status -a bk-db-cluster-test
App
  Name     = bk-db-cluster-test
  Owner    = brandkit
  Version  = 4
  Status   = running
  Hostname = bk-db-cluster-test.fly.dev

Instances
ID       PROCESS VERSION REGION DESIRED STATUS            HEALTH CHECKS      RESTARTS CREATED
caa35cbb app     4       syd    run     running (replica) 3 total, 3 passing 0        2021-10-25T17:19:38Z
83f09243 app     4       scl    run     running (replica) 3 total, 3 passing 0        2021-10-25T17:18:24Z
25b768de app     4       iad    run     running (leader)  3 total, 3 passing 0        2021-10-25T17:16:53Z
888e45c1 app     4       iad    run     running (replica) 3 total, 3 passing 0        2021-10-25T17:16:01Z

. We don’t have backup regions, as @nicanorperera says.

Mark · October 27, 2021, 9:35pm

Cool! That’s good! Just means it’s not going to deploy your apps to a region other than the ones you have specified. That might have been updated for everyone by default already.

I can see from @fedeotaran’s first post that the app and DBs are running in the same regions too. Good!

You can try this to SSH into your app and verify that the ENV is set as expected.

fly ssh console

echo $PRIMARY_REGION

That should return iad since that’s where your DB leader is.

nicanorperera · October 27, 2021, 9:36pm

Yes, exactly.

fly ssh console --app bk-app-cluster-test
Connecting to bk-app-cluster-test.internal... complete
/ # echo $PRIMARY_REGION
iad

Mark · October 27, 2021, 10:00pm

@fedeotaran @nicanorperera

You’ve checked all the boxes! I do have a question about the MIX_ENV set in the Dockerfile.

Can you check that the Dockerfile sets ENV MIX_ENV=prod in the 2nd stage of the deploy? So it should appear 2 times in the Dockerfile. Once for building the release and once just to be present to tell the release, “Hey, you were built using ‘prod’”.

The only other thing I can think of is seeing what these two commands return from inside your app.

Fly.my_region()
Fly.primary_region()

To get an IEx shell to your primary, you can do this:

fly ssh console --app bk-app-cluster-test --select

Then select the option for iad.

Then get an IEx terminal. The exact command is specific to your app and release, but it is something like: app/bin/my_app remote

From within IEx on a node in the primary region, what do those Fly commands return?

This code says “if the primary and the current are the same, then we’re on the primary.” But the logs you showed previously when the migration failed said “Replica DB connection - Using replica”. So I’m confused, since the fly.toml release_command should only be run on the primary.
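
The comparison described above can be approximated from a shell on the instance. This is a minimal sketch, not the library's code: it assumes the current region is exposed as FLY_REGION and the primary as PRIMARY_REGION, and the function name is_primary_region is hypothetical.

```shell
#!/bin/sh
# Hypothetical check mirroring "if the primary and the current region are
# the same, then we are on the primary".
# Arguments are optional and default to FLY_REGION and PRIMARY_REGION.
is_primary_region() {
  current=${1:-${FLY_REGION:-}}
  primary=${2:-${PRIMARY_REGION:-}}
  # An unset or empty primary region can never match.
  [ -n "$primary" ] && [ "$current" = "$primary" ]
}

if is_primary_region; then
  echo "primary: writes and migrations can run here"
else
  echo "replica: writes should go to ${PRIMARY_REGION:-<unset>}"
fi
```

If this prints the replica message on an instance that should be the primary, one of the two variables is probably missing or different from what is expected in that environment.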

fedeotaran · October 27, 2021, 10:06pm

Yes! The log is strange because everything seems to be fine.

Command output:

Dockerfile, 2nd stage:
