fly_postgres questions

Hello! I’ve been playing a little with the fly_postgres library (now fly_postgres_elixir). We are very excited to be able to use this feature :slight_smile:

I’m not sure I’m asking in the right place, so if I should move this question somewhere else, please let me know! :smiley:

I configured the application following the documentation.

And deployed my apps on Fly:

-- Elixir application
❯ fly status -a bk-app-cluster-test
App
  Name     = bk-app-cluster-test
  Owner    = brandkit
  Version  = 11
  Status   = running
  Hostname = bk-app-cluster-test.fly.dev

Instances
ID       PROCESS VERSION REGION DESIRED STATUS  HEALTH CHECKS      RESTARTS CREATED
1677433e app     11      scl    run     running 1 total, 1 passing 0        20h27m ago
74a4064c app     11      iad    run     running 1 total, 1 passing 0        20h28m ago
cca31c56 app     11      syd    run     running 1 total, 1 passing 0        20h29m ago
-- Postgres Cluster
❯ fly status -a bk-db-cluster-test
App
  Name     = bk-db-cluster-test
  Owner    = brandkit
  Version  = 4
  Status   = running
  Hostname = bk-db-cluster-test.fly.dev

Instances
ID       PROCESS VERSION REGION DESIRED STATUS            HEALTH CHECKS      RESTARTS CREATED
caa35cbb app     4       syd    run     running (replica) 3 total, 3 passing 0        20h57m ago
83f09243 app     4       scl    run     running (replica) 3 total, 3 passing 0        20h58m ago
25b768de app     4       iad    run     running (leader)  3 total, 3 passing 0        21h0m ago
888e45c1 app     4       iad    run     running (replica) 3 total, 3 passing 0        21h1m ago

The application appears to be slower than the single-instance setup. :confused: I suspect I have something misconfigured.

When I do write operations, the log lines appear duplicated across both instances. These are the logs:

app[1677433e] scl [info] 16:28:47.778 request_id=FrGhg7KVcTMgUd8AAYiR [debug] QUERY OK db=131.9ms idle=1365.0ms
app[1677433e] scl [info] begin []
app[74a4064c] iad [info] 16:28:48.712 [debug] QUERY OK db=0.3ms idle=1743.5ms
app[74a4064c] iad [info] begin []
app[1677433e] scl [info] 16:28:48.712 [debug] QUERY OK db=0.3ms idle=1743.5ms
app[1677433e] scl [info] begin []
app[74a4064c] iad [info] 16:28:48.715 [debug] QUERY OK db=2.1ms
app[74a4064c] iad [info] INSERT INTO "collections" ...
app[74a4064c] iad [info] 16:28:48.716 [debug] QUERY OK db=0.7ms
app[74a4064c] iad [info] INSERT INTO "asset_collections" ...
app[1677433e] scl [info] 16:28:48.715 [debug] QUERY OK db=2.1ms
app[74a4064c] iad [info] 16:28:48.718 [debug] QUERY OK db=0.9ms
app[74a4064c] iad [info] commit []
app[1677433e] scl [info] INSERT INTO "collections" ...
app[1677433e] scl [info] 16:28:48.716 [debug] QUERY OK db=0.7ms
app[1677433e] scl [info] INSERT INTO "asset_collections" ....
app[74a4064c] iad [info] 16:28:48.719 [debug] QUERY OK db=0.3ms queue=0.5ms idle=1749.3ms
app[74a4064c] iad [info] select CAST(pg_current_wal_insert_lsn() AS TEXT) []
app[1677433e] scl [info] 16:28:48.718 [debug] QUERY OK db=0.9ms
app[1677433e] scl [info] commit []
app[1677433e] scl [info] 16:28:48.719 [debug] QUERY OK db=0.3ms queue=0.5ms idle=1749.3ms
app[1677433e] scl [info] select CAST(pg_current_wal_insert_lsn() AS TEXT) []
app[1677433e] scl [info] 16:28:48.182 request_id=FrGhg7KVcTMgUd8AAYiR [debug] QUERY OK source="assets" db=134.9ms

Maybe @Mark can help us?
Is there any way to check that everything is set up properly?


This is the right place to ask the question. The setup seems OK, but regarding the duplication in the logs: do you see the data duplicated in the database as well? Are you actually seeing twice as many rows inserted into the collections and asset_collections tables as you expect?

Thanks for the quick answer! :slight_smile:

No, the data is not duplicated.

The only other strange thing I notice (besides the logs) is that queries take much longer than when I don’t have the cluster configured (both reads and writes).

Yeah, that tends to happen if your request lands in a replica region and the write then has to be forwarded to the primary region. The tradeoff is that reads are much faster because they happen in a region close to you, but writes can be slower because they’re first attempted in that nearby region, fail, and are then re-attempted in the primary region.

Ah, actually I was describing the Rails gem. The Elixir library does a direct RPC call to the primary region for writes, which makes the duplicated logs all the more confusing. Let me get a second opinion from @Mark on this.
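For context, the RPC approach means a write issued on a non-primary node gets executed by a node in the primary region, and only the result travels back. A minimal sketch of the idea, using OTP’s :erpc rather than fly_postgres’s actual internals (MyApp.Repo and find_primary_node/0 are hypothetical stand-ins):

defmodule WriteForwarding do
  alias MyApp.Repo

  # Sketch only -- not the library's implementation.
  def insert_on_primary(changeset) do
    if Fly.my_region() == Fly.primary_region() do
      # Already in the primary region; write directly.
      Repo.insert(changeset)
    else
      # Execute the write on a clustered node in the primary region.
      :erpc.call(find_primary_node(), Repo, :insert, [changeset])
    end
  end

  # Hypothetical stand-in: select a connected node running in Fly.primary_region().
  defp find_primary_node, do: raise("stand-in for node selection logic")
end

The select CAST(pg_current_wal_insert_lsn() AS TEXT) lines in your logs fit that model: after a write on the primary, the library appears to record the WAL position, presumably so replicas can be checked for catch-up.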


But in this case reads are slower too :confused:.

Let me show you.
We also have the same application configured as a single app without a Postgres cluster.

I ran the same requests against both applications from Argentina with k6:

❯ HOSTNAME=main.brandkitapp.com k6 run test_list_assets.js
...
http_req_duration..............: avg=2.95s    min=174.71ms med=358.9ms  max=27.13s p(90)=11.91s   p(95)=17.66s
...
❯ HOSTNAME=bk-app-cluster-test.fly.dev k6 run test_list_assets.js
...
http_req_duration..............: avg=10.94s   min=172.71ms med=3.28s   max=38.94s   p(90)=29.39s   p(95)=31.41s
...

Thanks for the help @sudhir.j!

@fedeotaran I realized there’s a bug in the fly_postgres library around building the DB connection URL. It isn’t specific enough about the region, and due to internal DNS it would sometimes resolve to the local/fast instance and sometimes to a distant one. I started seeing a similar issue. I’ll have an update soon.


Thanks @Mark, we’ll wait for the update then! :smiley:

@fedeotaran and @sudhir.j

fly_postgres 1.8 was released. No code changes are required in your app.

This fixes a bug where the database URL used for the primary connection wasn’t explicit about which region to connect to. This could result in slow connections to the primary DB.
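Some background on why the region matters: on Fly’s internal DNS, a name like bk-db-cluster-test.internal can resolve to any instance of the database app, while iad.bk-db-cluster-test.internal resolves only to instances in the iad region. The fix amounts to building the primary connection URL with the region prefix, roughly like this sketch (a hypothetical helper, not the library’s exact code; the port and database name are illustrative):

defmodule PrimaryUrl do
  # Pin the primary DB hostname to the primary region, e.g.
  # build("iad", "bk-db-cluster-test", ...) yields a host of
  # "iad.bk-db-cluster-test.internal" instead of the ambiguous
  # "bk-db-cluster-test.internal".
  def build(primary_region, db_app, user, pass, db_name) do
    "postgres://#{user}:#{pass}@#{primary_region}.#{db_app}.internal:5432/#{db_name}"
  end
end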

Let me know if this helps!

Also, thanks for trying out the library and for the great bug report!


Thanks @Mark!

I tested the new fly_postgres release, but I have the same issue.

We still see duplicated logs for write operations in both the local and primary regions:

app[6454ec21] scl [info] QUERY OK db=129.1ms queue=0.1ms idle=1061.1ms
app[6454ec21] scl [info] begin []
app[498ccc25] iad [info] QUERY OK db=1.6ms queue=0.9ms idle=1585.2ms
app[498ccc25] iad [info] INSERT INTO "uploads" ...
app[498ccc25] iad [info] QUERY OK db=0.4ms queue=0.7ms idle=1000.9ms
app[498ccc25] iad [info] select CAST(pg_current_wal_insert_lsn() AS TEXT) []
app[6454ec21] scl [info] QUERY OK db=1.6ms queue=0.9ms idle=1585.2ms
app[6454ec21] scl [info] INSERT INTO "uploads" ...
app[6454ec21] scl [info] QUERY OK db=0.4ms queue=0.7ms idle=1000.9ms
app[6454ec21] scl [info] select CAST(pg_current_wal_insert_lsn() AS TEXT) []

These requests were made from Argentina.

Is this behavior correct?

We are also running some load tests; when we finish analyzing the results, we’ll share them with you :slight_smile:


Separate from the duplicated-logs question, has the insert/update slowness been resolved?


Hi :smiley:, I think I found the problem here.
We are using releases and don’t have MIX_ENV set.
I tried to add it via the Fly CLI, but it won’t let me.

I think the fly_postgres library would have to be modified to not check MIX_ENV.
Can you confirm?

I added the env in my Dockerfile for testing.
Now I can’t run migrations because it’s trying to use the replica URL.

There are a few different issues we’re talking about here. For now, I’m going to focus on the migrations problem.

The migrations are only run in the primary region. Here are a few things to check:

  • PRIMARY_REGION is set as an ENV.
  • Your primary (leader) database is running in that same region.
  • Backup regions are disabled for your app.

For the last one, this is what I mean. Run this command:

fly regions backup list

It lists the regions that your application will use as backups if there is a problem placing instances in your top picks. The catch is that you probably don’t have your database running in a backup region, so an app instance there will think it needs to connect to the replica DB.

The migrations are run on a copy of your app that probably came up in a backup region. This happens sometimes as part of a normal deploy… it wouldn’t stay running in the backup region, but it might try to deploy and run the migrations from there!

Fortunately, you can turn off the backup regions. So for me, if my app is deployed to lax and syd, I’ll set the backup regions like this…

fly regions backup lax syd

Then running your migrations should work correctly! Let me know if that fixes your migrations issue.
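For reference, a release migration task that should only do work in the primary region can guard itself on the region ENV vars. Here’s a sketch assuming the PRIMARY_REGION var you set plus Fly’s built-in FLY_REGION (MyApp.Release and :my_app are placeholders for your app):

defmodule MyApp.Release do
  @app :my_app

  # Run Ecto migrations, but only on an instance in the primary region.
  def migrate do
    if System.get_env("FLY_REGION") == System.get_env("PRIMARY_REGION") do
      Application.load(@app)

      for repo <- Application.fetch_env!(@app, :ecto_repos) do
        {:ok, _, _} = Ecto.Migrator.with_repo(repo, &Ecto.Migrator.run(&1, :up, all: true))
      end
    else
      IO.puts("Not in the primary region; skipping migrations.")
    end
  end
end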


Hi @Mark,
Thanks for your response.
I’m @fedeotaran’s coworker.
When running the command I get:

fly regions backup list -a bk-app-cluster-test

Region Pool:
iad
scl
syd
Backup Region:

:white_check_mark: The PRIMARY_REGION is set in the Dockerfile and I can see it on the nodes.
:white_check_mark: The app and database regions are the same:

❯ fly status -a bk-app-cluster-test
App
  Name     = bk-app-cluster-test
  Owner    = brandkit
  Version  = 18
  Status   = running
  Hostname = bk-app-cluster-test.fly.dev

Deployment Status
  ID          = 441aa04f-e4f3-649e-8002-4278667d2d76                                                                            
  Version     = v18                                                                                                             
  Status      = failed                                                                                                          
  Description = Failed due to unhealthy allocations - not rolling back to stable job version 18 as current job has same specification
  Instances   = 3 desired, 1 placed, 0 healthy, 1 unhealthy                                                                     

Instances
ID       PROCESS VERSION REGION DESIRED STATUS  HEALTH CHECKS      RESTARTS CREATED
e8f84de8 app     15      scl    run     running 1 total, 1 passing 0        2h48m ago
3bda52cd app     15      iad    run     running 1 total, 1 passing 0        2h48m ago
❯ fly status -a bk-db-cluster-test
App
  Name     = bk-db-cluster-test
  Owner    = brandkit
  Version  = 4
  Status   = running
  Hostname = bk-db-cluster-test.fly.dev

Instances
ID       PROCESS VERSION REGION DESIRED STATUS            HEALTH CHECKS      RESTARTS CREATED
caa35cbb app     4       syd    run     running (replica) 3 total, 3 passing 0        2021-10-25T17:19:38Z
83f09243 app     4       scl    run     running (replica) 3 total, 3 passing 0        2021-10-25T17:18:24Z
25b768de app     4       iad    run     running (leader)  3 total, 3 passing 0        2021-10-25T17:16:53Z
888e45c1 app     4       iad    run     running (replica) 3 total, 3 passing 0        2021-10-25T17:16:01Z

:white_check_mark: We don’t have backup regions, as @nicanorperera says.

Cool! That’s good! It just means Fly isn’t going to deploy your app to a region other than the ones you’ve specified. That might have been updated for everyone by default already. :smile:

I can see from @fedeotaran’s first post that the app and DBs are running in the same regions too. Good!

You can try this to SSH into your app and verify that the ENV is set as expected.

fly ssh console

echo $PRIMARY_REGION

That should return iad since that’s where your DB leader is.


Yes, exactly.

fly ssh console --app bk-app-cluster-test
Connecting to bk-app-cluster-test.internal... complete
/ # echo $PRIMARY_REGION
iad

@fedeotaran @nicanorperera

You’ve checked all the boxes! I do have a question about the MIX_ENV set in the Dockerfile.

Can you check that the Dockerfile sets ENV MIX_ENV=prod in the 2nd stage of the build? It should appear twice in the Dockerfile: once for building the release, and once in the runtime stage just to tell the release, “Hey, you were built using ‘prod’”.
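Abbreviated, this hypothetical two-stage Dockerfile shows what I mean (image tags and paths are illustrative):

# --- build stage ---
FROM elixir:1.12-alpine AS build
ENV MIX_ENV=prod
WORKDIR /app
COPY . .
# ... fetch deps, compile, then `mix release` ...

# --- runtime stage ---
FROM alpine:3.14 AS app
WORKDIR /app
# Set again here so the running release also sees MIX_ENV=prod.
ENV MIX_ENV=prod
COPY --from=build /app/_build/prod/rel/my_app ./
CMD ["bin/my_app", "start"]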

The only other thing I can think of is seeing what these two commands return from inside your app.

Fly.my_region()
Fly.primary_region()

To get an IEx shell to your primary, you can do this:

fly ssh console --app bk-app-cluster-test --select

Then select the option for iad.

Then get an IEx terminal. This is specific to your app and release, but it’s something like app/bin/my_app remote.

From within IEx on a node in the primary region, what do those Fly commands return?

This code says "if the primary and the current are the same, then we’re on the primary. But the logs you showed previously when the migration failed said “Replica DB connection - Using replica”. So I’m confused. Since the fly.toml release_command should only be run on the primary.

Yes! The log is strange, because everything seems to be fine :thinking:

Command output:

Dockerfile, 2nd stage: