fly_postgres questions

Oh!

I think you need to remove the ENV PRIMARY_REGION=iad line from your Dockerfile and put it in your fly.toml file instead:

[env]
  PRIMARY_REGION = "iad"

That way things outside the Dockerfile know what your primary region is. Can you try that?

Ohhh!

Yes! Migrations ran successfully! But we have an error with the syd instance.
It seems to be a totally different problem. Right?

Command: /app/bin/brandkit eval Brandkit.Release.migrate
         Starting instance
         Configuring virtual machine
         Pulling container image
         Unpacking image
         Preparing kernel init
         Configuring firecracker
         Starting virtual machine
         2021/10/27 22:13:08 listening on [fdaa:0:332b:a7b:21e0:8630:1bfc:2]:22 (DNS: [fdaa::3]:53)
         22:13:08.848 [info] Primary DB connection - Running in primary region
         22:13:08.893 [info] Primary DB connection - Running in primary region
         22:13:08.922 [info] Primary DB connection - Running in primary region
         22:13:08.930 [info] Primary DB connection - Running in primary region
         Reaped child process with pid: 576 and signal: SIGUSR1, core dumped? false
         22:13:10.076 [info] Migrations already up
         Main child exited normally with code: 0
         Reaped child process with pid: 578 and signal: SIGUSR1, core dumped? false
         Starting clean up.
Monitoring Deployment

3 desired, 1 placed, 0 healthy, 1 unhealthy [restarts: 2] [health checks: 1 total, 1 critical]
v19 failed - Failed due to unhealthy allocations - not rolling back to stable job version 19 as current job has same specification
Failed Instances

==> Failure #1

Instance
  ID            = a9c1e82b
  Process       =
  Version       = 19
  Region        = syd
  Desired       = run
  Status        = running
  Health Checks = 1 total, 1 critical
  Restarts      = 2
  Created       = 1m12s ago

Recent Events
TIMESTAMP            TYPE       MESSAGE
2021-10-27T22:13:22Z Received   Task received by client
2021-10-27T22:13:22Z Task Setup Building Task Directory
2021-10-27T22:13:36Z Started    Task started by client
2021-10-27T22:13:52Z Terminated Exit Code: 1
2021-10-27T22:13:52Z Restarting Task restarting in 1.197399231s
2021-10-27T22:14:00Z Started    Task started by client
2021-10-27T22:14:16Z Terminated Exit Code: 1
2021-10-27T22:14:16Z Restarting Task restarting in 1.226825213s
2021-10-27T22:14:24Z Started    Task started by client

Recent Logs
***v19 failed - Failed due to unhealthy allocations - not rolling back to stable job version 19 as current job has same specification and deploying as v20

Troubleshooting guide at https://fly.io/docs/getting-started/troubleshooting/

Maybe we can disable this region for now.

It looks like the migrations had already run. In any case, I’m glad to see the “Primary DB connection - Running in primary region” messages.

Is it just the syd region with problems? You can run fly logs to see what happened there. You can also request logs for a specific region if that helps:

fly logs -r syd

When debugging, I also often open a second terminal and run fly status to watch the different instances roll out.

It’s a problem with Oban! I remember you mentioned something about Oban setup in your talk.
I will try to configure it right!

2021-10-27T22:27:48.041 app[8bd6543d] syd [info] State: %Oban.PostgresNotifier.State{circuit: :enabled, conf: %Oban.Config{circuit_backoff: 30000, dispatch_cooldown: 5, engine: Oban.Queue.BasicEngine, get_dynamic_repo: nil, log: false, name: Oban, node: "bk-app-cluster-test@fdaa:0:332b:a7b:2983:8bd6:543d:2", notifier: Oban.PostgresNotifier, plugins: [{Oban.Plugins.Cron, [crontab: [{"@daily", Brandkit.Workers.InstagramWorker}, {"@daily", Brandkit.Workers.NotifyAdminsAboutExpiringAssetsWorker}, {"@daily", Brandkit.Workers.NotifyUsersAboutExpiringAssetsWorker}]]}, Oban.Plugins.Pruner, Oban.Plugins.Stager], prefix: "public", queues: [default: [limit: 10]], repo: Brandkit.Repo.Local, shutdown_grace_period: 15000}, conn: nil, listeners: %{}, name: {:via, Registry, {Oban.Registry, {Oban, Oban.Notifier}}}, reset_timer: nil}
2021-10-27T22:27:48.041 app[8bd6543d] syd [info] 22:27:48.037 [error] GenServer {Oban.Registry, {Oban, Oban.Midwife}} terminating
2021-10-27T22:27:48.041 app[8bd6543d] syd [info] ** (stop) exited in: GenServer.call(#PID<0.3314.0>, {:listen, [:signal]}, 5000)
2021-10-27T22:27:48.041 app[8bd6543d] syd [info]     ** (EXIT) exited in: :gen_server.call(#PID<0.3322.0>, {:listen, "public.oban_gossip"}, 5000)
2021-10-27T22:27:48.041 app[8bd6543d] syd [info]         ** (EXIT) %Postgrex.Error{connection_id: 27144, message: nil, postgres: %{code: :read_only_sql_transaction, file: "utility.c", line: "447", message: "cannot execute LISTEN during recovery", pg_code: "25006", routine: "PreventCommandDuringRecovery", severity: "ERROR", unknown: "ERROR"}, query: nil}
2021-10-27T22:27:48.041 app[8bd6543d] syd [info]     (elixir 1.12.1) lib/gen_server.ex:1024: GenServer.call/3
2021-10-27T22:27:48.041 app[8bd6543d] syd [info]     (oban 2.8.0) lib/oban/midwife.ex:31: Oban.Midwife.handle_continue/2
2021-10-27T22:27:48.041 app[8bd6543d] syd [info]     (stdlib 3.15.1) gen_server.erl:695: :gen_server.try_dispatch/4
2021-10-27T22:27:48.041 app[8bd6543d] syd [info]     (stdlib 3.15.1) gen_server.erl:437: :gen_server.loop/7
2021-10-27T22:27:48.041 app[8bd6543d] syd [info]     (stdlib 3.15.1) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
2021-10-27T22:27:48.041 app[8bd6543d] syd [info] Last message: {:continue, :start}
2021-10-27T22:27:48.041 app[8bd6543d] syd [info] State: %Oban.Midwife.State{conf: %Oban.Config{circuit_backoff: 30000, dispatch_cooldown: 5, engine: Oban.Queue.BasicEngine, get_dynamic_repo: nil, log: false, name: Oban, node: "bk-app-cluster-test@fdaa:0:332b:a7b:2983:8bd6:543d:2", notifier: Oban.PostgresNotifier, plugins: [{Oban.Plugins.Cron, [crontab: [{"@daily", Brandkit.Workers.InstagramWorker}, {"@daily", Brandkit.Workers.NotifyAdminsAboutExpiringAssetsWorker}, {"@daily", Brandkit.Workers.NotifyUsersAboutExpiringAssetsWorker}]]}, Oban.Plugins.Pruner, Oban.Plugins.Stager], prefix: "public", queues: [default: [limit: 10]], repo: Brandkit.Repo.Local, shutdown_grace_period: 15000}}
2021-10-27T22:27:48.041 app[8bd6543d] syd [info] 22:27:48.039 [info] Application brandkit exited: Brandkit.Application.start(:normal, []) returned an error: shutdown: failed to start child: Oban
2021-10-27T22:27:48.041 app[8bd6543d] syd [info]     ** (EXIT) shutdown: failed to start child: "default"
2021-10-27T22:27:48.041 app[8bd6543d] syd [info]         ** (EXIT) shutdown: failed to start child: Oban.Queue.Producer
2021-10-27T22:27:48.041 app[8bd6543d] syd [info]             ** (EXIT) exited in: GenServer.call(#PID<0.3314.0>, {:listen, [:insert, :signal]}, 5000)
2021-10-27T22:27:48.041 app[8bd6543d] syd [info]                 ** (EXIT) exited in: :gen_server.call(#PID<0.3322.0>, {:listen, "public.oban_gossip"}, 5000)
2021-10-27T22:27:48.041 app[8bd6543d] syd [info]                     ** (EXIT) %Postgrex.Error{connection_id: 27144, message: nil, postgres: %{code: :read_only_sql_transaction, file: "utility.c", line: "447", message: "cannot execute LISTEN during recovery", pg_code: "25006", routine: "PreventCommandDuringRecovery", severity: "ERROR", unknown: "ERROR"}, query: nil}

We are close! :smiley:

Yay! Parker Selbert (Oban creator) said the feature to use for that is “queue splitting”. I haven’t done this setup myself. So hopefully we can get it all figured out!
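
My rough, untested understanding of queue splitting is that you only start queues and plugins on nodes running next to the primary database, along these lines in your runtime config (a sketch only, with names taken from your logs):

primary? = System.get_env("FLY_REGION") == System.get_env("PRIMARY_REGION")

# Sketch: replicas get no queues and no plugins, so they never dispatch jobs
config :brandkit, Oban,
  repo: Brandkit.Repo.Local,
  queues: if(primary?, do: [default: 10], else: false),
  plugins: if(primary?, do: [Oban.Plugins.Pruner], else: false)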

Hi @Mark! We did it!
The application works very fast! :raised_hands:

We had to fight with Oban a bit, but we were able to bypass it.
We had problems with the PostgresNotifier, which performs LISTEN on the replicas.
We are still looking for a good solution. But that’s another issue.

Now we see this log many times:

2021-10-28T22:06:34.417 app[9b7b5b68] scl [info] 22:06:34.417 [debug] QUERY OK db=2.0ms queue=1.2ms idle=1052.3ms
2021-10-28T22:06:34.417 app[9b7b5b68] scl [info] select CAST(pg_last_wal_replay_lsn() AS TEXT) []
2021-10-28T22:06:34.304 app[c5fbe636] syd [info] 22:06:34.304 [debug] QUERY OK db=0.6ms queue=0.7ms idle=1018.9ms
2021-10-28T22:06:34.304 app[c5fbe636] syd [info] select CAST(pg_last_wal_replay_lsn() AS TEXT) []
2021-10-28T22:06:34.521 app[9b7b5b68] scl [info] 22:06:34.520 [debug] QUERY OK db=1.1ms queue=1.4ms idle=1052.4ms
2021-10-28T22:06:34.521 app[9b7b5b68] scl [info] select CAST(pg_last_wal_replay_lsn() AS TEXT) []
2021-10-28T22:06:34.406 app[c5fbe636] syd [info] 22:06:34.405 [debug] QUERY OK db=0.3ms queue=0.6ms idle=1018.9ms 

Is this expected behavior?

Should we change the log level?

Awesome!

A couple points:

  • Those queries you are seeing are happening too frequently. Currently it’s running them continuously every 100ms. I realized this the other day and I’m working on a fix. It is supposed to only run those polling queries when a process has performed a write and asked to be notified after the changes have replicated. So that fix will come soon.
  • We’re currently investigating another way of solving this problem. I’ll let you know how that goes. However, I still think the Fly.RPC approach will be valuable.

Awesome that it’s working well for you!

Hey Mark!

Regarding your second point, would you recommend holding off on implementing this for the moment, seeing as you’re still experimenting with ways to accomplish this?
If so, when do you expect to be able to share information about the second solution?

Really appreciate the work here and looking forward to being able to put it to use.

@cschmatzler this codebase is on GitHub at superfly/fly_postgres_elixir (a library for working with local read-replica Postgres databases and performing writes through RPC calls to other nodes in the primary Fly.io region), so it might actually be easier to see updates / get notifications if you’re watching or subscribed to it there.

I doubt there’s a formal changelog just yet, so just tracking the main branch is a good way to keep up to date.

We’ll probably post significant updates or releases once in a while on this forum as well.

Good question. It’s still R&D at the moment, but I think it may just make for better internals in the library.

@fedeotaran I just published fly_postgres v0.1.11. This is smarter about the polling and only polls the replica DB for replication progress when there are pending notification requests.

Also note that a recent change in v0.1.10 requires a config change. Hope those changes are helpful for your testing!
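
For reference, upgrading should just be a matter of bumping the dependency in mix.exs (assuming you’re pulling it from Hex as fly_postgres):

  {:fly_postgres, "~> 0.1.11"}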

Excellent news @Mark! I will upgrade and test it! :raised_hands:

Thanks for the heads-up! :slight_smile:

We are still figuring out how to make Oban work correctly with the fly_postgres repo.
It is not enough to simply configure it with MyApp.Repo.Local.

Would it be better to open an issue about this here, or on the Oban repository’s issue board? What do you think?

Good question. I’m not actually an Oban user… so I can’t give specific advice. However, I’d love to have a good solution for it. I’ll look into that.

Check out this example and see if it helps. It’s a PR against the hello_elixir app that sets up Oban and only activates jobs when running in the primary region. It’s a work-in-progress and I’m asking for feedback from others like you!

EXAMPLE: Oban jobs configured to only run in primary region

Let me know what you think.

Hi @Mark! Nice example. Thanks! :smiley:

We did something similar. But your example is cleaner.

defp oban_opts do
  if primary_region?() do
    Application.get_env(:brandkit, Oban)
  else
    # On replicas: no queues, no plugins, and a no-op notifier
    Application.get_env(:brandkit, Oban)
    |> Keyword.put(:queues, false)
    |> Keyword.put(:plugins, false)
    |> Keyword.put(:notifier, Brandkit.DummyNotifier)
  end
end
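
For context, primary_region?/0 is just a small helper along these lines, comparing the two env vars:

defp primary_region? do
  System.get_env("FLY_REGION") == System.get_env("PRIMARY_REGION")
end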

We also had to create a DummyNotifier and configure it for the replicas, because the PostgresNotifier was executing LISTEN/NOTIFY operations that failed on read-only connections.

We overrode the listen and unlisten callbacks:

  # No-op overrides: read-only replicas can't run LISTEN/NOTIFY
  @impl Oban.Notifier
  def listen(_server, _channels) do
    :ok
  end

  @impl Oban.Notifier
  def unlisten(_server, _channels) do
    :ok
  end

Didn’t you have this problem? :thinking:

Ohhh! Using the original repo (MyApp.Repo) solved this problem.

We first created the DummyNotifier and then switched to using MyApp.Repo.

But these requests (LISTEN/NOTIFY) and job inserts on the replicas run directly against the primary database. Is this correct?

And I have to be sure to set the complete URL (with the region).
I did it like this:

defmodule Brandkit.Oban.Repo do
  use Ecto.Repo,
    otp_app: :brandkit,
    adapter: Ecto.Adapters.Postgres

  # Always connect to the primary database using the full (region-specific) URL
  def init(_type, config) do
    {:ok, Keyword.put(config, :url, Fly.Postgres.primary_db_url())}
  end
end
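
With that, the Oban config just points at this repo instead of the fly_postgres one, roughly (queues and plugins trimmed here):

config :brandkit, Oban,
  repo: Brandkit.Oban.Repo,
  queues: [default: 10]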

I’m seeing that I need to set all three of these ENV variables to get my new app to run locally using iex -S mix phx.server:

export DATABASE_URL=""
export FLY_REGION=local
export PRIMARY_REGION=local

Is that expected? I’d prefer for it to know that I’m in dev mode automatically :slight_smile:

I’m also seeing “Using raw DATABASE_URL. Assumed DEV or TEST environment” in the terminal multiple times – is that expected as well?

$ iex -S mix phx.server
[info] Using raw DATABASE_URL. Assumed DEV or TEST environment
""
[info] Using raw DATABASE_URL. Assumed DEV or TEST environment
[debug] QUERY OK db=0.5ms decode=1.6ms queue=117.2ms idle=0.0ms
select CAST(pg_last_wal_replay_lsn() AS TEXT) []
[info] Running DotsWeb.Endpoint with cowboy 2.9.0 at 127.0.0.1:4000 (http)
[info] Access DotsWeb.Endpoint at http://localhost:4000
""
[info] Using raw DATABASE_URL. Assumed DEV or TEST environment
[info] Using raw DATABASE_URL. Assumed DEV or TEST environment

Also, this page says distributed Postgres isn’t right for write-heavy apps – is this still the case? For instance, if I’m building a note-taking app, I should probably look elsewhere?

Good questions @abreu.jamil.

Yes, the ENV values are needed and “local” is fine. :slight_smile:

The DATABASE_URL is also needed.

Yes, in production it customizes the database URL for the region it’s running in. That happens during database connection setup. You have multiple database connections running in your pool, so you’ll see that message multiple times on startup.