Oh! I think you need to remove the ENV PRIMARY_REGION=iad from your Dockerfile. Put it in your fly.toml file:
[env]
PRIMARY_REGION = "iad"
That way things outside the Dockerfile know what your primary region is. Can you try that?
Ohhh! Yes! Migrations ran successfully! But we have an error with the syd instance.
It seems to be a totally different problem, right?
Command: /app/bin/brandkit eval Brandkit.Release.migrate
Starting instance
Configuring virtual machine
Pulling container image
Unpacking image
Preparing kernel init
Configuring firecracker
Starting virtual machine
2021/10/27 22:13:08 listening on [fdaa:0:332b:a7b:21e0:8630:1bfc:2]:22 (DNS: [fdaa::3]:53)
22:13:08.848 [info] Primary DB connection - Running in primary region
22:13:08.893 [info] Primary DB connection - Running in primary region
22:13:08.922 [info] Primary DB connection - Running in primary region
22:13:08.930 [info] Primary DB connection - Running in primary region
Reaped child process with pid: 576 and signal: SIGUSR1, core dumped? false
22:13:10.076 [info] Migrations already up
Main child exited normally with code: 0
Reaped child process with pid: 578 and signal: SIGUSR1, core dumped? false
Starting clean up.
Monitoring Deployment
3 desired, 1 placed, 0 healthy, 1 unhealthy [restarts: 2] [health checks: 1 total, 1 critical]
v19 failed - Failed due to unhealthy allocations - not rolling back to stable job version 19 as current job has same specification
Failed Instances
==> Failure #1
Instance
ID = a9c1e82b
Process =
Version = 19
Region = syd
Desired = run
Status = running
Health Checks = 1 total, 1 critical
Restarts = 2
Created = 1m12s ago
Recent Events
TIMESTAMP TYPE MESSAGE
2021-10-27T22:13:22Z Received Task received by client
2021-10-27T22:13:22Z Task Setup Building Task Directory
2021-10-27T22:13:36Z Started Task started by client
2021-10-27T22:13:52Z Terminated Exit Code: 1
2021-10-27T22:13:52Z Restarting Task restarting in 1.197399231s
2021-10-27T22:14:00Z Started Task started by client
2021-10-27T22:14:16Z Terminated Exit Code: 1
2021-10-27T22:14:16Z Restarting Task restarting in 1.226825213s
2021-10-27T22:14:24Z Started Task started by client
Recent Logs
***v19 failed - Failed due to unhealthy allocations - not rolling back to stable job version 19 as current job has same specification and deploying as v20
Troubleshooting guide at https://fly.io/docs/getting-started/troubleshooting/
Maybe we can disable this region for now.
It looks like the migrations already ran? However, I’m glad to see the “Primary DB connection - Running in primary region” messages.
Is it just the syd region with problems? You can run fly logs to see what happened there. You can also request logs for a specific region if that helps:
fly logs -r syd
When debugging, I also often open a second terminal and run fly status to watch the different instances roll out.
It’s a problem with Oban! I remember you mentioned something about Oban setup in your talk.
I will try to configure it right!
2021-10-27T22:27:48.041 app[8bd6543d] syd [info] State: %Oban.PostgresNotifier.State{circuit: :enabled, conf: %Oban.Config{circuit_backoff: 30000, dispatch_cooldown: 5, engine: Oban.Queue.BasicEngine, get_dynamic_repo: nil, log: false, name: Oban, node: "bk-app-cluster-test@fdaa:0:332b:a7b:2983:8bd6:543d:2", notifier: Oban.PostgresNotifier, plugins: [{Oban.Plugins.Cron, [crontab: [{"@daily", Brandkit.Workers.InstagramWorker}, {"@daily", Brandkit.Workers.NotifyAdminsAboutExpiringAssetsWorker}, {"@daily", Brandkit.Workers.NotifyUsersAboutExpiringAssetsWorker}]]}, Oban.Plugins.Pruner, Oban.Plugins.Stager], prefix: "public", queues: [default: [limit: 10]], repo: Brandkit.Repo.Local, shutdown_grace_period: 15000}, conn: nil, listeners: %{}, name: {:via, Registry, {Oban.Registry, {Oban, Oban.Notifier}}}, reset_timer: nil}
2021-10-27T22:27:48.041 app[8bd6543d] syd [info] 22:27:48.037 [error] GenServer {Oban.Registry, {Oban, Oban.Midwife}} terminating
2021-10-27T22:27:48.041 app[8bd6543d] syd [info] ** (stop) exited in: GenServer.call(#PID<0.3314.0>, {:listen, [:signal]}, 5000)
2021-10-27T22:27:48.041 app[8bd6543d] syd [info] ** (EXIT) exited in: :gen_server.call(#PID<0.3322.0>, {:listen, "public.oban_gossip"}, 5000)
2021-10-27T22:27:48.041 app[8bd6543d] syd [info] ** (EXIT) %Postgrex.Error{connection_id: 27144, message: nil, postgres: %{code: :read_only_sql_transaction, file: "utility.c", line: "447", message: "cannot execute LISTEN during recovery", pg_code: "25006", routine: "PreventCommandDuringRecovery", severity: "ERROR", unknown: "ERROR"}, query: nil}
2021-10-27T22:27:48.041 app[8bd6543d] syd [info] (elixir 1.12.1) lib/gen_server.ex:1024: GenServer.call/3
2021-10-27T22:27:48.041 app[8bd6543d] syd [info] (oban 2.8.0) lib/oban/midwife.ex:31: Oban.Midwife.handle_continue/2
2021-10-27T22:27:48.041 app[8bd6543d] syd [info] (stdlib 3.15.1) gen_server.erl:695: :gen_server.try_dispatch/4
2021-10-27T22:27:48.041 app[8bd6543d] syd [info] (stdlib 3.15.1) gen_server.erl:437: :gen_server.loop/7
2021-10-27T22:27:48.041 app[8bd6543d] syd [info] (stdlib 3.15.1) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
2021-10-27T22:27:48.041 app[8bd6543d] syd [info] Last message: {:continue, :start}
2021-10-27T22:27:48.041 app[8bd6543d] syd [info] State: %Oban.Midwife.State{conf: %Oban.Config{circuit_backoff: 30000, dispatch_cooldown: 5, engine: Oban.Queue.BasicEngine, get_dynamic_repo: nil, log: false, name: Oban, node: "bk-app-cluster-test@fdaa:0:332b:a7b:2983:8bd6:543d:2", notifier: Oban.PostgresNotifier, plugins: [{Oban.Plugins.Cron, [crontab: [{"@daily", Brandkit.Workers.InstagramWorker}, {"@daily", Brandkit.Workers.NotifyAdminsAboutExpiringAssetsWorker}, {"@daily", Brandkit.Workers.NotifyUsersAboutExpiringAssetsWorker}]]}, Oban.Plugins.Pruner, Oban.Plugins.Stager], prefix: "public", queues: [default: [limit: 10]], repo: Brandkit.Repo.Local, shutdown_grace_period: 15000}}
2021-10-27T22:27:48.041 app[8bd6543d] syd [info] 22:27:48.039 [info] Application brandkit exited: Brandkit.Application.start(:normal, []) returned an error: shutdown: failed to start child: Oban
2021-10-27T22:27:48.041 app[8bd6543d] syd [info] ** (EXIT) shutdown: failed to start child: "default"
2021-10-27T22:27:48.041 app[8bd6543d] syd [info] ** (EXIT) shutdown: failed to start child: Oban.Queue.Producer
2021-10-27T22:27:48.041 app[8bd6543d] syd [info] ** (EXIT) exited in: GenServer.call(#PID<0.3314.0>, {:listen, [:insert, :signal]}, 5000)
2021-10-27T22:27:48.041 app[8bd6543d] syd [info] ** (EXIT) exited in: :gen_server.call(#PID<0.3322.0>, {:listen, "public.oban_gossip"}, 5000)
2021-10-27T22:27:48.041 app[8bd6543d] syd [info] ** (EXIT) %Postgrex.Error{connection_id: 27144, message: nil, postgres: %{code: :read_only_sql_transaction, file: "utility.c", line: "447", message: "cannot execute LISTEN during recovery", pg_code: "25006", routine: "PreventCommandDuringRecovery", severity: "ERROR", unknown: "ERROR"}, query: nil}
We are close!
Yay! Parker Selbert (Oban creator) said the feature to use for that is “queue splitting”. I haven’t done this setup myself. So hopefully we can get it all figured out!
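I haven’t done this setup myself either, but as a rough sketch of the “queue splitting” idea using the env vars from this thread, the runtime config might only enable queues and plugins on the primary node. The names and shape below are assumptions, not Parker’s actual recipe:

```elixir
# runtime.exs (sketch, untested): replicas run Oban with no queues or
# plugins, so only the primary region processes jobs.
import Config

primary? = System.get_env("FLY_REGION") == System.get_env("PRIMARY_REGION")

config :brandkit, Oban,
  repo: Brandkit.Repo.Local,
  queues: if(primary?, do: [default: 10], else: false),
  plugins: if(primary?, do: [Oban.Plugins.Pruner], else: false)
```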
Hi @Mark! We did it!
The application works very fast!
We had to wrestle with Oban, but we were able to bypass it.
We have problems with the PostgresNotifier, which performs LISTEN on the replicas.
We are still looking for a good solution, but that’s another issue.
Now we see this log many times:
2021-10-28T22:06:34.417 app[9b7b5b68] scl [info] 22:06:34.417 [debug] QUERY OK db=2.0ms queue=1.2ms idle=1052.3ms
2021-10-28T22:06:34.417 app[9b7b5b68] scl [info] select CAST(pg_last_wal_replay_lsn() AS TEXT) []
2021-10-28T22:06:34.304 app[c5fbe636] syd [info] 22:06:34.304 [debug] QUERY OK db=0.6ms queue=0.7ms idle=1018.9ms
2021-10-28T22:06:34.304 app[c5fbe636] syd [info] select CAST(pg_last_wal_replay_lsn() AS TEXT) []
2021-10-28T22:06:34.521 app[9b7b5b68] scl [info] 22:06:34.520 [debug] QUERY OK db=1.1ms queue=1.4ms idle=1052.4ms
2021-10-28T22:06:34.521 app[9b7b5b68] scl [info] select CAST(pg_last_wal_replay_lsn() AS TEXT) []
2021-10-28T22:06:34.406 app[c5fbe636] syd [info] 22:06:34.405 [debug] QUERY OK db=0.3ms queue=0.6ms idle=1018.9ms
Is this expected behavior?
Should we change the log level?
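In case it’s useful: one way to quiet those per-query debug lines is to turn off query logging on that repo in the production config. This is a sketch; the repo module name matches the one used earlier in this thread:

```elixir
# config/prod.exs (sketch): stop logging every replication-check query.
import Config

config :brandkit, Brandkit.Repo.Local,
  log: false
```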
Awesome!
A couple points:
Awesome that it’s working well for you!
Hey Mark!
Regarding your second point, would you recommend waiting on implementing this for the moment, seeing as you’re still experimenting with ways to accomplish this?
If so, when do you expect to be able to share information about the second solution?
Really appreciate the work here and looking forward to being able to put it to use.
@cschmatzler this codebase is on GitHub at superfly/fly_postgres_elixir (a library for working with local read-replica Postgres databases and performing writes through RPC calls to other nodes in the primary Fly.io region), so it might actually be easier to see updates / get notifications if you’re watching / subscribed to updates there.
I doubt there’s a formal changelog just yet, so just tracking the main branch is a good way to keep up to date.
We’ll probably post significant updates or releases once in a while on this forum as well.
Good question. It’s still R&D at the moment, but I think it may just make for better internals in the library.
@fedeotaran I just published fly_postgres v0.1.11. This is smarter about the polling and only polls the replica DB for replication progress when there are pending notification requests.
Also note that a recent change in v0.1.10 requires a config change. Hope those changes are helpful for your testing!
Excellent news @Mark! I will upgrade and test it!
Thanks for advising!
We are still looking into how to make Oban work correctly with the fly_postgres repo.
It is not enough to simply configure it with MyApp.Repo.Local.
Is it better to open an issue about this here, or on the Oban repository’s issue board? What do you think?
Good question. I’m not actually an Oban user… so I can’t give specific advice. However, I’d love to have a good solution for it. I’ll look into that.
Check out this example and see if it helps. It’s a PR against the hello_elixir app that sets up Oban and only activates jobs when running in the primary region. It’s a work-in-progress and I’m asking for feedback from others like you!
EXAMPLE: Oban jobs configured to only run in primary region
Let me know what you think.
Hi @Mark! Nice example. Thanks!
We did something similar. But your example is cleaner.
# Returns Oban options for this node. Replicas get no queues, no
# plugins, and a no-op notifier so they never write to the database.
defp oban_opts do
  if primary_region?() do
    Application.get_env(:brandkit, Oban)
  else
    Application.get_env(:brandkit, Oban)
    |> Keyword.put(:queues, false)
    |> Keyword.put(:plugins, false)
    |> Keyword.put(:notifier, Brandkit.DummyNotifier)
  end
end
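For context, a helper like this would typically feed the Oban child spec in the application supervisor. This is a sketch; the surrounding module and the other children are assumptions:

```elixir
# application.ex (sketch): start Oban with the region-aware options.
def start(_type, _args) do
  children = [
    Brandkit.Repo.Local,
    BrandkitWeb.Endpoint,
    {Oban, oban_opts()}
  ]

  Supervisor.start_link(children, strategy: :one_for_one, name: Brandkit.Supervisor)
end
```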
We also had to create a DummyNotifier and configure it for the replicas, because the PostgresNotifier was executing LISTEN/NOTIFY operations that failed on read-only connections.
We override the listen and unlisten callbacks:
# No-op on replicas: LISTEN/UNLISTEN would fail on a read-only connection.
@impl Oban.Notifier
def listen(_server, _channels), do: :ok

@impl Oban.Notifier
def unlisten(_server, _channels), do: :ok
Didn’t you have this problem?
Ohhh! Using the original repo (MyApp.Repo) solved this problem.
We first created the DummyNotifier and then switched to using MyApp.Repo.
But these requests (LISTEN/NOTIFY) and job inserts on the replicas run directly against the primary database. Is this correct?
And I had to be sure to set the complete URL (with the region).
I did it like this:
defmodule Brandkit.Oban.Repo do
  use Ecto.Repo,
    otp_app: :brandkit,
    adapter: Ecto.Adapters.Postgres

  # Always connect to the primary region's database so LISTEN/NOTIFY
  # and job inserts work even when running on a replica node.
  def init(_type, config) do
    {:ok, Keyword.put(config, :url, Fly.Postgres.primary_db_url())}
  end
end
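Presumably Oban is then pointed at this repo in config, so notifier traffic and job inserts go straight to the primary. This is a sketch based on the module above; your actual config keys may differ:

```elixir
# config.exs (sketch): run Oban through the primary-only repo.
import Config

config :brandkit, Oban,
  repo: Brandkit.Oban.Repo,
  queues: [default: 10]
```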
I’m seeing that I need to set all three of these ENV variables to get my new app to run locally using iex -S mix phx.server:
export DATABASE_URL=""
export FLY_REGION=local
export PRIMARY_REGION=local
Is that expected? I’d prefer for it to know that I’m in dev mode automatically.
I’m also seeing “Using raw DATABASE_URL. Assumed DEV or TEST environment” in the terminal multiple times – is that expected as well?
$ iex -S mix phx.server
[info] Using raw DATABASE_URL. Assumed DEV or TEST environment
""
[info] Using raw DATABASE_URL. Assumed DEV or TEST environment
[debug] QUERY OK db=0.5ms decode=1.6ms queue=117.2ms idle=0.0ms
select CAST(pg_last_wal_replay_lsn() AS TEXT) []
[info] Running DotsWeb.Endpoint with cowboy 2.9.0 at 127.0.0.1:4000 (http)
[info] Access DotsWeb.Endpoint at http://localhost:4000
""
[info] Using raw DATABASE_URL. Assumed DEV or TEST environment
[info] Using raw DATABASE_URL. Assumed DEV or TEST environment
Also, this page says distributed Postgres isn’t right for write-heavy apps – is this still the case? For instance, if I’m building a note-taking app, I should probably look elsewhere?
Good questions @abreu.jamil.
Yes, the ENV values are needed and “local” is fine.
The DATABASE_URL is also needed.
Yes, in production it customizes the database URL for the region it’s running in. That happens during database connection setup. You have multiple database connections running in your pool, so you’ll see that message multiple times on startup.
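If exporting those by hand gets tedious in dev, one possible workaround (my own assumption, not a library feature) is to default them in runtime.exs for the dev environment only:

```elixir
# runtime.exs (sketch): default the Fly region vars when running locally,
# so `iex -S mix phx.server` works without manual exports.
import Config

if config_env() == :dev do
  System.put_env("FLY_REGION", System.get_env("FLY_REGION") || "local")
  System.put_env("PRIMARY_REGION", System.get_env("PRIMARY_REGION") || "local")
end
```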