Hi there! We’re trying to scale up the volume of our PG cluster from 1 GB to 10 GB. We’re following the steps described here.
Steps we’re taking (rough commands below):
1. Create the new, larger volume + read replica in the same region
2. Scale to 2 (and wait until stable)
3. Delete the old volume
4. Scale back down to 1
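These are roughly the commands we run for each step (the app name, volume name, and volume ID are placeholders for our staging cluster):

# 1. Create the new, larger volume in the same region as the existing one
fly volumes create pg_data --region sin --size 10 -a our-pg-staging
# 2. Scale to 2 so a replica comes up on the new volume, then wait until it is healthy
fly scale count 2 -a our-pg-staging
fly status -a our-pg-staging
# 3. Find and delete the old 1 GB volume
fly volumes list -a our-pg-staging
fly volumes destroy vol_xxxxxxxxxxxx
# 4. Scale back down to 1
fly scale count 1 -a our-pg-staging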
Setup:
- We’re testing this on a fresh DB cluster restored from volume snapshots of production before we do it for real.
- We have our app running locally, but connected to this fresh DB cluster via flyctl proxy (example below)
- We’re running a Phoenix/Elixir app with Oban for background processing
- We primarily use websockets
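The proxy invocation is the standard one, roughly (local port and app name are placeholders):

flyctl proxy 5433:5432 -a our-pg-staging

with the app’s local database config pointed at localhost:5433.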
Process:
We have no issues with steps 1 and 2, but every time we reach step 3 and delete the old volume, our application loses its database connection. After a minute or so it recovers on its own. We also noticed that the old PG instance (the leader) was removed. Errors are shown below:
[error] Postgrex.Protocol (#PID<0.1797.0>) disconnected: ** (DBConnection.ConnectionError) tcp recv: closed
[error] GenServer {Oban.Registry, {Oban, {:plugin, Oban.Plugins.Stager}}} terminating
** (DBConnection.ConnectionError) tcp recv: closed
(db_connection 2.4.1) lib/db_connection.ex:902: DBConnection.transaction/3
(oban 2.11.3) lib/oban/plugins/stager.ex:81: anonymous fn/2 in Oban.Plugins.Stager.handle_info/2
(telemetry 1.1.0) /[redacted]/deps/telemetry/src/telemetry.erl:320: :telemetry.span/3
(oban 2.11.3) lib/oban/plugins/stager.ex:80: Oban.Plugins.Stager.handle_info/2
(stdlib 3.17.1) gen_server.erl:695: :gen_server.try_dispatch/4
(stdlib 3.17.1) gen_server.erl:771: :gen_server.handle_msg/6
(stdlib 3.17.1) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
Last message: :stage
State: %Oban.Plugins.Stager.State{conf: %Oban.Config{dispatch_cooldown: 5, engine: Oban.Queue.BasicEngine, get_dynamic_repo: nil, log: false, name: Oban, node: "[redacted]", notifier: Oban.Notifiers.Postgres, peer: Oban.Peer, plugins: [Oban.Plugins.Stager], prefix: "private", queues: [post_registration: [limit: 2], list_events: [limit: 22], process_event: [limit: 12], register_webhook: [limit: 2]], repo: Ajourney.Repo, shutdown_grace_period: 15000}, interval: 1000, limit: 5000, name: {:via, Registry, {Oban.Registry, {Oban, {:plugin, Oban.Plugins.Stager}}}}, timer: #Reference<0.933448173.222822404.191439>}
[error] Postgrex.Protocol (#PID<0.1797.0>) failed to connect: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1802.0>) disconnected: ** (DBConnection.ConnectionError) tcp recv: closed
[error] GenServer {Oban.Registry, {Oban, {:plugin, Oban.Plugins.Stager}}} terminating
** (DBConnection.ConnectionError) tcp recv: closed
(db_connection 2.4.1) lib/db_connection.ex:902: DBConnection.transaction/3
(oban 2.11.3) lib/oban/plugins/stager.ex:81: anonymous fn/2 in Oban.Plugins.Stager.handle_info/2
(telemetry 1.1.0) /[redacted]/deps/telemetry/src/telemetry.erl:320: :telemetry.span/3
(oban 2.11.3) lib/oban/plugins/stager.ex:80: Oban.Plugins.Stager.handle_info/2
(stdlib 3.17.1) gen_server.erl:695: :gen_server.try_dispatch/4
(stdlib 3.17.1) gen_server.erl:771: :gen_server.handle_msg/6
(stdlib 3.17.1) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
Last message: :stage
State: %Oban.Plugins.Stager.State{conf: %Oban.Config{dispatch_cooldown: 5, engine: Oban.Queue.BasicEngine, get_dynamic_repo: nil, log: false, name: Oban, node: "[redacted]", notifier: Oban.Notifiers.Postgres, peer: Oban.Peer, plugins: [Oban.Plugins.Stager], prefix: "private", queues: [post_registration: [limit: 2], list_events: [limit: 22], process_event: [limit: 12], register_webhook: [limit: 2]], repo: Ajourney.Repo, shutdown_grace_period: 15000}, interval: 1000, limit: 5000, name: {:via, Registry, {Oban.Registry, {Oban, {:plugin, Oban.Plugins.Stager}}}}, timer: #Reference<0.933448173.222822404.191477>}
[error] Postgrex.Protocol (#PID<0.1802.0>) failed to connect: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1799.0>) disconnected: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1803.0>) disconnected: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1795.0>) disconnected: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1794.0>) disconnected: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1800.0>) disconnected: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1798.0>) disconnected: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1796.0>) disconnected: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1801.0>) disconnected: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1800.0>) failed to connect: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1794.0>) failed to connect: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1801.0>) failed to connect: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1803.0>) failed to connect: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1799.0>) failed to connect: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1795.0>) failed to connect: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1796.0>) failed to connect: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1798.0>) failed to connect: ** (DBConnection.ConnectionError) tcp recv: closed
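These errors start the moment we destroy the old volume in step 3 and stop once a new leader is up. While it happens we keep an eye on the Postgres app itself with (app name is a placeholder):

fly logs -a our-pg-staging
fly status -a our-pg-staging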
After things are stable again, we proceed with the last step and scale back down to 1 with fly scale count 1.
We also noticed that it took a few minutes for the remaining replica to be promoted to leader:
ID PROCESS VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
a90f7891 app 5 sin run running (replica) 3 total, 3 passing 0 4m02s ago
ID PROCESS VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
a90f7891 app 5 sin run running (leader) 3 total, 3 passing 0 16m16s ago
We have a few questions:
- Is it possible to do this with zero downtime, or should we just accept a few minutes of downtime?
- After scaling down, does the delay before the remaining instance is re-elected leader affect write availability?
- How long does it typically take for the remaining replica to be elected as leader?
Thank you!