Scaling Postgres volume - zero downtime

Hi there! We’re trying to scale up the volume of our PG cluster (we started with 1GB and now want to increase it to 10GB). We’re following the steps described here.

Steps we’re taking (rough commands sketched after this list):

  1. Create a new, larger volume plus a read replica in the same region
  2. Scale the Postgres app to 2 instances (and wait until stable)
  3. Delete the old volume
  4. Scale back down to 1
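
Roughly, the commands behind each step look like this (pg_data, my-db, and the volume ID are placeholders, and the flags are from memory, so treat this as a sketch rather than an exact recipe):

  # 1. create a new, larger volume in the same region
  fly volumes create pg_data --region sin --size 10 -a my-db

  # 2. bring up a replica on the new volume
  fly scale count 2 -a my-db

  # 3. once the replica is healthy, find and delete the old 1GB volume
  fly volumes list -a my-db
  fly volumes delete vol_xxxxxxxx   # ID of the old volume, from the list output

  # 4. scale back down to a single instance
  fly scale count 1 -a my-db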

Setup:

  • We’re rehearsing this on a fresh DB cluster restored from volume snapshots of production before we do it for real.
  • Our app is running locally, connected to this fresh DB cluster via flyctl proxy (connection setup sketched after this list)
  • We’re running a Phoenix/Elixir app with Oban for background processing
  • We primarily use websockets
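
For completeness, the local setup looks roughly like this (my-db, my_app_db, and the credentials are placeholders):

  # forward the Postgres app's port 5432 to localhost
  flyctl proxy 5432 -a my-db

  # point the locally running app at the proxied port
  export DATABASE_URL="postgres://postgres:PASSWORD@localhost:5432/my_app_db"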

Process:
We have no issues with steps 1 and 2, but every time we reach step 3 and delete the old volume, our application loses its connection. After a minute or so it recovers on its own. We also note that the old PG instance (the leader) is removed at that point. Errors shown below:

[error] Postgrex.Protocol (#PID<0.1797.0>) disconnected: ** (DBConnection.ConnectionError) tcp recv: closed
[error] GenServer {Oban.Registry, {Oban, {:plugin, Oban.Plugins.Stager}}} terminating
** (DBConnection.ConnectionError) tcp recv: closed
    (db_connection 2.4.1) lib/db_connection.ex:902: DBConnection.transaction/3
    (oban 2.11.3) lib/oban/plugins/stager.ex:81: anonymous fn/2 in Oban.Plugins.Stager.handle_info/2
    (telemetry 1.1.0) /[redacted]/deps/telemetry/src/telemetry.erl:320: :telemetry.span/3
    (oban 2.11.3) lib/oban/plugins/stager.ex:80: Oban.Plugins.Stager.handle_info/2
    (stdlib 3.17.1) gen_server.erl:695: :gen_server.try_dispatch/4
    (stdlib 3.17.1) gen_server.erl:771: :gen_server.handle_msg/6
    (stdlib 3.17.1) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
Last message: :stage
State: %Oban.Plugins.Stager.State{conf: %Oban.Config{dispatch_cooldown: 5, engine: Oban.Queue.BasicEngine, get_dynamic_repo: nil, log: false, name: Oban, node: "[redacted]", notifier: Oban.Notifiers.Postgres, peer: Oban.Peer, plugins: [Oban.Plugins.Stager], prefix: "private", queues: [post_registration: [limit: 2], list_events: [limit: 22], process_event: [limit: 12], register_webhook: [limit: 2]], repo: Ajourney.Repo, shutdown_grace_period: 15000}, interval: 1000, limit: 5000, name: {:via, Registry, {Oban.Registry, {Oban, {:plugin, Oban.Plugins.Stager}}}}, timer: #Reference<0.933448173.222822404.191439>}
[error] Postgrex.Protocol (#PID<0.1797.0>) failed to connect: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1802.0>) disconnected: ** (DBConnection.ConnectionError) tcp recv: closed
[error] GenServer {Oban.Registry, {Oban, {:plugin, Oban.Plugins.Stager}}} terminating
** (DBConnection.ConnectionError) tcp recv: closed
    (db_connection 2.4.1) lib/db_connection.ex:902: DBConnection.transaction/3
    (oban 2.11.3) lib/oban/plugins/stager.ex:81: anonymous fn/2 in Oban.Plugins.Stager.handle_info/2
    (telemetry 1.1.0) /[redacted]/deps/telemetry/src/telemetry.erl:320: :telemetry.span/3
    (oban 2.11.3) lib/oban/plugins/stager.ex:80: Oban.Plugins.Stager.handle_info/2
    (stdlib 3.17.1) gen_server.erl:695: :gen_server.try_dispatch/4
    (stdlib 3.17.1) gen_server.erl:771: :gen_server.handle_msg/6
    (stdlib 3.17.1) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
Last message: :stage
State: %Oban.Plugins.Stager.State{conf: %Oban.Config{dispatch_cooldown: 5, engine: Oban.Queue.BasicEngine, get_dynamic_repo: nil, log: false, name: Oban, node: "[redacted]", notifier: Oban.Notifiers.Postgres, peer: Oban.Peer, plugins: [Oban.Plugins.Stager], prefix: "private", queues: [post_registration: [limit: 2], list_events: [limit: 22], process_event: [limit: 12], register_webhook: [limit: 2]], repo: Ajourney.Repo, shutdown_grace_period: 15000}, interval: 1000, limit: 5000, name: {:via, Registry, {Oban.Registry, {Oban, {:plugin, Oban.Plugins.Stager}}}}, timer: #Reference<0.933448173.222822404.191477>}
[error] Postgrex.Protocol (#PID<0.1802.0>) failed to connect: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1799.0>) disconnected: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1803.0>) disconnected: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1795.0>) disconnected: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1794.0>) disconnected: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1800.0>) disconnected: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1798.0>) disconnected: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1796.0>) disconnected: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1801.0>) disconnected: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1800.0>) failed to connect: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1794.0>) failed to connect: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1801.0>) failed to connect: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1803.0>) failed to connect: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1799.0>) failed to connect: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1795.0>) failed to connect: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1796.0>) failed to connect: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1798.0>) failed to connect: ** (DBConnection.ConnectionError) tcp recv: closed

Once things are stable again, we proceed with the last step and scale back down to 1 with fly scale count 1.

We also noticed that it took a few minutes for the remaining replica to be promoted to leader (status output below):

ID              PROCESS VERSION REGION  DESIRED STATUS                  HEALTH CHECKS           RESTARTS        CREATED
a90f7891        app     5       sin     run     running (replica)        3 total, 3 passing      0               4m02s ago
ID              PROCESS VERSION REGION  DESIRED STATUS                  HEALTH CHECKS           RESTARTS        CREATED
a90f7891        app     5       sin     run     running (leader)        3 total, 3 passing      0               16m16s ago
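
For reference, a couple of commands that show this role change (my-db is a placeholder for the Postgres app name, and the exact check names may differ between postgres image versions):

  # instance table like the one above
  fly status -a my-db

  # per-instance health checks; the role check reports leader vs. replica
  fly checks list -a my-db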

We have a few questions:

  1. Is it possible to achieve zero downtime, or should we just accept a few minutes of downtime?
  2. After scaling down, does the delay before the remaining instance is elected leader affect write availability?
  3. How long does it typically take for the remaining replica to be elected as leader?

Thank you!

After some more digging, I suspect we’re losing the connection because our local app instance connects to PG through the proxied connection (using flyctl proxy). Could it be that deleting the old volume shuts down the leader instance, which in turn drops the flyctl proxy connection? So technically this wouldn’t be an issue if we weren’t using the local proxy, since fly-proxy would be able to resolve the right instance?

Oh yes, that’d do it. When you shut down a Postgres instance, we update the DNS entry for <db-name>.internal. Ecto should then reconnect to the other node when the app is running in our environment. fly proxy isn’t that smart: it connects you to a specific IP, and when that IP stops working it’s in trouble.
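
To make that concrete: when the app runs on Fly, the attached DATABASE_URL points at the cluster's <db-name>.internal hostname rather than a fixed IP, so reconnects follow the updated DNS entry. A rough sketch (app names and credentials below are placeholders, not taken from this thread):

  # inside Fly, the connection string uses the internal hostname, e.g.:
  fly secrets set DATABASE_URL="postgres://postgres:PASSWORD@my-db.internal:5432/my_app_db" -a my-app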

Ahh great, thanks for confirming!