Scaling Postgres volume - zero downtime

Hi there! We’re trying to scale up the volume of our PG cluster (we started with 1GB and now want to increase it to 10GB). We’re following the steps described here.

Steps we’re taking (rough commands sketched after this list):

  1. Create a new, larger volume plus a read replica in the same region
  2. Scale the Postgres app to 2 instances (and wait until stable)
  3. Delete the old volume
  4. Scale back down to 1
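
Roughly, the commands behind each step look like this (pg_data, my-db, and the volume ID are placeholders, and the flags are from memory, so treat this as a sketch rather than an exact recipe):

  # 1. create a new, larger volume in the same region
  fly volumes create pg_data --region sin --size 10 -a my-db

  # 2. bring up a replica on the new volume
  fly scale count 2 -a my-db

  # 3. once the replica is healthy, find and delete the old 1GB volume
  fly volumes list -a my-db
  fly volumes delete vol_xxxxxxxx   # ID of the old volume, from the list output

  # 4. scale back down to a single instance
  fly scale count 1 -a my-db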

Setup:

  • We’re rehearsing this on a fresh DB cluster restored from volume snapshots of production before we do it for real.
  • Our app is running locally, connected to this fresh DB cluster via flyctl proxy (connection setup sketched after this list)
  • We’re running a Phoenix/Elixir app with Oban for background processing
  • We primarily use websockets
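
For completeness, the local setup looks roughly like this (my-db, my_app_db, and the credentials are placeholders):

  # forward the Postgres app's port 5432 to localhost
  flyctl proxy 5432 -a my-db

  # point the locally running app at the proxied port
  export DATABASE_URL="postgres://postgres:PASSWORD@localhost:5432/my_app_db"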

Process:
We have no issues with steps 1 and 2, but every time we reach step 3 and delete the old volume, our application loses its connection. After a minute or so it recovers on its own. We also note that the old PG instance (the leader) is removed at that point. Errors shown below:

[error] Postgrex.Protocol (#PID<0.1797.0>) disconnected: ** (DBConnection.ConnectionError) tcp recv: closed
[error] GenServer {Oban.Registry, {Oban, {:plugin, Oban.Plugins.Stager}}} terminating
** (DBConnection.ConnectionError) tcp recv: closed
    (db_connection 2.4.1) lib/db_connection.ex:902: DBConnection.transaction/3
    (oban 2.11.3) lib/oban/plugins/stager.ex:81: anonymous fn/2 in Oban.Plugins.Stager.handle_info/2
    (telemetry 1.1.0) /[redacted]/deps/telemetry/src/telemetry.erl:320: :telemetry.span/3
    (oban 2.11.3) lib/oban/plugins/stager.ex:80: Oban.Plugins.Stager.handle_info/2
    (stdlib 3.17.1) gen_server.erl:695: :gen_server.try_dispatch/4
    (stdlib 3.17.1) gen_server.erl:771: :gen_server.handle_msg/6
    (stdlib 3.17.1) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
Last message: :stage
State: %Oban.Plugins.Stager.State{conf: %Oban.Config{dispatch_cooldown: 5, engine: Oban.Queue.BasicEngine, get_dynamic_repo: nil, log: false, name: Oban, node: "[redacted]", notifier: Oban.Notifiers.Postgres, peer: Oban.Peer, plugins: [Oban.Plugins.Stager], prefix: "private", queues: [post_registration: [limit: 2], list_events: [limit: 22], process_event: [limit: 12], register_webhook: [limit: 2]], repo: Ajourney.Repo, shutdown_grace_period: 15000}, interval: 1000, limit: 5000, name: {:via, Registry, {Oban.Registry, {Oban, {:plugin, Oban.Plugins.Stager}}}}, timer: #Reference<0.933448173.222822404.191439>}
[error] Postgrex.Protocol (#PID<0.1797.0>) failed to connect: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1802.0>) disconnected: ** (DBConnection.ConnectionError) tcp recv: closed
[error] GenServer {Oban.Registry, {Oban, {:plugin, Oban.Plugins.Stager}}} terminating
** (DBConnection.ConnectionError) tcp recv: closed
    (db_connection 2.4.1) lib/db_connection.ex:902: DBConnection.transaction/3
    (oban 2.11.3) lib/oban/plugins/stager.ex:81: anonymous fn/2 in Oban.Plugins.Stager.handle_info/2
    (telemetry 1.1.0) /[redacted]/deps/telemetry/src/telemetry.erl:320: :telemetry.span/3
    (oban 2.11.3) lib/oban/plugins/stager.ex:80: Oban.Plugins.Stager.handle_info/2
    (stdlib 3.17.1) gen_server.erl:695: :gen_server.try_dispatch/4
    (stdlib 3.17.1) gen_server.erl:771: :gen_server.handle_msg/6
    (stdlib 3.17.1) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
Last message: :stage
State: %Oban.Plugins.Stager.State{conf: %Oban.Config{dispatch_cooldown: 5, engine: Oban.Queue.BasicEngine, get_dynamic_repo: nil, log: false, name: Oban, node: "[redacted]", notifier: Oban.Notifiers.Postgres, peer: Oban.Peer, plugins: [Oban.Plugins.Stager], prefix: "private", queues: [post_registration: [limit: 2], list_events: [limit: 22], process_event: [limit: 12], register_webhook: [limit: 2]], repo: Ajourney.Repo, shutdown_grace_period: 15000}, interval: 1000, limit: 5000, name: {:via, Registry, {Oban.Registry, {Oban, {:plugin, Oban.Plugins.Stager}}}}, timer: #Reference<0.933448173.222822404.191477>}
[error] Postgrex.Protocol (#PID<0.1802.0>) failed to connect: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1799.0>) disconnected: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1803.0>) disconnected: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1795.0>) disconnected: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1794.0>) disconnected: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1800.0>) disconnected: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1798.0>) disconnected: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1796.0>) disconnected: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1801.0>) disconnected: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1800.0>) failed to connect: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1794.0>) failed to connect: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1801.0>) failed to connect: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1803.0>) failed to connect: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1799.0>) failed to connect: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1795.0>) failed to connect: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1796.0>) failed to connect: ** (DBConnection.ConnectionError) tcp recv: closed
[error] Postgrex.Protocol (#PID<0.1798.0>) failed to connect: ** (DBConnection.ConnectionError) tcp recv: closed

Once things are stable again, we proceed with the last step and scale back down to 1 with fly scale count 1.

We also noticed that it took a few minutes for the remaining replica to be promoted to leader (status output below):

ID              PROCESS VERSION REGION  DESIRED STATUS                  HEALTH CHECKS           RESTARTS        CREATED
a90f7891        app     5       sin     run     running (replica)        3 total, 3 passing      0               4m02s ago
ID              PROCESS VERSION REGION  DESIRED STATUS                  HEALTH CHECKS           RESTARTS        CREATED
a90f7891        app     5       sin     run     running (leader)        3 total, 3 passing      0               16m16s ago
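
For reference, a couple of commands that show this role change (my-db is a placeholder for the Postgres app name, and the exact check names may differ between postgres image versions):

  # instance table like the one above
  fly status -a my-db

  # per-instance health checks; the role check reports leader vs. replica
  fly checks list -a my-db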

We have a few questions:

  1. Is it possible to achieve zero downtime, or should we just accept a few minutes of downtime?
  2. After scaling down, does the delay before the remaining instance is elected leader affect write availability?
  3. How long does it typically take for the remaining replica to be elected as leader?

Thank you!

After some more digging, I suspect we’re losing the connection because our local app instance connects to PG through the proxied connection (using flyctl proxy). Could it be that deleting the old volume shuts down the leader instance, which in turn drops the flyctl proxy connection? So technically this wouldn’t be an issue if we weren’t using the local proxy, since fly-proxy would be able to resolve the right instance?

Oh yes, that’d do it. When you shut down a Postgres instance, we update the DNS entry for <db-name>.internal. Ecto should then reconnect to the other node when the app is running in our environment. fly proxy isn’t that smart: it connects you to a specific IP, and when that IP stops working it’s in trouble.
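
To make that concrete: when the app runs on Fly, the attached DATABASE_URL points at the cluster's <db-name>.internal hostname rather than a fixed IP, so reconnects follow the updated DNS entry. A rough sketch (app names and credentials below are placeholders, not taken from this thread):

  # inside Fly, the connection string uses the internal hostname, e.g.:
  fly secrets set DATABASE_URL="postgres://postgres:PASSWORD@my-db.internal:5432/my_app_db" -a my-app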

Ahh great, thanks for confirming!