Lost Postgres DB connection

Hi there!

We lost our prod DB connection for the past 2h and I can’t figure out what is happening. Our logs look almost exclusively like this:

2021-12-14T00:33:08.775 app[1687fe90] cdg [info] keeper   | 2021-12-14T00:33:08.774Z	INFO	cmd/keeper.go:1505	our db requested role is master
2021-12-14T00:33:08.776 app[1687fe90] cdg [info] keeper   | 2021-12-14T00:33:08.776Z	INFO	cmd/keeper.go:1543	already master
2021-12-14T00:33:08.806 app[1687fe90] cdg [info] keeper   | 2021-12-14T00:33:08.805Z	INFO	cmd/keeper.go:1676	postgres parameters not changed
2021-12-14T00:33:08.806 app[1687fe90] cdg [info] keeper   | 2021-12-14T00:33:08.806Z	INFO	cmd/keeper.go:1703	postgres hba entries not changed
2021-12-14T00:33:09.348 app[1687fe90] cdg [info] sentinel | 2021-12-14T00:33:09.347Z	WARN	cmd/sentinel.go:276	no keeper info available	{"db": "b967fead", "keeper": "abc02c3a2"}
2021-12-14T00:33:13.876 app[1687fe90] cdg [info] keeper   | 2021-12-14T00:33:13.875Z	INFO	cmd/keeper.go:1505	our db requested role is master
2021-12-14T00:33:13.877 app[1687fe90] cdg [info] keeper   | 2021-12-14T00:33:13.877Z	INFO	cmd/keeper.go:1543	already master
2021-12-14T00:33:13.910 app[1687fe90] cdg [info] keeper   | 2021-12-14T00:33:13.910Z	INFO	cmd/keeper.go:1676	postgres parameters not changed
2021-12-14T00:33:13.910 app[1687fe90] cdg [info] keeper   | 2021-12-14T00:33:13.910Z	INFO	cmd/keeper.go:1703	postgres hba entries not changed
2021-12-14T00:33:14.457 app[1687fe90] cdg [info] sentinel | 2021-12-14T00:33:14.457Z	WARN	cmd/sentinel.go:276	no keeper info available	{"db": "b967fead", "keeper": "abc02c3a2"}
2021-12-14T00:33:18.986 app[1687fe90] cdg [info] keeper   | 2021-12-14T00:33:18.985Z	INFO	cmd/keeper.go:1505	our db requested role is master
2021-12-14T00:33:18.987 app[1687fe90] cdg [info] keeper   | 2021-12-14T00:33:18.987Z	INFO	cmd/keeper.go:1543	already master
2021-12-14T00:33:19.014 app[1687fe90] cdg [info] keeper   | 2021-12-14T00:33:19.014Z	INFO	cmd/keeper.go:1676	postgres parameters not changed
2021-12-14T00:33:19.015 app[1687fe90] cdg [info] keeper   | 2021-12-14T00:33:19.014Z	INFO	cmd/keeper.go:1703	postgres hba entries not changed
2021-12-14T00:33:19.574 app[1687fe90] cdg [info] sentinel | 2021-12-14T00:33:19.573Z	WARN	cmd/sentinel.go:276	no keeper info available	{"db": "b967fead", "keeper": "abc02c3a2"}

Any idea what’s happening here? :confused:

Those logs look healthy. What does flyctl status -a <db> look like? Is your app logging any errors?

Hi Kurt!

flyctl status:

Instances
ID       PROCESS VERSION REGION DESIRED STATUS            HEALTH CHECKS                  RESTARTS CREATED
1687fe90 app     7       cdg    run     running (replica) 3 total, 1 passing, 2 critical 11       2021-12-07T10:54:52Z

App logs:

2021-12-14T02:12:51.294 app[de504b0e] cdg [info] ** (DBConnection.ConnectionError) connection not available and request was dropped from queue after 1997ms. This means requests are coming in and your connection pool cannot serve them fast enough. You can address this by:
2021-12-14T02:12:51.294 app[de504b0e] cdg [info]   1. Ensuring your database is available and that you can connect to it
2021-12-14T02:12:51.294 app[de504b0e] cdg [info]   2. Tracking down slow queries and making sure they are running fast enough
2021-12-14T02:12:51.294 app[de504b0e] cdg [info]   3. Increasing the pool_size (although this increases resource consumption)
2021-12-14T02:12:51.294 app[de504b0e] cdg [info]   4. Allowing requests to wait longer by increasing :queue_target and :queue_interval
2021-12-14T02:12:51.294 app[de504b0e] cdg [info] See DBConnection.start_link/2 for more information
2021-12-14T02:12:51.294 app[de504b0e] cdg [info]     (db_connection 2.4.1) lib/db_connection.ex:902: DBConnection.transaction/3
2021-12-14T02:12:51.294 app[de504b0e] cdg [info]     (oban 2.9.2) lib/oban/plugins/stager.ex:83: anonymous fn/2 in Oban.Plugins.Stager.handle_info/2
2021-12-14T02:12:51.294 app[de504b0e] cdg [info]     (telemetry 0.4.3) /app/deps/telemetry/src/telemetry.erl:272: :telemetry.span/3
2021-12-14T02:12:51.294 app[de504b0e] cdg [info]     (oban 2.9.2) lib/oban/plugins/stager.ex:82: Oban.Plugins.Stager.handle_info/2
2021-12-14T02:12:51.294 app[de504b0e] cdg [info]     (stdlib 3.15.2) gen_server.erl:695: :gen_server.try_dispatch/4
2021-12-14T02:12:51.294 app[de504b0e] cdg [info]     (stdlib 3.15.2) gen_server.erl:771: :gen_server.handle_msg/6
2021-12-14T02:12:51.294 app[de504b0e] cdg [info]     (stdlib 3.15.2) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
2021-12-14T02:12:51.294 app[de504b0e] cdg [info] Last message: :stage
2021-12-14T02:12:51.357 app[327ba65e] lhr [info] 02:12:51.357 [error] Postgrex.Protocol (#PID<0.4155.0>) failed to connect: ** (DBConnection.ConnectionError) tcp connect (***): non-existing domain - :nxdomain

edit:

DB seems to log this here and there:
2021-12-14T02:20:21.879 proxy[0dfeb2a3] cdg [error] Error 2008: App instance hard limit reached

Hey there,

I took a look at your logs and am seeing:

2021-12-14T02:21:29.791 proxy[0dfeb2a3] cdg [error] error.code=2008 error.message="App instance hard limit reached"
2021-12-14T02:21:30.508 proxy[0dfeb2a3] cdg [error] error.code=2008 error.message="App instance hard limit reached"
2021-12-14T02:21:31.558 proxy[0dfeb2a3] cdg [error] error.code=2008 error.message="App instance hard limit reached"
2021-12-14T02:21:32.288 proxy[0dfeb2a3] cdg [error] error.code=2008 error.message="App instance hard limit reached"

I would try pulling down your app’s configuration file and adjusting your service’s concurrency settings.

Here are the steps to do that:

  1. Pull down config file
    fly config save --app <postgres-app>

  2. Update concurrency settings.
    See the App Configuration (fly.toml) docs.

  3. Redeploy
    fly deploy . --app <postgres-app> --image flyio/postgres:<major-pg-version>
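
For step 2, the part of the saved fly.toml to look at is the concurrency block under the service. This is only a rough sketch, assuming the Postgres app exposes a single TCP service; the port and the limit values below are illustrative assumptions, not recommendations, so compare them with what your own file contains and with the load you expect:

```toml
# Illustrative fragment of the fly.toml saved in step 1.
# All numbers here are assumptions; pick limits that fit your own workload.
[[services]]
  internal_port = 5432    # assumed: the port Postgres listens on in this app
  protocol = "tcp"

  [services.concurrency]
    type = "connections"  # limits are counted in concurrent connections
    hard_limit = 1000     # at this count the proxy stops sending new connections to the instance
    soft_limit = 1000     # above this count the instance is treated as busy
```

The "App instance hard limit reached" error is the proxy reporting that hard_limit was hit, so raising it (within what the instance can actually handle) is the change to try first.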

Let me know if that helps!

Worked like a charm! Many, many thanks for your help! :grinning_face_with_smiling_eyes: