Postgres replica failed to connect

I somehow managed to bring down a replica and get the primary to reject all writes while running a data import.

I could still ssh into the replica throughout the connection interruption, which revealed that its data volume was almost completely empty, even though the primary already held plenty (~ 20GB) of data at the point.

2022-06-02T10:50:45.668 app[8980e284] iad [info] keeper | 2022-06-02T10:50:45.667Z INFO	cmd/keeper.go:1675 postgres parameters not changed
2022-06-02T10:50:45.668 app[8980e284] iad [info] keeper | 2022-06-02T10:50:45.668Z INFO	cmd/keeper.go:1702 postgres hba entries not changed
2022-06-02T10:50:50.716 app[8980e284] iad [info] keeper | 2022-06-02T10:50:50.716Z INFO	cmd/keeper.go:1504 our db requested role is master
2022-06-02T10:50:50.717 app[8980e284] iad [info] keeper | 2022-06-02T10:50:50.717Z INFO	cmd/keeper.go:1542 already master
2022-06-02T10:50:50.742 app[8980e284] iad [info] keeper | 2022-06-02T10:50:50.741Z INFO	cmd/keeper.go:1675 postgres parameters not changed
2022-06-02T10:50:50.742 app[8980e284] iad [info] keeper | 2022-06-02T10:50:50.742Z INFO	cmd/keeper.go:1702 postgres hba entries not changed
2022-06-02T10:50:52.442 app[017ee4ea] iad [info] exporter | INFO[61731] Established new database connection to "fdaa:0:6931:a7b:775b:1:11c8:2:5433". source="postgres_exporter.go:970"
2022-06-02T10:50:53.443 app[017ee4ea] iad [info] exporter | ERRO[61732] Error opening connection to database (postgresql://flypgadmin:PASSWORD_REMOVED@[fdaa:0:6931:a7b:775b:1:11c8:2]:5433/postgres?sslmode=disable): dial tcp [fdaa:0:6931:a7b:775b:1:11c8:2]:5433: connect: connection refused source="postgres_exporter.go:1658" 
App
  Name     = <app>
  Owner    = personal
  Version  = 0
  Status   = running
  Hostname = <hostname>

Instances
ID      	PROCESS	VERSION	REGION	DESIRED	STATUS                	HEALTH CHECKS                 	RESTARTS	CREATED
017ee4ea	app    	0      	iad   	run    	running (failed to co)	3 total, 3 critical           	0       	17h14m ago
8980e284	app    	0      	iad   	run    	running (leader)      	3 total, 2 passing, 1 critical	0       	17h14m ago

Later on a pgbasebackup was also interrupted by what I believe to be a shared memory overrun.

2022-06-02T10:55:08.447 app[017ee4ea] iad [info] exporter | ERRO[61987] Error opening connection to database (postgresql://flypgadmin:PASSWORD_REMOVED@[fdaa:0:6931:a7b:775b:1:11c8:2]:5433/postgres?sslmode=disable): dial tcp [fdaa:0:6931:a7b:775b:1:11c8:2]:5433: connect: connection refused source="postgres_exporter.go:1658"
2022-06-02T10:55:09.005 app[017ee4ea] iad [info] keeper | 8343122/18361529 kB (45%), 0/1 tablespace (/data/postgres/base/16388/35371 )
2022-06-02T10:55:09.963 app[8980e284] iad [info] keeper | 2022-06-02T10:55:09.936Z INFO	cmd/keeper.go:1504 our db requested role is master
2022-06-02T10:55:09.963 app[8980e284] iad [info] keeper | 2022-06-02T10:55:09.952Z INFO	cmd/keeper.go:1542 already master
2022-06-02T10:55:10.005 app[017ee4ea] iad [info] keeper | 8497874/18361529 kB (46%), 0/1 tablespace (/data/postgres/base/16388/35371 )
2022-06-02T10:55:10.120 app[8980e284] iad [info] keeper | 2022-06-02T10:55:10.114Z INFO	cmd/keeper.go:1675 postgres parameters not changed
2022-06-02T10:55:10.133 app[8980e284] iad [info] keeper | 2022-06-02T10:55:10.131Z INFO	cmd/keeper.go:1702 postgres hba entries not changed
2022-06-02T10:55:11.006 app[017ee4ea] iad [info] keeper | 8659347/18361529 kB (47%), 0/1 tablespace (/data/postgres/base/16388/34804 )
2022-06-02T10:55:12.005 app[017ee4ea] iad [info] keeper | 8804467/18361529 kB (47%), 0/1 tablespace (/data/postgres/base/16388/34804 )
2022-06-02T10:55:12.579 app[8980e284] iad [info] [61996.032603] Out of memory: Killed process 725 (postgres) total-vm:155772kB, anon-rss:1844kB, file-rss:0kB, shmem-rss:66868kB, UID:1000 pgtables:272kB oom_score_adj:0
2022-06-02T10:55:12.592 app[8980e284] iad [info] [61996.045008] Out of memory: Killed process 724 (postgres) total-vm:156124kB, anon-rss:2184kB, file-rss:0kB, shmem-rss:65584kB, UID:1000 pgtables:272kB oom_score_adj:0
2022-06-02T10:55:12.710 app[017ee4ea] iad [info] keeper | pg_basebackup: error: could not receive data from WAL stream: server closed the connection unexpectedly

The cluster in question has now recovered, but I don’t know how much data was lost, as the import stopped halfway through.

1 Like

Update: I ran the import again it seems that I mixed up the order in which errors occurred the first round.

  1. Out of memory killer brings down replica (and or primary, not sure on that one)
  2. Replica enters recovery mode and therefor refuses to accept connections
  3. Cluster recovers once replica exits recovery mode
2022-06-02T12:18:08.169 app[8980e284] iad [info] [66971.489812] Out of memory: Killed process 24571 (postgres) total-vm:155744kB, anon-rss:1896kB, file-rss:0kB, shmem-rss:70988kB, UID:1000 pgtables:276kB oom_score_adj:0
2022-06-02T12:18:08.209 app[8980e284] iad [info] [66971.529506] Out of memory: Killed process 24570 (postgres) total-vm:156288kB, anon-rss:2312kB, file-rss:0kB, shmem-rss:61528kB, UID:1000 pgtables:272kB oom_score_adj:0
2022-06-02T12:18:08.292 app[017ee4ea] iad [info] keeper | 2022-06-02 12:18:08.287 UTC [8048] FATAL: could not receive data from WAL stream: server closed the connection unexpectedly
2022-06-02T12:18:08.297 app[8980e284] iad [info] keeper | 2022-06-02 12:18:08.284 UTC [722] LOG: background writer process (PID 24571) was terminated by signal 9: Killed
2022-06-02T12:18:08.297 app[8980e284] iad [info] keeper | 2022-06-02 12:18:08.284 UTC [722] LOG: terminating any other active server processes
2022-06-02T12:18:08.292 app[017ee4ea] iad [info] keeper | This probably means the server terminated abnormally
2022-06-02T12:18:08.292 app[017ee4ea] iad [info] keeper | before or while processing the request.
2022-06-02T12:18:08.312 app[8980e284] iad [info] keeper | 2022-06-02 12:18:08.312 UTC [23372] FATAL: the database system is in recovery mode
2022-06-02T12:18:08.313 app[017ee4ea] iad [info] keeper | 2022-06-02 12:18:08.313 UTC [4650] FATAL: could not connect to the primary server: connection to server at "fdaa:0:6931:a7b:775a:1:11c7:2", port 5433 failed: FATAL: the database system is in recovery mode
2022-06-02T12:18:08.347 app[8980e284] iad [info] keeper | 2022-06-02 12:18:08.341 UTC [23374] FATAL: the database system is in recovery mode
2022-06-02T12:18:08.359 app[8980e284] iad [info] keeper | 2022-06-02 12:18:08.358 UTC [23377] FATAL: the database system is in recovery mode
2022-06-02T12:18:08.629 app[8980e284] iad [info] keeper | 2022-06-02T12:18:08.625Z INFO	cmd/keeper.go:1504 our db requested role is master
2022-06-02T12:18:08.632 app[8980e284] iad [info] keeper | 2022-06-02 12:18:08.631 UTC [23384] FATAL: the database system is in recovery mode
2022-06-02T12:18:09.216 app[8980e284] iad [info] keeper | 2022-06-02 12:18:09.216 UTC [23385] FATAL: the database system is in recovery mode
2022-06-02T12:18:09.976 app[017ee4ea] iad [info] keeper | 2022-06-02T12:18:09.975Z INFO	cmd/keeper.go:1556 our db requested role is standby {"followedDB": "36870ebc"}
2022-06-02T12:18:09.976 app[017ee4ea] iad [info] keeper | 2022-06-02T12:18:09.975Z INFO	cmd/keeper.go:1575 already standby
2022-06-02T12:18:09.995 app[017ee4ea] iad [info] keeper | 2022-06-02T12:18:09.995Z INFO	cmd/keeper.go:1675 postgres parameters not changed
2022-06-02T12:18:09.996 app[017ee4ea] iad [info] keeper | 2022-06-02T12:18:09.995Z INFO	cmd/keeper.go:1702 postgres hba entries not changed
2022-06-02T12:18:13.303 app[017ee4ea] iad [info] keeper | 2022-06-02 12:18:13.302 UTC [4682] FATAL: could not connect to the primary server: connection to server at "fdaa:0:6931:a7b:775a:1:11c7:2", port 5433 failed: FATAL: the database system is in recovery mode
2022-06-02T12:18:13.719 app[8980e284] iad [info] keeper | 2022-06-02T12:18:13.718Z INFO	cmd/keeper.go:1504 our db requested role is master
2022-06-02T12:18:13.727 app[8980e284] iad [info] keeper | 2022-06-02T12:18:13.726Z ERROR	cmd/keeper.go:1021 failed to get replication slots {"error": "pq: the database system is in recovery mode"}
2022-06-02T12:18:14.231 app[8980e284] iad [info] keeper | 2022-06-02 12:18:14.230 UTC [23395] FATAL: the database system is in recovery mode
2022-06-02T12:18:14.976 app[8980e284] iad [info] keeper | 2022-06-02 12:18:14.975 UTC [23398] FATAL: the database system is in recovery mode
2022-06-02T12:18:15.069 app[017ee4ea] iad [info] keeper | 2022-06-02T12:18:15.069Z INFO	cmd/keeper.go:1556 our db requested role is standby {"followedDB": "36870ebc"}
2022-06-02T12:18:15.070 app[017ee4ea] iad [info] keeper | 2022-06-02T12:18:15.069Z INFO	cmd/keeper.go:1575 already standby
2022-06-02T12:18:15.095 app[017ee4ea] iad [info] keeper | 2022-06-02T12:18:15.094Z INFO	cmd/keeper.go:1675 postgres parameters not changed
2022-06-02T12:18:15.095 app[017ee4ea] iad [info] keeper | 2022-06-02T12:18:15.094Z INFO	cmd/keeper.go:1702 postgres hba entries not changed
2022-06-02T12:18:15.356 app[8980e284] iad [info] keeper | 2022-06-02 12:18:15.354 UTC [23400] FATAL: the database system is in recovery mode
2022-06-02T12:18:15.358 app[8980e284] iad [info] keeper | 2022-06-02 12:18:15.357 UTC [23402] FATAL: the database system is in recovery mode
2022-06-02T12:18:16.735 app[8980e284] iad [info] keeper | 2022-06-02 12:18:16.734 UTC [23403] FATAL: the database system is in recovery mode
2022-06-02T12:18:16.736 app[8980e284] iad [info] keeper | 2022-06-02T12:18:16.735Z ERROR	cmd/keeper.go:719 cannot get configured pg parameters {"error": "pq: the database system is in recovery mode"}
2022-06-02T12:18:18.304 app[8980e284] iad [info] keeper | 2022-06-02 12:18:18.303 UTC [23404] FATAL: the database system is in recovery mode
2022-06-02T12:18:18.304 app[017ee4ea] iad [info] keeper | 2022-06-02 12:18:18.303 UTC [4710] FATAL: could not connect to the primary server: connection to server at "fdaa:0:6931:a7b:775a:1:11c7:2", port 5433 failed: FATAL: the database system is in recovery mode
2022-06-02T12:18:18.781 app[8980e284] iad [info] keeper | 2022-06-02 12:18:18.780 UTC [23408] FATAL: the database system is in recovery mode
2022-06-02T12:18:18.835 app[8980e284] iad [info] keeper | 2022-06-02T12:18:18.834Z INFO	cmd/keeper.go:1504 our db requested role is master
2022-06-02T12:18:18.836 app[8980e284] iad [info] keeper | 2022-06-02T12:18:18.836Z INFO	cmd/keeper.go:1542 already master
2022-06-02T12:18:18.841 app[8980e284] iad [info] keeper | 2022-06-02 12:18:18.840 UTC [23412] FATAL: the database system is in recovery mode
2022-06-02T12:18:18.841 app[8980e284] iad [info] keeper | 2022-06-02T12:18:18.841Z ERROR	cmd/keeper.go:1021 failed to get replication slots {"error": "pq: the database system is in recovery mode"}
2022-06-02T12:18:18.841 app[8980e284] iad [info] keeper | 2022-06-02T12:18:18.841Z ERROR	cmd/keeper.go:1546 error updating replication slots {"error": "pq: the database system is in recovery mode"}
2022-06-02T12:18:19.242 app[8980e284] iad [info] keeper | 2022-06-02 12:18:19.241 UTC [23413] FATAL: the database system is in recovery mode
2022-06-02T12:18:19.243 app[8980e284] iad [info] keeper | 2022-06-02T12:18:19.242Z ERROR	cmd/keeper.go:719 cannot get configured pg parameters {"error": "pq: the database system is in recovery mode"}
2022-06-02T12:18:20.137 app[017ee4ea] iad [info] keeper | 2022-06-02T12:18:20.137Z INFO	cmd/keeper.go:1556 our db requested role is standby {"followedDB": "36870ebc"}
2022-06-02T12:18:20.137 app[017ee4ea] iad [info] keeper | 2022-06-02T12:18:20.137Z INFO	cmd/keeper.go:1575 already standby
2022-06-02T12:18:20.157 app[017ee4ea] iad [info] keeper | 2022-06-02T12:18:20.156Z INFO	cmd/keeper.go:1675 postgres parameters not changed
2022-06-02T12:18:20.157 app[017ee4ea] iad [info] keeper | 2022-06-02T12:18:20.157Z INFO	cmd/keeper.go:1702 postgres hba entries not changed
2022-06-02T12:18:21.746 app[8980e284] iad [info] keeper | 2022-06-02 12:18:21.745 UTC [23415] FATAL: the database system is in recovery mode
2022-06-02T12:18:21.747 app[8980e284] iad [info] keeper | 2022-06-02T12:18:21.746Z ERROR	cmd/keeper.go:719 cannot get configured pg parameters {"error": "pq: the database system is in recovery mode"}
2022-06-02T12:18:22.358 app[8980e284] iad [info] keeper | 2022-06-02 12:18:22.357 UTC [23417] FATAL: the database system is in recovery mode
2022-06-02T12:18:22.362 app[8980e284] iad [info] keeper | 2022-06-02 12:18:22.361 UTC [23419] FATAL: the database system is in recovery mode
2022-06-02T12:18:23.310 app[8980e284] iad [info] keeper | 2022-06-02 12:18:23.308 UTC [23420] FATAL: the database system is in recovery mode
2022-06-02T12:18:23.310 app[017ee4ea] iad [info] keeper | 2022-06-02 12:18:23.310 UTC [4744] FATAL: could not connect to the primary server: connection to server at "fdaa:0:6931:a7b:775a:1:11c7:2", port 5433 failed: FATAL: the database system is in recovery mode
2022-06-02T12:18:23.946 app[8980e284] iad [info] keeper | 2022-06-02T12:18:23.946Z INFO	cmd/keeper.go:1504 our db requested role is master
2022-06-02T12:18:23.947 app[8980e284] iad [info] keeper | 2022-06-02T12:18:23.947Z INFO	cmd/keeper.go:1542 already master
2022-06-02T12:18:23.948 app[8980e284] iad [info] keeper | 2022-06-02 12:18:23.947 UTC [23426] FATAL: the database system is in recovery mode
2022-06-02T12:18:23.949 app[8980e284] iad [info] keeper | 2022-06-02T12:18:23.949Z ERROR	cmd/keeper.go:1021 failed to get replication slots {"error": "pq: the database system is in recovery mode"}
2022-06-02T12:18:23.950 app[8980e284] iad [info] keeper | 2022-06-02T12:18:23.949Z ERROR	cmd/keeper.go:1546 error updating replication slots {"error": "pq: the database system is in recovery mode"}
2022-06-02T12:18:24.252 app[8980e284] iad [info] keeper | 2022-06-02 12:18:24.251 UTC [23427] FATAL: the database system is in recovery mode
2022-06-02T12:18:24.252 app[8980e284] iad [info] keeper | 2022-06-02T12:18:24.252Z ERROR	cmd/keeper.go:719 cannot get configured pg parameters {"error": "pq: the database system is in recovery mode"}
2022-06-02T12:18:25.203 app[017ee4ea] iad [info] keeper | 2022-06-02T12:18:25.202Z INFO	cmd/keeper.go:1556 our db requested role is standby {"followedDB": "36870ebc"}
2022-06-02T12:18:25.203 app[017ee4ea] iad [info] keeper | 2022-06-02T12:18:25.203Z INFO	cmd/keeper.go:1575 already standby
2022-06-02T12:18:25.227 app[017ee4ea] iad [info] keeper | 2022-06-02T12:18:25.227Z INFO	cmd/keeper.go:1675 postgres parameters not changed
2022-06-02T12:18:25.227 app[017ee4ea] iad [info] keeper | 2022-06-02T12:18:25.227Z INFO	cmd/keeper.go:1702 postgres hba entries not changed
2022-06-02T12:18:26.758 app[8980e284] iad [info] keeper | 2022-06-02 12:18:26.757 UTC [23429] FATAL: the database system is in recovery mode
2022-06-02T12:18:26.759 app[8980e284] iad [info] keeper | 2022-06-02T12:18:26.758Z ERROR	cmd/keeper.go:719 cannot get configured pg parameters {"error": "pq: the database system is in recovery mode"}
2022-06-02T12:18:27.352 app[017ee4ea] iad [info] proxy | [WARNING] 152/121827 (564) : Server bk_db/pg1 is DOWN, reason: Layer7 timeout, check duration: 5001ms. 0 active and 0 backup servers left. 1 sessions active, 0 requeued, 0 remaining in queue.
2022-06-02T12:18:27.352 app[017ee4ea] iad [info] proxy | [ALERT] 152/121827 (564) : backend 'bk_db' has no server available!
2022-06-02T12:18:27.360 app[8980e284] iad [info] proxy | [WARNING] 152/121827 (565) : Server bk_db/pg1 is DOWN, reason: Layer7 timeout, check duration: 5001ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2022-06-02T12:18:27.360 app[8980e284] iad [info] proxy | [ALERT] 152/121827 (565) : backend 'bk_db' has no server available!
2022-06-02T12:18:27.815 app[8980e284] iad [info] exporter | INFO[66990] Established new database connection to "fdaa:0:6931:a7b:775a:1:11c7:2:5433". source="postgres_exporter.go:970"
2022-06-02T12:18:27.818 app[8980e284] iad [info] keeper | 2022-06-02 12:18:27.817 UTC [23430] FATAL: the database system is in recovery mode
2022-06-02T12:18:28.312 app[017ee4ea] iad [info] keeper | 2022-06-02 12:18:28.312 UTC [4772] FATAL: could not connect to the primary server: connection to server at "fdaa:0:6931:a7b:775a:1:11c7:2", port 5433 failed: FATAL: the database system is in recovery mode
2022-06-02T12:18:28.312 app[8980e284] iad [info] keeper | 2022-06-02 12:18:28.311 UTC [23431] FATAL: the database system is in recovery mode
2022-06-02T12:18:28.823 app[8980e284] iad [info] exporter | ERRO[66991] Error opening connection to database (postgresql://flypgadmin:PASSWORD_REMOVED@[fdaa:0:6931:a7b:775a:1:11c7:2]:5433/postgres?sslmode=disable): pq: the database system is in recovery mode source="postgres_exporter.go:1658"
2022-06-02T12:18:29.038 app[8980e284] iad [info] keeper | 2022-06-02T12:18:29.037Z INFO	cmd/keeper.go:1504 our db requested role is master
2022-06-02T12:18:29.043 app[8980e284] iad [info] keeper | 2022-06-02T12:18:29.042Z INFO	cmd/keeper.go:1542 already master
2022-06-02T12:18:29.047 app[8980e284] iad [info] keeper | 2022-06-02 12:18:29.045 UTC [23437] FATAL: the database system is in recovery mode
2022-06-02T12:18:29.047 app[8980e284] iad [info] keeper | 2022-06-02T12:18:29.046Z ERROR	cmd/keeper.go:1021 failed to get replication slots {"error": "pq: the database system is in recovery mode"}
2022-06-02T12:18:29.047 app[8980e284] iad [info] keeper | 2022-06-02T12:18:29.046Z ERROR	cmd/keeper.go:1546 error updating replication slots {"error": "pq: the database system is in recovery mode"}
2022-06-02T12:18:29.265 app[8980e284] iad [info] keeper | 2022-06-02 12:18:29.263 UTC [23438] FATAL: the database system is in recovery mode
2022-06-02T12:18:29.265 app[8980e284] iad [info] keeper | 2022-06-02T12:18:29.264Z ERROR	cmd/keeper.go:719 cannot get configured pg parameters {"error": "pq: the database system is in recovery mode"}
2022-06-02T12:18:29.365 app[8980e284] iad [info] keeper | 2022-06-02 12:18:29.363 UTC [23440] FATAL: the database system is in recovery mode
2022-06-02T12:18:29.366 app[8980e284] iad [info] keeper | 2022-06-02 12:18:29.365 UTC [23442] FATAL: the database system is in recovery mode
2022-06-02T12:18:30.277 app[017ee4ea] iad [info] keeper | 2022-06-02T12:18:30.276Z INFO	cmd/keeper.go:1556 our db requested role is standby {"followedDB": "36870ebc"}
2022-06-02T12:18:30.277 app[017ee4ea] iad [info] keeper | 2022-06-02T12:18:30.276Z INFO	cmd/keeper.go:1575 already standby
2022-06-02T12:18:30.301 app[017ee4ea] iad [info] keeper | 2022-06-02T12:18:30.301Z INFO	cmd/keeper.go:1675 postgres parameters not changed
2022-06-02T12:18:30.302 app[017ee4ea] iad [info] keeper | 2022-06-02T12:18:30.301Z INFO	cmd/keeper.go:1702 postgres hba entries not changed
2022-06-02T12:18:31.770 app[8980e284] iad [info] keeper | 2022-06-02 12:18:31.768 UTC [23444] FATAL: the database system is in recovery mode
2022-06-02T12:18:31.770 app[8980e284] iad [info] keeper | 2022-06-02T12:18:31.769Z ERROR	cmd/keeper.go:719 cannot get configured pg parameters {"error": "pq: the database system is in recovery mode"}
2022-06-02T12:18:32.503 app[8980e284] iad [info] sentinel | 2022-06-02T12:18:32.493Z INFO	cmd/sentinel.go:995 master db is failed {"db": "36870ebc", "keeper": "775a111c72"}
2022-06-02T12:18:32.503 app[8980e284] iad [info] sentinel | 2022-06-02T12:18:32.502Z INFO	cmd/sentinel.go:1006 trying to find a new master to replace failed master
2022-06-02T12:18:32.504 app[8980e284] iad [info] sentinel | 2022-06-02T12:18:32.504Z INFO	cmd/sentinel.go:1032 electing db as the new master {"db": "f3e7a294", "keeper": "775b111c82"}
2022-06-02T12:18:33.316 app[8980e284] iad [info] keeper | 2022-06-02 12:18:33.315 UTC [23445] FATAL: the database system is in recovery mode
2022-06-02T12:18:33.316 app[017ee4ea] iad [info] keeper | 2022-06-02 12:18:33.316 UTC [4804] FATAL: could not connect to the primary server: connection to server at "fdaa:0:6931:a7b:775a:1:11c7:2", port 5433 failed: FATAL: the database system is in recovery mode
2022-06-02T12:18:34.139 app[8980e284] iad [info] keeper | 2022-06-02T12:18:34.138Z INFO	cmd/keeper.go:1504 our db requested role is master
2022-06-02T12:18:34.144 app[8980e284] iad [info] keeper | 2022-06-02T12:18:34.143Z INFO	cmd/keeper.go:1542 already master
2022-06-02T12:18:34.146 app[8980e284] iad [info] keeper | 2022-06-02 12:18:34.145 UTC [23451] FATAL: the database system is in recovery mode
2022-06-02T12:18:34.146 app[8980e284] iad [info] keeper | 2022-06-02T12:18:34.146Z ERROR	cmd/keeper.go:1546 error updating replication slots {"error": "pq: the database system is in recovery mode"
2022-06-02T12:18:34.146 app[8980e284] iad [info] keeper | 2022-06-02T12:18:34.146Z ERROR	cmd/keeper.go:1546 error updating replication slots {"error": "pq: the database system is in recovery mode"
2022-06-02T12:18:34.277 app[8980e284] iad [info] keeper | 2022-06-02 12:18:34.276 UTC [23452] FATAL: the database system is in recovery mode
2022-06-02T12:18:34.278 app[8980e284] iad [info] keeper | 2022-06-02T12:18:34.277Z ERROR	cmd/keeper.go:719 cannot get configured pg parameters {"error": "pq: the database system is in recovery mode"}
2022-06-02T12:18:34.984 app[8980e284] iad [info] keeper | 2022-06-02 12:18:34.983 UTC [23455] FATAL: the database system is in recovery mode
2022-06-02T12:18:35.351 app[017ee4ea] iad [info] keeper | 2022-06-02T12:18:35.351Z INFO	cmd/keeper.go:1504 our db requested role is master
2022-06-02T12:18:35.352 app[017ee4ea] iad [info] keeper | 2022-06-02T12:18:35.352Z INFO	cmd/keeper.go:1536 promoting to master
2022-06-02T12:18:35.353 app[017ee4ea] iad [info] keeper | 2022-06-02T12:18:35.352Z INFO	postgresql/postgresql.go:532 promoting database
2022-06-02T12:18:35.354 app[017ee4ea] iad [info] keeper | 2022-06-02 12:18:35.353 UTC [8044] LOG: received promote request
2022-06-02T12:18:35.354 app[017ee4ea] iad [info] keeper | 2022-06-02 12:18:35.353 UTC [8044] LOG: redo done at 6/53FB9FA8 system usage: CPU: user: 24.18 s, system: 82.45 s, elapsed: 4882.90 s
2022-06-02T12:18:35.354 app[017ee4ea] iad [info] keeper | 2022-06-02 12:18:35.353 UTC [8044] LOG: last completed transaction was at log time 2022-06-02 12:18:07.717506+00
2022-06-02T12:18:35.357 app[017ee4ea] iad [info] keeper | waiting for server to promote....2022-06-02 12:18:35.357 UTC [8044] LOG: selected new timeline ID: 3
2022-06-02T12:18:35.393 app[017ee4ea] iad [info] keeper | 2022-06-02 12:18:35.392 UTC [8044] LOG: archive recovery complete
2022-06-02T12:18:35.426 app[017ee4ea] iad [info] keeper | 2022-06-02 12:18:35.425 UTC [8043] LOG: database system is ready to accept connections
2022-06-02T12:18:35.455 app[017ee4ea] iad [info] keeper | done
2022-06-02T12:18:35.455 app[017ee4ea] iad [info] keeper | server promoted
2022-06-02T12:18:35.506 app[017ee4ea] iad [info] keeper | 2022-06-02T12:18:35.505Z INFO	cmd/keeper.go:1675 postgres parameters not changed
2022-06-02T12:18:35.506 app[017ee4ea] iad [info] keeper | 2022-06-02T12:18:35.506Z INFO	cmd/keeper.go:1702 postgres hba entries not changed
2022-06-02T12:18:36.368 app[8980e284] iad [info] keeper | 2022-06-02 12:18:36.367 UTC [23457] FATAL: the database system is in recovery mode
2022-06-02T12:18:36.369 app[8980e284] iad [info] keeper | 2022-06-02 12:18:36.369 UTC [23459] FATAL: the database system is in recovery mode
2022-06-02T12:18:36.784 app[8980e284] iad [info] keeper | 2022-06-02 12:18:36.783 UTC [23460] FATAL: the database system is in recovery mode
2022-06-02T12:18:36.785 app[8980e284] iad [info] keeper | 2022-06-02T12:18:36.784Z ERROR	cmd/keeper.go:719 cannot get configured pg parameters {"error": "pq: the database system is in recovery mode"}
2022-06-02T12:18:39.015 app[8980e284] iad [info] proxy | [WARNING] 152/121839 (565) : Server bk_db/pg2 is UP, reason: Layer7 check passed, code: 200, check duration: 16ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
2022-06-02T12:18:39.016 app[017ee4ea] iad [info] proxy | [WARNING] 152/121839 (564) : Server bk_db/pg2 is UP, reason: Layer7 check passed, code: 200, check duration: 17ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
2022-06-02T12:18:39.279 app[8980e284] iad [info] keeper | 2022-06-02T12:18:39.276Z INFO	cmd/keeper.go:1504 our db requested role is master
2022-06-02T12:18:39.285 app[8980e284] iad [info] keeper | 2022-06-02T12:18:39.284Z INFO	cmd/keeper.go:1542 already master
2022-06-02T12:18:39.288 app[8980e284] iad [info] keeper | 2022-06-02 12:18:39.287 UTC [23466] FATAL: the database system is in recovery mode
2022-06-02T12:18:39.289 app[8980e284] iad [info] keeper | 2022-06-02T12:18:39.289Z ERROR	cmd/keeper.go:1021 failed to get replication slots {"error": "pq: the database system is in recovery mode"}
2022-06-02T12:18:39.289 app[8980e284] iad [info] keeper | 2022-06-02T12:18:39.289Z ERROR	cmd/keeper.go:1546 error updating replication slots {"error": "pq: the database system is in recovery mode"}
2022-06-02T12:18:39.290 app[8980e284] iad [info] keeper | 2022-06-02 12:18:39.289 UTC [23467] FATAL: the database system is in recovery mode
2022-06-02T12:18:39.292 app[8980e284] iad [info] keeper | 2022-06-02T12:18:39.290Z ERROR	cmd/keeper.go:719 cannot get configured pg parameters {"error": "pq: the database system is in recovery mode"}
2022-06-02T12:18:40.558 app[017ee4ea] iad [info] keeper | 2022-06-02T12:18:40.557Z INFO	cmd/keeper.go:1504 our db requested role is master
2022-06-02T12:18:40.559 app[017ee4ea] iad [info] keeper | 2022-06-02T12:18:40.558Z INFO	cmd/keeper.go:1542 already master
2022-06-02T12:18:40.585 app[017ee4ea] iad [info] keeper | 2022-06-02T12:18:40.584Z INFO	cmd/keeper.go:1675 postgres parameters not changed
2022-06-02T12:18:40.585 app[017ee4ea] iad [info] keeper | 2022-06-02T12:18:40.585Z INFO	cmd/keeper.go:1702 postgres hba entries not changed
2022-06-02T12:18:41.796 app[8980e284] iad [info] keeper | 2022-06-02 12:18:41.795 UTC [23469] FATAL: the database system is in recovery mode
2022-06-02T12:18:41.797 app[8980e284] iad [info] keeper | 2022-06-02T12:18:41.796Z ERROR	cmd/keeper.go:719 cannot get configured pg parameters {"error": "pq: the database system is in recovery mode"}
2022-06-02T12:18:42.702 app[8980e284] iad [info] sentinel | 2022-06-02T12:18:42.697Z INFO	cmd/sentinel.go:1151 removing old master db {"db": "36870ebc", "keeper": "775a111c72"}
2022-06-02T12:18:42.702 app[8980e284] iad [info] sentinel | 2022-06-02T12:18:42.701Z INFO	cmd/sentinel.go:1503 added new standby db {"db": "47b2ae72", "keeper": "775a111c72"}
2022-06-02T12:18:42.813 app[8980e284] iad [info] exporter | INFO[67005] Established new database connection to "fdaa:0:6931:a7b:775a:1:11c7:2:5433". source="postgres_exporter.go:970" 

I’m adding more RAM and attempting the import again, will update with my findings.

Update: Added more RAM, still failing. Adding even more RAM.

Port 5433 connects directly to postgres, bypassing the built in proxy. This will connect you to the leader or a replica. It seems like the queries you’re running are running both nodes out of memory.

More memory is definitely the fix!

Yup, insufficient memory did end up being the issue. I initially assumed the root cause to be related to intermittent connection issues, because I got the order of events wrong.

1 Like

Nice! I appreciate you reading the logs and digging into this, good find.

1 Like