Consistently getting disconnected from Postgres over WireGuard

I’m restoring a large db dump from my local machine to my Fly Postgres cluster. Every time, I get disconnected before it completes:

pg_restore: error: error returned by PQputCopyData: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

In this case it was after about 7 minutes.

Tried again, and got disconnected again after 17 minutes.

Between this and the remote builder issues, it seems like there’s some weirdness between your local network and ours.

First off, when postgres disconnects, is wireguard still showing connected?

Second, what region are you using for wireguard? And can you see what FLY_REGION shows on debug.fly.dev?

To connect, I ran wg-quick up with the config file from fly wireguard create. FLY_REGION=dfw, while the gateway is in iad. I’m not sure where to check whether WireGuard is still connected – wg-quick just goes into the background. I definitely am not manually reconnecting, though. Do you know where I should look for logs?

wg show should give you some information.

It might be worth running a ping against a Postgres IP while you’re doing this. flyctl ips private -a <postgres-cluster-name> will give you an IP, and then ping6 <ip>. If WireGuard is flapping, those pings will fail at some point. If the pings don’t fail, there’s likely something else up (like Postgres instances restarting or something).
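Here’s a rough sketch of what I mean, if you want something that timestamps the failures so you can line them up with when pg_restore dies. The IP is a placeholder (use the one flyctl prints), and the -W timeout flag assumes Linux ping6:

# log a timestamp every time a ping to the Postgres instance fails
# (substitute the IP from `flyctl ips private -a <postgres-cluster-name>`)
while true; do
  ping6 -c 1 -W 2 <ip> > /dev/null 2>&1 || echo "$(date -u +%FT%TZ) ping failed"
  sleep 1
done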

interface: foo
public key: M+kIXLaqvuxMdIBC9fKXTDWaNuQ099aKxFWy0t07b28=
private key: (hidden)
listening port: 56380

peer: R2in3C4C5I1AVyoSrmsOgSkhPDKAegwUg6zwkLrhryk=
endpoint: 104.225.8.204:51820
allowed ips: fdaa:0:1272::/48
latest handshake: 1 minute, 56 seconds ago
transfer: 73.00 MiB received, 2.42 GiB sent
persistent keepalive: every 15 seconds

I don’t think the db is getting restarted either; I’ve been keeping an eye on those logs and didn’t see anything.

I did notice this ‘failing’ health check:

2021-03-18T21:41:38.403Z d6101e99 dfw [info] [HEALTHCHECK] [vm: failing] [✗] system spent 6.2 of the last 10 seconds waiting for cpu
[✓] 138.7 GB (94.6%) free space on /data/
[✓] load averages: 0.09 0.11 0.24
[✓] memory: 0.0s waiting over the last 60s
[✓] io: 0.1s waiting over the last 60s

ahh, here we go:

2021-03-18T21:41:38.403Z d6101e99 dfw [info] [HEALTHCHECK] [vm: failing] [✗] system spent 6.2 of the last 10 seconds waiting for cpu
[✓] 138.7 GB (94.6%) free space on /data/
[✓] load averages: 0.09 0.11 0.24
[✓] memory: 0.0s waiting over the last 60s
[✓] io: 0.1s waiting over the last 60s

2021-03-18T21:46:19.201Z d6101e99 dfw [info] Shutting down virtual machine
2021-03-18T21:46:19.253Z d6101e99 dfw [info] Sending signal SIGTERM to main child process w/ PID 511
2021-03-18T21:46:19.266Z d6101e99 dfw [info] Reaped child process with pid: 562 and signal: SIGHUP, core dumped? false
2021-03-18T21:46:19.267Z d6101e99 dfw [info] keeper            | Interrupting...
2021-03-18T21:46:19.268Z d6101e99 dfw [info] sentinel          | Interrupting...
2021-03-18T21:46:19.269Z d6101e99 dfw [info] proxy             | Interrupting...
2021-03-18T21:46:19.270Z d6101e99 dfw [info] postgres_exporter | Interrupting...
2021-03-18T21:46:19.288Z d6101e99 dfw [info] sentinel          | 2021-03-18T21:46:18.980Z	INFO	cmd/sentinel.go:1816	stopping stolon sentinel
2021-03-18T21:46:19.352Z d6101e99 dfw [info] sentinel          | Exited
2021-03-18T21:46:19.353Z d6101e99 dfw [info] proxy             | Exited
2021-03-18T21:46:19.353Z d6101e99 dfw [info] postgres_exporter | Exited
2021-03-18T21:46:19.435Z 1a3f2306 dfw [info] keeper            | 2021-03-18 21:46:19.263 UTC [617] LOG:  replication terminated by primary server
2021-03-18T21:46:19.437Z 1a3f2306 dfw [info] keeper            | 2021-03-18 21:46:19.263 UTC [617] DETAIL:  End of WAL reached on timeline 2 at 0/BE603A70.
2021-03-18T21:46:19.445Z 1a3f2306 dfw [info] keeper            | 2021-03-18 21:46:19.272 UTC [617] FATAL:  could not send end-of-streaming message to primary: no COPY in progress
2021-03-18T21:46:19.447Z 1a3f2306 dfw [info] keeper            | 2021-03-18 21:46:19.273 UTC [606] LOG:  invalid record length at 0/BE603A70: wanted 24, got 0
2021-03-18T21:46:19.452Z 1a3f2306 dfw [info] keeper            | 2021-03-18 21:46:19.282 UTC [9231] FATAL:  could not connect to the primary server: could not connect to server: Connection refused
2021-03-18T21:46:19.453Z 1a3f2306 dfw [info] keeper            | 		Is the server running on host "fdaa:0:1272:a7b:7b:0:12e0:2" and accepting
2021-03-18T21:46:19.455Z 1a3f2306 dfw [info] keeper            | 		TCP/IP connections on port 5433?
2021-03-18T21:46:20.269Z d6101e99 dfw [info] Reaped child process with pid: 15381 and signal: SIGPIPE, core dumped? false
2021-03-18T21:46:20.271Z d6101e99 dfw [info] Reaped child process with pid: 616, exit code: 0
2021-03-18T21:46:20.359Z d6101e99 dfw [info] keeper            | Exited
2021-03-18T21:46:22.275Z d6101e99 dfw [info] Main child exited normally with code: 0
2021-03-18T21:46:22.277Z d6101e99 dfw [info] Reaped child process with pid: 559, exit code: 0
2021-03-18T21:46:22.294Z d6101e99 dfw [info] Umounting /dev/vdc from /data
2021-03-18T21:46:25.132Z d6101e99 dfw [info] Starting instance
2021-03-18T21:46:25.173Z d6101e99 dfw [info] Configuring virtual machine
2021-03-18T21:46:25.175Z d6101e99 dfw [info] Pulling container image
2021-03-18T21:46:25.648Z d6101e99 dfw [info] Unpacking image
2021-03-18T21:46:25.663Z d6101e99 dfw [info] Preparing kernel init
2021-03-18T21:46:25.832Z d6101e99 dfw [info] Setting up volume 'pg_data'
2021-03-18T21:46:26.142Z d6101e99 dfw [info] Configuring firecracker
2021-03-18T21:46:26.189Z d6101e99 dfw [info] Starting virtual machine
2021-03-18T21:46:26.403Z d6101e99 dfw [info] Starting init (commit: c1dec69)...
2021-03-18T21:46:26.423Z d6101e99 dfw [info] Mounting /dev/vdc at /data
2021-03-18T21:46:26.430Z d6101e99 dfw [info] Running: `docker-entrypoint.sh /fly/start.sh` as root
2021-03-18T21:46:26.453Z d6101e99 dfw [info] 2021/03/18 21:46:26 listening on [fdaa:0:1272:a7b:7b:0:12e0:2]:22 (DNS: [fdaa::3]:53)
2021-03-18T21:46:26.601Z d6101e99 dfw [info] system            | Tmux socket name: overmind-fly-ekdPm1GwWl0sQQu1uX2e2H
2021-03-18T21:46:26.602Z d6101e99 dfw [info] system            | Tmux session ID: fly
2021-03-18T21:46:26.603Z d6101e99 dfw [info] system            | Listening at ./.overmind.sock
2021-03-18T21:46:26.703Z d6101e99 dfw [info] sentinel          | Started with pid 559...
2021-03-18T21:46:26.704Z d6101e99 dfw [info] proxy             | Started with pid 561...
2021-03-18T21:46:26.705Z d6101e99 dfw [info] postgres_exporter | Started with pid 564...
2021-03-18T21:46:26.706Z d6101e99 dfw [info] keeper            | Started with pid 556...
2021-03-18T21:46:26.782Z d6101e99 dfw [info] postgres_exporter | INFO[0000] Starting Server: :9187                        source="postgres_exporter.go:1825"
2021-03-18T21:46:26.932Z d6101e99 dfw [info] sentinel          | 2021-03-18T21:46:26.915Z	INFO	cmd/sentinel.go:2000	sentinel uid	{"id": "f1ddc9d1"}
2021-03-18T21:46:28.544Z 1a3f2306 dfw [info] keeper            | 2021-03-18 21:46:28.369 UTC [9251] FATAL:  could not connect to the primary server: could not connect to server: Connection refused
2021-03-18T21:46:28.548Z 1a3f2306 dfw [info] keeper            | 		Is the server running on host "fdaa:0:1272:a7b:7b:0:12e0:2" and accepting
2021-03-18T21:46:28.549Z 1a3f2306 dfw [info] keeper            | 		TCP/IP connections on port 5433?
2021-03-18T21:46:28.903Z d6101e99 dfw [info] sentinel          | 2021-03-18T21:46:28.887Z	INFO	cmd/sentinel.go:82	Trying to acquire sentinels leadership
2021-03-18T21:46:28.911Z d6101e99 dfw [info] keeper            | 2021-03-18T21:46:28.895Z	ERROR	cmd/keeper.go:688	cannot get configured pg parameters	{"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2021-03-18T21:46:29.455Z 1a3f2306 dfw [info] keeper            | 2021-03-18 21:46:29.283 UTC [9273] FATAL:  could not connect to the primary server: could not connect to server: Connection refused
2021-03-18T21:46:29.457Z 1a3f2306 dfw [info] keeper            | 		Is the server running on host "fdaa:0:1272:a7b:7b:0:12e0:2" and accepting
2021-03-18T21:46:29.458Z 1a3f2306 dfw [info] keeper            | 		TCP/IP connections on port 5433?
2021-03-18T21:46:31.421Z d6101e99 dfw [info] keeper            | 2021-03-18T21:46:31.400Z	ERROR	cmd/keeper.go:688	cannot get configured pg parameters	{"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2021-03-18T21:46:33.922Z d6101e99 dfw [info] keeper            | 2021-03-18T21:46:33.901Z	ERROR	cmd/keeper.go:688	cannot get configured pg parameters	{"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2021-03-18T21:46:34.451Z 1a3f2306 dfw [info] keeper            | 2021-03-18 21:46:34.282 UTC [9303] FATAL:  could not connect to the primary server: could not connect to server: Connection refused
2021-03-18T21:46:34.453Z 1a3f2306 dfw [info] keeper            | 		Is the server running on host "fdaa:0:1272:a7b:7b:0:12e0:2" and accepting
2021-03-18T21:46:34.454Z 1a3f2306 dfw [info] keeper            | 		TCP/IP connections on port 5433?
2021-03-18T21:46:35.065Z 1a3f2306 dfw [info] sentinel          | 2021-03-18T21:46:34.893Z	INFO	cmd/sentinel.go:89	sentinel leadership acquired
2021-03-18T21:46:36.422Z d6101e99 dfw [info] keeper            | 2021-03-18T21:46:36.402Z	ERROR	cmd/keeper.go:688	cannot get configured pg parameters	{"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2021-03-18T21:46:38.925Z d6101e99 dfw [info] keeper            | 2021-03-18T21:46:38.903Z	ERROR	cmd/keeper.go:688	cannot get configured pg parameters	{"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2021-03-18T21:46:39.457Z 1a3f2306 dfw [info] keeper            | 2021-03-18 21:46:39.285 UTC [9321] FATAL:  could not connect to the primary server: could not connect to server: Connection refused
2021-03-18T21:46:39.460Z 1a3f2306 dfw [info] keeper            | 		Is the server running on host "fdaa:0:1272:a7b:7b:0:12e0:2" and accepting
2021-03-18T21:46:39.461Z 1a3f2306 dfw [info] keeper            | 		TCP/IP connections on port 5433?
2021-03-18T21:46:41.425Z d6101e99 dfw [info] keeper            | 2021-03-18T21:46:41.404Z	ERROR	cmd/keeper.go:688	cannot get configured pg parameters	{"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2021-03-18T21:46:43.926Z d6101e99 dfw [info] keeper            | 2021-03-18T21:46:43.905Z	ERROR	cmd/keeper.go:688	cannot get configured pg parameters	{"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2021-03-18T21:46:44.190Z d6101e99 dfw [info] keeper            | 2021-03-18 21:46:44.169 UTC [614] LOG:  starting PostgreSQL 12.5 (Debian 12.5-1.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
2021-03-18T21:46:44.193Z d6101e99 dfw [info] keeper            | 2021-03-18 21:46:44.176 UTC [614] LOG:  listening on IPv6 address "fdaa:0:1272:a7b:7b:0:12e0:2", port 5433
2021-03-18T21:46:44.197Z d6101e99 dfw [info] keeper            | 2021-03-18 21:46:44.181 UTC [614] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5433"
2021-03-18T21:46:44.224Z d6101e99 dfw [info] keeper            | 2021-03-18 21:46:44.208 UTC [615] LOG:  database system was shut down at 2021-03-18 21:46:19 UTC
2021-03-18T21:46:44.233Z d6101e99 dfw [info] keeper            | 2021-03-18 21:46:44.218 UTC [614] LOG:  database system is ready to accept connections
2021-03-18T21:46:44.466Z 1a3f2306 dfw [info] keeper            | 2021-03-18 21:46:44.295 UTC [9343] LOG:  started streaming WAL from primary at 0/BE000000 on timeline 2
2021-03-18T21:47:00.439Z d6101e99 dfw [info] postgres_exporter | INFO[0033] Established new database connection to "fdaa:0:1272:a7b:7b:0:12e0:2:5433".  source="postgres_exporter.go:968"
2021-03-18T21:47:00.451Z d6101e99 dfw [info] postgres_exporter | INFO[0033] Semantic Version Changed on "fdaa:0:1272:a7b:7b:0:12e0:2:5433": 0.0.0 -> 12.5.0  source="postgres_exporter.go:1527"
2021-03-18T21:47:00.481Z d6101e99 dfw [info] postgres_exporter | INFO[0033] Established new database connection to "fdaa:0:1272:a7b:7b:0:12e0:2:5433".  source="postgres_exporter.go:968"
2021-03-18T21:47:00.503Z d6101e99 dfw [info] postgres_exporter | INFO[0033] Semantic Version Changed on "fdaa:0:1272:a7b:7b:0:12e0:2:5433": 0.0.0 -> 12.5.0  source="postgres_exporter.go:1527"
2021-03-18T21:47:08.492Z d6101e99 dfw [info] [HEALTHCHECK] [vm: passing] [✓] 138.58 GB (94.5%) free space on /data/
[✓] load averages: 0.00 0.00 0.00
[✓] memory: 0.0s waiting over the last 60s
[✓] cpu: 0.3s waiting over the last 60s
[✓] io: 0.7s waiting over the last 60s

This error means there’s not enough CPU for the import:

[✗] system spent 6.2 of the last 10 seconds waiting for cpu

It’s likely that Postgres is just crumbling under the import load. You can upgrade to larger VMs with flyctl scale vm <size>.
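For example, something along these lines (the exact size names depend on what flyctl platform vm-sizes lists, and -a points the command at the Postgres app):

# list the VM sizes available, then scale the Postgres cluster up
flyctl platform vm-sizes
flyctl scale vm dedicated-cpu-1x -a <postgres-cluster-name>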

You should also connect to the primary node on port 5433 for imports like this. Port 5432 is a proxy that always routes you to the primary, which is typically fine, but during an import it’s likely eating up more resources than you’d want.
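Something like this is what I’d try for the restore itself, assuming a custom-format dump, the standard <app>.internal private DNS name over WireGuard, and the default postgres superuser; adjust the names to match your setup:

# restore directly against the primary on 5433, skipping the 5432 proxy
pg_restore --verbose --no-owner \
  --host=<postgres-cluster-name>.internal --port=5433 \
  --username=postgres --dbname=<your-db> \
  your_dump.pgsql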

It’s also worth configuring alerts for yourself; you can get health check notifications piped into Slack or PagerDuty.