2024-04-22T11:09:23.341 proxy[5683d922f4dd28] ams [error] [PP02] could not proxy TCP data to/from instance: failed to copy (direction=client->server, error=Transport endpoint is not connected (os error 107))
2024-04-22T11:09:48.233 app[5683d922f4dd28] ams [info] monitor | [WARN] Failed to restart haproxy on member fdaa:2:1180:a7b:39:8e69:7332:2: Get "http://[fdaa:2:1180:a7b:39:8e69:7332:2]:5500/commands/admin/haproxy/restart": dial tcp [fdaa:2:1180:a7b:39:8e69:7332:2]:5500: i/o timeout
2024-04-22T11:09:48.233 app[5683d922f4dd28] ams [info] monitor | clusterStateMonitorTick failed with: primary has been quarantined: unable to confirm we are the true primary
2024-04-22T11:09:49.151 app[5683d922f4dd28] ams [info] failed post-init: unrecoverable zombie. Retrying...
2024-04-22T11:09:49.151 app[5683d922f4dd28] ams [info] [ERROR] Manual intervention required.
2024-04-22T11:09:49.151 app[5683d922f4dd28] ams [info] [ERROR] If a new primary has been established, consider adding a new replica with `fly machines clone <primary-machine-id>` and then remove this member.
When I connect to the app via SSH I can see a zombie.lock, which resolves the issue for a couple of seconds. Afterwards it ends up in the same state again.
I think my db app crashed because it ran out of memory, but even though I raised the memory limit (which was never full anyway) it is still not working. Any help is much appreciated.
I’m also receiving the same proxy error from my Fly cluster. I noticed that PP02 is not documented here: Fly.io Error Codes · Fly Docs
My own investigation suggests that PP02 occurs when pg clients connected to a Postgres cluster are forcefully disconnected by Fly’s Postgres proxy after 30 minutes of idle time.
Clients that expect > 30 minutes of idle time should be resilient to server-side disconnects.
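As a rough illustration of what "resilient to server-side disconnects" could look like on the client side, here is a minimal Python sketch. It is not tied to any particular Postgres driver: `run_with_reconnect`, `connect`, and `operation` are hypothetical names, and the exception types to retry on are an assumption that you would replace with whatever your driver raises when the server side closes the connection. It should only wrap operations that are safe to run twice.

```python
import time


def run_with_reconnect(connect, operation, retries=3, base_delay=0.5,
                       retriable=(ConnectionError,), sleep=time.sleep):
    """Run operation(conn), reconnecting if the connection was dropped.

    connect   -- hypothetical factory returning a fresh connection
    operation -- callable taking the connection and returning a result
    retriable -- exception types treated as "the server closed the connection";
                 with a real driver, pass that driver's disconnect error class
    """
    conn = connect()
    for attempt in range(retries + 1):
        try:
            return operation(conn)
        except retriable:
            if attempt == retries:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)  # exponential backoff
            conn = connect()  # discard the dead connection, open a new one
```

The idea is that a connection that sat idle long enough to be closed fails on its first use, is thrown away, and the operation is retried once on a fresh connection. Many connection pools offer a similar "check before use" option, which may be a better fit than hand-rolled retries.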
When I clone the machine and have two instances it works, but as soon as I destroy one of them it stops working again. I tried both destroying the new replica and destroying the original master instance; in both cases the issue is the same: as soon as the number goes down to one, it doesn’t work anymore. I’ve been running a single machine for 4 months now, and this is the first time this has happened.
Hey Shaun,
It suddenly started working again, so no problem anymore. But our app was still down for 3 hours, and I have no idea what caused it, what I could do to prevent it from happening again, or at least how I would fix it. Any ideas?
2024-04-24T18:56:30.049 proxy[4d891972f279d8] sjc [error] [PP02] could not proxy TCP data to/from instance: failed to copy (direction=client->server, error=Connection reset by peer (os error 104))
This is occurring much more frequently today than in the past, and these errors are not happening on idle clients.