Postgres App not working anymore

My Postgres app stopped working out of nowhere.

2024-04-22T11:09:23.341 proxy[5683d922f4dd28] ams [error] [PP02] could not proxy TCP data to/from instance: failed to copy (direction=client->server, error=Transport endpoint is not connected (os error 107))

2024-04-22T11:09:48.233 app[5683d922f4dd28] ams [info] monitor | [WARN] Failed to restart haproxy on member fdaa:2:1180:a7b:39:8e69:7332:2: Get "http://[fdaa:2:1180:a7b:39:8e69:7332:2]:5500/commands/admin/haproxy/restart": dial tcp [fdaa:2:1180:a7b:39:8e69:7332:2]:5500: i/o timeout

2024-04-22T11:09:48.233 app[5683d922f4dd28] ams [info] monitor | clusterStateMonitorTick failed with: primary has been quarantined: unable to confirm we are the true primary

2024-04-22T11:09:49.151 app[5683d922f4dd28] ams [info] failed post-init: unrecoverable zombie. Retrying...

2024-04-22T11:09:49.151 app[5683d922f4dd28] ams [info] [ERROR] Manual intervention required.

2024-04-22T11:09:49.151 app[5683d922f4dd28] ams [info] [ERROR] If a new primary has been established, consider adding a new replica with `fly machines clone <primary-machine-id>` and then remove this member.

When I connect to the app via SSH I can see a zombie.lock, which resolves the issue for a couple of seconds. Afterwards it ends up in the same state again.
I think my DB app crashed because it ran out of memory, but even though I raised the memory limit (which was never full anyway) it is still not working. Any help is much appreciated.

Hey there,

I just took a look at your app and it looks like it’s currently in good shape. Do you still need assistance here?

As a side note, it looks like you have quite a few unused volumes tied to your app.

I’m also receiving the same proxy error from my Fly cluster. I noticed that PP02 is not documented here: Fly.io Error Codes · Fly Docs

My own investigation suggests that PP02 occurs when pg clients connected to a Postgres cluster are forcefully disconnected by Fly’s Postgres proxy after 30 minutes of idle time.

Clients that expect > 30 minutes of idle time should be resilient to server-side disconnects.
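
For what it's worth, here is a minimal sketch of what "resilient to server-side disconnects" could look like on the client side. Everything named here (`Disconnected`, `ResilientClient`, the `connect` callable) is hypothetical and made up for illustration; in a real app you would catch whatever error your Postgres driver raises when the connection is gone and open the connection with that driver. It is only safe to retry operations that are idempotent or that had not started a transaction yet.

```python
import time
from typing import Any, Callable, Optional, TypeVar

T = TypeVar("T")


class Disconnected(Exception):
    """Hypothetical stand-in for a driver's 'connection lost' error."""


class ResilientClient:
    """Caches one connection and reopens it if the server side dropped it.

    Assumption: `connect` is any zero-argument callable that returns a fresh
    connection object. It is not part of any real library.
    """

    def __init__(
        self,
        connect: Callable[[], Any],
        max_attempts: int = 3,
        base_delay: float = 0.5,
    ) -> None:
        if max_attempts < 1:
            raise ValueError("max_attempts must be at least 1")
        self._connect = connect
        self._max_attempts = max_attempts
        self._base_delay = base_delay
        self._conn: Optional[Any] = None

    def run(self, operation: Callable[[Any], T]) -> T:
        """Run operation(conn), reconnecting with exponential backoff."""
        for attempt in range(1, self._max_attempts + 1):
            if self._conn is None:
                self._conn = self._connect()
            try:
                return operation(self._conn)
            except Disconnected:
                # The cached connection is dead (e.g. dropped while idle).
                self._conn = None
                if attempt == self._max_attempts:
                    raise
                time.sleep(self._base_delay * 2 ** (attempt - 1))
        raise RuntimeError("unreachable")
```

The idea is that a connection that sat idle past the proxy's cutoff fails on its next use, the client throws it away, and the retry runs on a fresh connection instead of surfacing the error to the user.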

I am having a similar issue today, not entirely sure what’s causing this.

When I clone the machine and have two instances, it works, but as soon as I destroy one of them it stops working again. I tried destroying the new replica as well as the original master instance; in both cases, as soon as the number of machines goes down to one, it doesn’t work anymore. I’ve been running a single machine for 4 months now, and this is the first time this has happened.

Hey Shaun,
It suddenly started working again, so it is no longer a problem. But our app was down for 3 hours, and I have no idea what caused it, what I could do to prevent it from happening again, or at least how I would fix it. Any ideas?

PS: thanks, I cleaned them up.

I am also seeing these errors:

2024-04-24T18:56:30.049 proxy[4d891972f279d8] sjc [error] [PP02] could not proxy TCP data to/from instance: failed to copy (direction=client->server, error=Connection reset by peer (os error 104))

This is occurring much more frequently today than in the past. These errors are not happening on idle clients.

Does anyone from fly.io have an explanation for these errors?

Hey @wobbleburger

We had an incident around that time which caused connectivity problems in multiple regions, including sjc: Fly.io Status - Elevated errors and connectivity problems

Looking at your app’s logs, the errors happened during that incident.
