I had 3 machines in this cluster. I removed one, and then couldn't do much else because of a lack of quorum. Eventually I was able to delete the primary machine via the dashboard, in hopes that the replica would become the primary after restarting.
No luck.
Can't connect with the proxy: tunnel unavailable.
Can't reset WireGuard: no such organization.
Can't fail over the replica: failover is not available for standalone postgres.
Meanwhile my app can't connect to it and I'm getting errors left and right.
I tried SSHing into it to do a pg_dump, but I get: connection to server on socket error. No such file or directory.
What can I do to prevent losing data and make the cluster run fine with a single node for now?
For the wireguard connection, can you try running fly agent stop followed by fly agent start and/or fly agent restart ? After those are you able to proxy in or use SSH? If not fly doctor might show some more details about what’s failing with the WG connection to your org.
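A sketch of that sequence (nothing here is specific to your app; it just restarts the local agent and re-checks connectivity):

```shell
# Restart the flyctl background agent, which manages the WireGuard tunnels
fly agent stop
fly agent start    # or, in one step: fly agent restart

# If proxying / SSH still fails afterwards, this prints more detail
# about what's failing with the WG connection to your org
fly doctor
```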
For the cluster, if you still have the volume from your former primary, the quickest recovery option is likely going to be creating a new cluster from a volume fork: Fork a volume from a Postgres app · Fly Docs
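Roughly, the fork approach looks like this — the app name and volume ID below are placeholders you'd replace with your own:

```shell
# Find the volume that was attached to the old primary
fly volumes list -a my-old-pg-app

# Create a brand-new Postgres app from a copy of that volume
fly postgres create --fork-from my-old-pg-app:vol_xxxxxxxxxxxx
```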
In order to recover the existing cluster, you'll likely need to connect to the machine and do some manual recovery work with the repmgr CLI tool to promote your node to primary. The exact steps depend on what state the cluster thinks it's in. repmgr cluster show will give you an overview of the cluster. Common recovery steps are to unregister any dead nodes and promote the replica.
(Note: the above assumes you have a postgres-flex cluster. If it's an older Stolon-based cluster the general process is the same, but using stolonctl commands instead of repmgr.)
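For a postgres-flex cluster, the manual steps might look something like the sketch below — run on the surviving replica as the postgres user. The config path and node ID are assumptions; check the cluster show output before unregistering anything:

```shell
# Inspect what repmgr currently thinks the cluster looks like
repmgr -f /data/repmgr.conf cluster show

# Remove the dead primary from the cluster state
# (use the node id reported by the command above)
repmgr -f /data/repmgr.conf primary unregister --node-id 1

# Promote this replica to primary
repmgr -f /data/repmgr.conf standby promote
```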
Destroying a primary machine when the cluster isn’t at quorum / is unhealthy is always going to cause issues for failing over the cluster. If the replica machine was caught up to the primary, it likely will have all the data, but if there was anything not yet replicated when the primary went down that may be lost.
I would avoid destroying the source volumes from the old primary until you have a healthy cluster up (a new one or the original) and have confirmed all your data is there.
I did the agent stop and then the agent start, then tried the proxy again. It didn't work: tunnel unavailable.
fly doctor
Testing authentication token... PASSED
Testing flyctl agent... PASSED
Testing local Docker instance... PASSED
Pinging WireGuard gateway (give us a sec)... FAILED
(Error: wireguard ping gateway: pinger: no such organization)
We can't establish connectivity with WireGuard for your personal organization.
WireGuard runs on 51820/udp, which your local network may block.
If this is the first time you've ever used 'flyctl' on this machine, you
can try running 'flyctl doctor' again.
If this was working before, you can ask 'flyctl' to create a new peer for
you by running 'flyctl wireguard reset'.
If your network might be blocking UDP, you can run 'flyctl wireguard websockets enable',
followed by 'flyctl agent restart', and we'll run WireGuard over HTTPS.
I also happened to try fly ping [db-app].internal and it does ping just fine.
I'll check what I can do regarding the cluster. I do still have all the volumes around. Wise me figured it wouldn't be ideal to remove those until everything was fine.
And the fly checks list for my db app is as follows:
NAME | STATUS | MACHINE | LAST UPDATED | OUTPUT
-------*---------*----------------*--------------*--------------------------------------------------------------------------
pg | passing | REDACTED | 13m25s ago | [✓] connections: 5 used, 3 reserved, 300 max (28.88ms)
-------*---------*----------------*--------------*--------------------------------------------------------------------------
role | passing | REDACTED | 13m19s ago | replica
-------*---------*----------------*--------------*--------------------------------------------------------------------------
vm | passing | REDACTED | 13m25s ago | [✓] checkDisk: 36.98 GB (94.7%) free space on /data/ (69.8µs)
| | | | [✓] checkLoad: load averages: 0.00 0.00 0.00 (93.35µs)
| | | | [✓] memory: system spent 0s of the last 60s waiting on memory (80.86µs)
| | | | [✓] cpu: system spent 0s of the last 60s waiting on cpu (56.16µs)
| | | | [✓] io: system spent 0s of the last 60s waiting on io (29.89µs)
-------*---------*----------------*--------------*--------------------------------------------------------------------------
Everything appears to be fine except that it has no leader.
Yep, if the cluster is below quorum, it won't be able to elect a new leader should the primary disappear. This is split-brain protection: it prevents both nodes from electing themselves primary when both are active but just can't reach each other.
If you want to restore that cluster, you'll need to do the manual repmgr work to clean up any already-removed nodes from the cluster state and promote the replica. However, forking from the volume last attached to your primary is probably still going to be faster than the manual repmgr work. Forking also shouldn't require WG, which avoids the other issues.
It looks like there's a bug in v0.3.214 that's preventing WG connections from being reset properly. We're working on a fix. In the meantime, if you manually install the previous version with curl -L https://fly.io/install.sh | sh -s 0.3.213 and then run the fly wg reset or agent commands, that should get you unblocked there.
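For reference, the rollback-and-reset sequence would be roughly:

```shell
# Install the previous flyctl release to work around the WG reset bug
curl -L https://fly.io/install.sh | sh -s 0.3.213

# Then recreate the WireGuard peer and restart the agent
fly wireguard reset
fly agent restart
```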
If you previously installed flyctl via a package manager (Homebrew etc.) you might need to uninstall via that first, to avoid conflicts between the manual script install and the package-manager-installed version.
This still returns a list of 3, even though only 1 is attached to a VM. But I believe I know which volume was attached to the former primary. I copied its ID, then ran:
Error: Failed to resolve the specified fork-from volume vol_id: failed to get volume vol_id: request returned non-2xx status: 404: 404 page not found
🤷‍♂️
That’s expected, it’ll return any non-deleted volumes in the app.
That error isn't expected, though! This seems to be a flyctl regression from 0.3.214 on. If you roll back to 0.3.213 again with curl -L https://fly.io/install.sh | sh -s 0.3.213, the fork command should work as expected.
Ok! Rolling back to 0.3.213 did the trick for the pg create, and I then updated DATABASE_URL on my app so that it points to the new PG app. It seems to be good.
And connecting to the machine with SSH and then entering psql to list some records, it does seem that no data was lost. But for some reason I'm not able to sign in. I'm 90% sure my password is correct (it's coming from my password manager), but it rejects it.
The proxy would help here, but that’s still not working…
If you manually added the new DATABASE_URL to the machine, then the attach command shouldn't be needed; it's essentially a wrapper over updating the app secret manually.
Just to confirm: you can't log in to the database from the machine (i.e. using psql), or you can't log in to your app itself?
If it's the former: the new database has a new user/password pair for the Postgres db, so you'd need to ensure you're using that. It's in the database URL; you can check which one the machine is using by running env from within it.
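A sketch of that check, with a hypothetical app name:

```shell
# Open a session on the app machine and print its environment
fly ssh console -a my-app -C env

# Or, from a shell inside the machine, just the credentials:
# env | grep DATABASE_URL
```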
Well, env was just what I was looking for! Thanks!
I meant I'm unable to log into my app itself. Everything on Fly's end seems to be good; even the proxy works now. But I don't know why I thought a visual app would help over what psql already did… it's honestly not helping at all. And env just confirmed it's using the new DATABASE_URL.
I have done a migration before, when MAD had hardware issues and I needed to manually switch to a machine in a different region. But I don’t recall having this issue (not being able to log into my app).
I suspected that even though I could list all users from psql and everything seemed fine, my app might be connecting to a different database.
So I went into SSH again and ran \l to list all databases. I had postgres, repmgr, template0, template1, and then one with the name of my app (the one I actually want to connect to).
Apparently (thanks, ChatGPT) when the connection URL doesn't specify a database, Rails ends up connecting to the Postgres default: a database named after the connecting user. So, since the user is postgres, it was connecting to the postgres database, which doesn't have my app's actual users.
All I did was append /my_app to the end of the DATABASE_URL secret. All good now.
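The fix above amounts to giving the URL an explicit database path, so clients stop falling back to the database named after the user. A sketch with made-up credentials and app names:

```shell
# Without a trailing path, clients default to the "postgres" database
# (the database matching the connecting user)
url="postgres://postgres:secret@my-db.internal:5432"

# Appending the database name makes the target explicit
echo "${url}/my_app"

# Then update the secret on the app, e.g.:
# fly secrets set DATABASE_URL="${url}/my_app" -a my-app
```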