Postgres database down, can't restart instance or machine

I was testing some things with my development app and tried to log in, which failed. The app’s logs show that it can’t connect to my db instance. I then tried to connect remotely through a proxy, which also failed.

Digging deeper I found the monitoring logs with this error for the db app:
2023-10-04T14:58:07.707 proxy[6e82d4e6ae9698] lax [error] timed out while connecting to your instance. this indicates a problem with your app (hint: look at your logs and metrics)
I then tried to restart my pg instance:

flyctl postgres restart -a [db-app-name] --force
No leader found, but continuing with restart
Identifying cluster role(s)
  Machine [machine_id]: error
Restarting machine [machine_id]
Error: could not stop machine [machine_id]: failed to restart VM [machine_id]: failed_precondition: unable to restart machine, not currently started or stopped (Request ID: [request-id])

Still no go. Then I tried another suggestion: check the machine status and restart it through the machine commands.
fly machines list -a [db-app-name] shows one machine that is in a stopping state.
Trying to restart that machine fails:

Error: could not start machine [machine_id]: failed to start VM [machine_id]: failed_precondition: unable to start machine from current state: 'stopping' (Request ID: [request-id])

Not sure what brought this on, looking at the db snapshots the size hasn’t changed much and it isn’t near the limit.

24 hours later and it’s still down :(.

Hi!

The particular host on which your VM resides had trouble earlier in the week and was down for a few hours. The machine should have come back up when maintenance was completed, but it didn’t - we’re looking into it.

Keep in mind it’s usually a good idea for critical databases to have redundancy; once it’s back up, add a couple of extra nodes with fly machine clone so you get more resiliency in case of a single node going down. This has an extra cost, of course - it’s up to you to balance uptime requirements with budget constraints.

I’ll poke again once the machine is up unless you notice it first, in which case, please let us know :)

Cheers!

Hi again,

Your machine is now properly stopped. You can now fly machines restart it; let us know if that works.

Regards,

  • Daniel

Merh :/.

fly machines restart [machine_id] -a [my-db-app]
Restarting machine [machine_id]
Error: failed to restart machine [machine_id]: could not stop machine [machine_id]: failed to restart VM [vm_id]: failed_precondition: machine still restarting (Request ID: [request_id])

Doing a list shows that machine as stopped. Is it possible to just take the snapshot, blow this away, and start a different one?

Hey, sorry to hear your database is down. Bugs me when that happens to my own apps so I completely get where you’re coming from.

Just so we’re on the same page, my aim here is to bring up a new database instance with your existing data intact. I’m not very interested in determining why the existing instance crashed; I just want to get your data out of it. We’ll start by getting access to the database files, then bring them up in a new machine so we can dump them to SQL, then load that dump into a new instance. You may be able to skip the second step entirely if you like living dangerously, but problems will probably be a bit more difficult to debug if you just bring up the old database in a new instance and something breaks.

Destroying a machine doesn’t destroy its attached volumes. Let’s start by finding the volume with your data:
fly m status <machine ID of busted Postgres>
Look for a line like:

  Volume        = vol_2yxp4mng75zr63qd

That’s the volume containing your data. Make a note of it. Then:
fly m destroy --force <ID of doomed Postgres>
Hopefully that database isn’t too cursed and can at least be force destroyed.
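
As a rough sketch of the "make a note of it" step, the volume ID can be pulled out of the status output with a small shell helper before anything gets destroyed. The function name extract_volume_id is made up for illustration, and it assumes the output contains a line shaped like the "Volume = vol_..." example above; the exact layout may differ between flyctl versions.

```shell
#!/bin/sh
# Hypothetical helper: read status output on standard input and print the
# ID from the first line of the form "Volume = vol_...".
extract_volume_id() {
  awk '$1 == "Volume" && $2 == "=" { print $3; exit }'
}

# Possible usage (MACHINE_ID and APP are placeholders you supply):
#   VOLUME_ID=$(fly m status "$MACHINE_ID" -a "$APP" | extract_volume_id)
#   echo "$VOLUME_ID"    # keep a copy of this before destroying the machine
```

If the helper prints nothing, the status output did not contain a matching line, and it is worth reading the output by hand before running the destroy command.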
Then reattach that volume to a new machine. Check out fly m run, pick an appropriate incantation for your needs, and make sure you’re using the -v flag with that volume ID. I’d probably load up a Postgres image, mount the volume where the image expects to find its data, and dump it out somewhere. Once you’ve got a database dump, you should be able to create a new instance and load in the old data with our existing fly postgres commands.
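
A minimal sketch of what such an incantation might look like, written as a helper that only prints the command so it can be checked before running. The function name print_rescue_command, the app name, and the /data mount path are all placeholders for illustration; the right mount path depends on where the chosen image expects its data directory and on how the files are laid out on the volume.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) a `fly m run` command that mounts an
# existing volume into a new machine.
# Arguments: app name, volume ID, mount path, optional image (default: postgres).
print_rescue_command() {
  app=$1
  volume=$2
  mount_path=$3
  image=${4:-postgres}

  # The volume ID in the status output above starts with "vol_".
  case $volume in
    vol_*) ;;
    *) echo "volume ID should look like vol_..., got: $volume" >&2; return 1 ;;
  esac

  # The mount path inside the machine must be absolute.
  case $mount_path in
    /*) ;;
    *) echo "mount path must be absolute, got: $mount_path" >&2; return 1 ;;
  esac

  printf 'fly m run %s -a %s -v %s:%s\n' "$image" "$app" "$volume" "$mount_path"
}

# Example with placeholder values:
print_rescue_command my-db-app vol_2yxp4mng75zr63qd /data
```

Printing first is a deliberate choice here: a typo in the volume ID or mount path is much cheaper to catch before a machine has been created.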

Hope that helps, please let me know if you have more questions.


I was able to blow the machine away and I ran this:
fly m run postgres -a [my_app] -v [my_volume]:/var/lib/pgsql/data
Which spit this out:

Searching for image 'postgres' remotely...
image found: img_rj5yv11x887vdwq7
Image: registry-1.docker.io/library/postgres:latest
Image size: 151 MB

Success! A machine has been successfully launched in app [my_app]
 Machine ID: [machine_id]
 Instance ID: [instance]
 State: created

 Attempting to start machine...

==> Monitoring health checks
No health checks found

Machine started, you can connect via the following private ip
  [ip]
2023-10-06T17:12:40.235 app[xxx] lax [info] [ 0.038655] PCI: Fatal: No config space access function found
2023-10-06T17:12:40.401 app[xxx] lax [info] INFO Starting init (commit: 5d9c42f)...
2023-10-06T17:12:40.417 app[xxx] lax [info] INFO Mounting /dev/vdb at /var/lib/pgsql/data w/ uid: 0, gid: 0 and chmod 0755
2023-10-06T17:12:40.472 app[xxx] lax [info] INFO Resized /var/lib/pgsql/data to 1069547520 bytes
2023-10-06T17:12:40.473 app[xxx] lax [info] INFO Preparing to run: `docker-entrypoint.sh postgres` as root
2023-10-06T17:12:40.522 app[xxx] lax [info] INFO [fly api proxy] listening at /.fly/api
2023-10-06T17:12:40.528 app[xxx] lax [info] 2023/10/06 17:12:40 listening on [fdaa:2:4c9e:a7b:112:d0c2:99da:2]:22 (DNS: [fdaa::3]:53)
2023-10-06T17:12:40.592 app[xxx] lax [info] Error: Database is uninitialized and superuser password is not specified.
2023-10-06T17:12:40.592 app[xxx] lax [info] You must specify POSTGRES_PASSWORD to a non-empty value for the
2023-10-06T17:12:40.592 app[xxx] lax [info] superuser. For example, "-e POSTGRES_PASSWORD=password" on "docker run".
2023-10-06T17:12:40.592 app[xxx] lax [info] You may also use "POSTGRES_HOST_AUTH_METHOD=trust" to allow all
2023-10-06T17:12:40.592 app[xxx] lax [info] connections without a password. This is *not* recommended.
2023-10-06T17:12:40.592 app[xxx] lax [info] See PostgreSQL documentation about "trust":
2023-10-06T17:12:40.592 app[xxx] lax [info] https://www.postgresql.org/docs/current/auth-trust.html
2023-10-06T17:12:41.523 app[xxx] lax [info] INFO Main child exited normally with code: 1
2023-10-06T17:12:41.524 app[xxx] lax [info] INFO Starting clean up.
2023-10-06T17:12:41.524 app[xxx] lax [info] INFO Umounting /dev/vdb from /var/lib/pgsql/data
2023-10-06T17:12:41.526 app[xxx] lax [info] WARN hallpass exited, pid: 312, status: signal: 15 (SIGTERM)
2023-10-06T17:12:41.530 app[xxx] lax [info] 2023/10/06 17:12:41 listening on [ip]:22 (DNS: [fdaa::3]:53)
2023-10-06T17:12:42.527 app[xx] lax [info] [ 2.328224] reboot: Restarting system

I’m going to guess that’s not the right image?

I think that should be fine. It gives you a few tips for what to do in the logs:

2023-10-06T17:12:40.592 app[xxx] lax [info] You must specify POSTGRES_PASSWORD to a non-empty value for the
2023-10-06T17:12:40.592 app[xxx] lax [info] superuser. For example, "-e POSTGRES_PASSWORD=password" on "docker run".
2023-10-06T17:12:40.592 app[xxx] lax [info] You may also use "POSTGRES_HOST_AUTH_METHOD=trust" to allow all
2023-10-06T17:12:40.592 app[xxx] lax [info] connections without a password. This is *not* recommended.

Specify one of those environment variables via -e and you should be good to go. Note that all we’re trying to do here is boot up the machine so you can dump the database to text, get the data out with something like fly sftp, then load it back in by running fly pg connect and feeding your dump through the Postgres console. Ideally you still have your previous password, but if not, it’s probably more important that you get your data out than that you authenticate with the right credentials.
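
The dump-and-copy part might be sketched roughly as follows, assuming the rescue machine is already running with Postgres up inside it. The app name and file paths are placeholders, the step and rescue_dump helpers are invented for this sketch, and authentication for pg_dumpall may need adjusting depending on how the recovered cluster is configured. Each command is only printed unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch only: every value here is a placeholder, not taken from this thread.
APP=${APP:-my-db-app}
DUMP=${DUMP:-/tmp/dump.sql}

# Print the command by default; execute it only when RUN=1.
step() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

rescue_dump() {
  # 1. Dump every database in the cluster to a SQL file inside the machine.
  step fly ssh console -a "$APP" -C "pg_dumpall -U postgres -f $DUMP"
  # 2. Copy that file from the machine to the current local directory.
  step fly ssh sftp get "$DUMP" ./dump.sql -a "$APP"
}

rescue_dump
```

Once dump.sql is on your own machine, it can be loaded into the new instance through the Postgres console as described above.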

Hope that helps.

Well, I think I got everything set up right, but my db isn’t in there. I just see postgres, template0 and template1. When I try to connect directly to my database it says it doesn’t exist. I tried to go back to one of the snapshots from Oct 1 and I still don’t see any of my data in there. Not looking good :/.

Clearly I’ve done something wrong based on the limited amount of instructions given. Something hasn’t mounted properly or I don’t know. Anyways it is beyond what I want to deal with and I’ve purchased a plan to get support and am just waiting for that to get answered at some point.
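
One possible explanation, which nobody in this thread confirmed: the earlier command mounted the volume at /var/lib/pgsql/data, while the official postgres image of that era documents /var/lib/postgresql/data as its default data directory. The "Database is uninitialized" message in the earlier log is what that image prints when its data directory is empty, so Postgres may have created a fresh cluster next to the volume instead of reading it. That would match seeing only postgres, template0 and template1. A rough way to check is to look for the real data directory on the mounted volume; a Postgres data directory contains a PG_VERSION file alongside a base/ subdirectory. The helper name find_pgdata and the /data path in the usage comment are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: list directories under a mount point that look like a
# Postgres data directory (PG_VERSION file plus a base/ subdirectory).
# $1: directory where the volume is mounted.
find_pgdata() {
  find "$1" -maxdepth 3 -type f -name PG_VERSION 2>/dev/null |
  while IFS= read -r version_file; do
    dir=$(dirname "$version_file")
    # Per-database directories under base/ carry their own PG_VERSION but
    # have no base/ of their own, so this check filters them out.
    if [ -d "$dir/base" ]; then
      printf '%s\n' "$dir"
    fi
  done
}

# Possible usage inside the rescue machine (path is a placeholder):
#   find_pgdata /data
```

If this prints a subdirectory of the mount rather than the mount point itself, the image would need to be pointed at that subdirectory (the official image reads the PGDATA environment variable for this) before the old databases show up.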
