Postgres app connection issue with website app

Hello Guys!

I seek your help once again in my struggle to make my website stable again.

For some strange reason my website doesn't run: I get an HTTP 500 error upon loading the main page. The logs from the web app machine are all OK; the machine starts and runs. However, the Postgres VM is screaming with errors.

2025-01-24T02:25:19.939 app[178103db9120d8] sin [info] repmgrd | Running…

2025-01-24T02:25:19.947 app[178103db9120d8] sin [info] repmgrd | [2025-01-24 02:25:19] [NOTICE] repmgrd (repmgrd 5.4.1) starting up

2025-01-24T02:25:19.947 app[178103db9120d8] sin [info] repmgrd | [2025-01-24 02:25:19] [INFO] connecting to database "host=fdaa:2:e0e4:a7b:80:2ecf:9e82:2 port=5433 user=repmgr dbname=repmgr connect_timeout=5"

2025-01-24T02:25:19.971 app[178103db9120d8] sin [info] repmgrd | INFO: set_repmgrd_pid(): provided pidfile is /tmp/repmgrd.pid

2025-01-24T02:25:19.971 app[178103db9120d8] sin [info] repmgrd | [2025-01-24 02:25:19] [NOTICE] starting monitoring of node "fdaa:2:e0e4:a7b:80:2ecf:9e82:2" (ID: 1950994327)

2025-01-24T02:25:19.971 app[178103db9120d8] sin [info] repmgrd | [2025-01-24 02:25:19] [INFO] "connection_check_type" set to "ping"

2025-01-24T02:25:20.500 app[178103db9120d8] sin [info] failed post-init: failed to register existing standby: failed to register standby: exit status 1. Retrying…

2025-01-24T02:25:24.978 app[178103db9120d8] sin [info] repmgrd | [2025-01-24 02:25:24] [ERROR] connection to database failed

2025-01-24T02:25:24.978 app[178103db9120d8] sin [info] repmgrd | [2025-01-24 02:25:24] [DETAIL]

2025-01-24T02:25:24.978 app[178103db9120d8] sin [info] repmgrd | connection to server at "fdaa:2:e0e4:a7b:2cd:1ca8:cf98:2", port 5433 failed: timeout expired

2025-01-24T02:25:24.978 app[178103db9120d8] sin [info] repmgrd |

2025-01-24T02:25:24.978 app[178103db9120d8] sin [info] repmgrd | [2025-01-24 02:25:24] [DETAIL] attempted to connect using:

2025-01-24T02:25:24.978 app[178103db9120d8] sin [info] repmgrd | user=repmgr connect_timeout=5 dbname=repmgr host=fdaa:2:e0e4:a7b:2cd:1ca8:cf98:2 port=5433 fallback_application_name=repmgr options=-csearch_path=

2025-01-24T02:25:24.978 app[178103db9120d8] sin [info] repmgrd | [2025-01-24 02:25:24] [ERROR] unable to connect to upstream node (ID: 656412248), terminating

2025-01-24T02:25:24.978 app[178103db9120d8] sin [info] repmgrd | [2025-01-24 02:25:24] [HINT] upstream node must be running before repmgrd can start

2025-01-24T02:25:24.978 app[178103db9120d8] sin [info] repmgrd | [2025-01-24 02:25:24] [INFO] repmgrd terminating…

2025-01-24T02:25:24.981 app[178103db9120d8] sin [info] repmgrd | exit status 6

2025-01-24T02:25:24.981 app[178103db9120d8] sin [info] repmgrd | restarting in 5s [attempt 2369]

It is hard for me to understand much from the error logs. I can only tell that there is a communication issue between the two machines.

To give you a little context: this is a Django app. It was launched without a DB; the DB was then created and attached with fly pg create and fly pg attach.
Then I had to change DB_URL_ENV to select the database with my existing data instead of the blank one.
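That change was just the usual secret update, something like the following (every value in the connection string here is a placeholder, not my real one):

$ fly secrets set DB_URL_ENV="postgres://user:password@db-app-name.flycast:5432/dbname" -a web-app-name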

Sorry to hear that there’s been trouble still, :adhesive_bandage:. This looks like the Postgres process itself might not actually be running, and the faithful watchdog repmgr is keening piteously at its absence…

As an initial diagnostic, what do fly m list -a db-app-name and fly checks list -a db-app-name show?


Aside: You can paste logs and terminal output using the </> button (“Preformatted text”) in the edit toolbar. (Or by using triple backticks (```) directly.)

This avoids skewing the box-drawing alignment, etc.

Yeah, it is a little bit frustrating. I thought the hard part was making it all work on the dev server :laughing:

This is what fly m list -a db-app-name yields

ID              NAME                STATE    CHECKS  REGION  ROLE     IMAGE                               IP ADDRESS                      VOLUME                CREATED               LAST UPDATED          PROCESS GROUP  SIZE
178103db9120d8  falling-paper-9949  started  3/3     sin     replica  flyio/postgres-flex:16.3 (v0.0.61)  fdaa:2:e0e4:a7b:80:2ecf:9e82:2  vol_v88ogdwd9kmg8p1v  2025-01-16T12:30:30Z  2025-01-23T19:49:12Z                 shared-cpu-1x:256MB

And here is one more stat, from fly checks list -a db-app-name:

Health Checks for make-my-cake-workload-planner-db 
  NAME | STATUS  | MACHINE        | LAST UPDATED | OUTPUT                                                                      
-------*---------*----------------*--------------*----------------------------------------------------------------------------
  pg   | passing | 178103db9120d8 | 8h53m ago    | [✓] connections: 9 used, 3 reserved, 300 max (38.32ms)
-------*---------*----------------*--------------*----------------------------------------------------------------------------
  role | passing | 178103db9120d8 | 8h53m ago    | replica
-------*---------*----------------*--------------*----------------------------------------------------------------------------
  vm   | passing | 178103db9120d8 | 8h53m ago    | [✓] checkDisk: 791.53 MB (80.3%) free space on /data/ (35.64µs)
       |         |                |              | [✓] checkLoad: load averages: 0.00 0.00 0.00 (50.05µs)
       |         |                |              | [✓] memory: system spent 12ms of the last 60s waiting on memory (29.43µs)   
       |         |                |              | [✓] cpu: system spent 1.22s of the last 60s waiting on cpu (20.16µs)
       |         |                |              | [✓] io: system spent 474ms of the last 60s waiting on io (21.91µs)

I can only see that the role is replica. This might be the issue, but it is the result of me trying to shuffle things around to make it work again :cry: There used to be two more VMs that were not passing all the health checks (2/3), and the error fell on the connection side.

Ah… Yeah, these PG Flex clusters are a little delicate when it comes to deleting nodes, :dragon:.

What was the failing check, back when you had 3 nodes?

The only reason I started to delete them was that 2 of the 3 VMs could not pass the health checks. I can’t remember the exact wording, but it was due to a connection issue: timeout or host unreachable.

Hm… I think I would try cloning out two new Machines—and see if the new set of three elects a primary.

(Remove any lingering, unattached volumes first.)
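Concretely, that would be something along these lines, where the volume ID in the destroy step is a placeholder for whatever fly volumes list reports as unattached:

$ fly volumes list -a db-app-name                          # spot lingering, unattached volumes
$ fly volumes destroy vol_xxxxxxxxxxxxxxxx -a db-app-name  # placeholder volume ID
$ fly m clone 178103db9120d8 -a db-app-name                # run twice, once per new Machine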

In general, though, managing these Postgres clusters sometimes assumes a fair amount of gritty knowledge and care. (I don’t maintain them myself, except for occasional experiments.)

I can’t clone it …

Volume 'pg_data' will start empty 
Provisioning a new Machine with image docker-hub-mirror.fly.io/flyio/postgres-flex:16.3@sha256:ad222bd07df8afa497134e1b80e8644759c46465a79c39a5e3bbd051c69d5bd7... 
  Machine 78156d1c94d448 has been created... 
  Waiting for Machine 78156d1c94d448 to start... 
  Waiting for 78156d1c94d448 to become healthy (stopped, 0/3) 
Error: error while watching health checks: All attempts fail: 
#1: failed to get VM 78156d1c94d448: Get "https://api.machines.dev/v1/apps/make-my-cake-workload-planner-db/machines/78156d1c94d448": net/http: request canceled
#2: context deadline exceeded

Hm… If the older Machine retains its replica status, then I’d suggest a verification step next:

$ fly m start 178103db9120d8 -a db-app-name  # ensure running.
$ fly ssh console --machine 178103db9120d8 -a db-app-name
# su postgres
% psql -p 5433  # note that this is 5433, not the usual 5432.
postgres=#

Then poke around a little, like before, to examine the tables…

(What this is doing is bypassing all the watchdogs, etc., allowing it to succeed where fly pg connect would fail.)
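By "poke around", I mean something as simple as the following, where the database name is guessed from your app name and the table in the last line is purely hypothetical:

postgres=# \l
postgres=# \c make_my_cake_workload_planner
make_my_cake_workload_planner=# \dt
make_my_cake_workload_planner=# SELECT count(*) FROM myapp_sometable;  -- hypothetical table name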

If your data is there, then it’s mainly a matter of how inconvenient it would be to turn things back into a repaired and/or fresh PG cluster. I don’t know the PG Flex mechanisms in low-enough-level detail to advise from afar on how to perform surgery on one, but likely there is some scalpel slice with which you can inform repmgr that there no longer is a primary that it should be trying to contact. The following classic post by @uncvrd might have the clues that a sufficiently admin-minded person would need to piece that together, even though it’s not precisely the same situation:

https://community.fly.io/t/heres-how-to-fix-an-unreachable-2-zombie-1-replica-ha-postgres-cluster/19503
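I haven’t had to perform that particular surgery myself, but the rough shape, assuming repmgr’s standard CLI and guessing at where the PG Flex image keeps its config, would be something like the following, with the linked post as the authoritative walkthrough:

# inside the surviving Machine, as the postgres user; the config path is a guess
% repmgr -f /data/repmgr.conf cluster show       # inspect what repmgr currently believes
% repmgr -f /data/repmgr.conf standby promote    # declare this node the new primary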

Other than such measures, there’s the inelegant but (typically) rather effective technique of fly pg create --fork-from:

https://community.fly.io/t/urgency-problems-with-postgres-the-database-is-not-responding/19926/2

(You will need the explicit volume ID here, since there is no primary.)
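In your case that volume ID is visible in the fly m list output above, so the command would be something like (the new app name is whatever you choose):

$ fly pg create --fork-from db-app-name:vol_v88ogdwd9kmg8p1v --name new-db-app-name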

This one would require you to go through the DATABASE_URL or DB_URL_ENV dance again, and of course this doesn’t resolve the question of why the cluster failed in the first place, :thought_balloon:.
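(The dance being, roughly, re-attaching so the secret points at the new cluster; the app names here are placeholders:

$ fly pg attach new-db-app-name -a web-app-name   # resets DATABASE_URL on the web app

…or a manual fly secrets set of your own DB_URL_ENV.)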

Hope this helps a little!

Thanks for your effort. Much appreciated

I am not too fluent with SQL syntax, so "poke around" only sounds cool; there’s not much I can do. Sorry :frowning: While listing the databases, though, I can see all the ones I expect.

postgres-# \l 
                                                                          List of databases 
                       Name                        |  Owner   | Encoding | Locale Provider |  Collate   |   Ctype    | ICU Locale | ICU Rules |   Access privileges
---------------------------------------------------+----------+----------+-----------------+------------+------------+------------+-----------+-----------------------
 make_my_cake_workload_planner                     | postgres | UTF8     | libc            | en_US.utf8 | en_US.utf8 |            |           |
 postgres                                          | postgres | UTF8     | libc            | en_US.utf8 | en_US.utf8 |            |           |
 repmgr                                            | repmgr   | UTF8     | libc            | en_US.utf8 | en_US.utf8 |            |           |
 sugarcane_bakery_workload_planner_muddy_pine_8523 | postgres | UTF8     | libc            | en_US.utf8 | en_US.utf8 |            |           |
 template0                                         | postgres | UTF8     | libc            | en_US.utf8 | en_US.utf8 |            |           | =c/postgres          +
--More--

sugarcane_bakery_workload_planner_muddy_pine_8523 this is the one I restored from image and it has all my data that I want to save.

What if we kill it all and let fly launch do the job and configure it all correctly? How can we transfer my data to a new app?

Interestingly enough, I used the guidelines described by the gracious person on the forum page linked above.

It worked!!!

I don’t know what, and I don’t know how, but it works. I cloned my volume into a different region, and cloned my PG machine as well. The app is up and running!
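From memory, it was something along these lines; the target region and exact flags may be off:

$ fly volumes fork vol_v88ogdwd9kmg8p1v -a db-app-name --region ams       # region from memory
$ fly m clone 178103db9120d8 -a db-app-name --region ams --attach-volume <new-volume-id>:/data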
