❯ fly postgres create --name flxwebsites-db-staging
? Select Organization: FLX Websites (flx-websites)
? Select region: Secaucus, NJ (US) (ewr)
For pricing information visit: https://fly.io/docs/about/pricing/#postgresql-cl
? Select configuration: Production - Highly available, 2x shared CPUs, 4GB RAM, 40GB disk
Creating postgres cluster in organization flx-websites
Creating app...
Setting secrets on app flxwebsites-db-staging...
Provisioning 1 of 2 machines with image flyio/postgres:14.4
Waiting for machine to start...
Machine e148e056ae0892 is created
Provisioning 2 of 2 machines with image flyio/postgres:14.4
Error failed to launch VM: nats: no responders available for request
Proxying local port 15432 to remote [flxwebsites-db-staging.internal]:5432
psql: error: connection to server at "localhost" (127.0.0.1), port 15432 failed: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
make: *** [connect-db-local] Error 2
The connection error is also consistent with an issue I saw earlier when trying to create a Postgres instance on the Development configuration: a service connecting from another Fly app was getting a pooling connection failure.
fly postgres connect -a flxwebsites-db-staging works fine, but if I run flyctl proxy 15432:5432 -a flxwebsites-db-staging and then psql postgres://postgres:<password>@localhost:15432, I get:
psql: error: connection to server at "localhost" (127.0.0.1), port 15432 failed: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
ID              STATE    ROLE    REGION  HEALTH CHECKS                   IMAGE                          CREATED               UPDATED
21781e60b06896  started  error   ewr     3 total, 1 passing, 2 critical  flyio/postgres:14.4 (v0.0.32)  2022-11-26T02:14:56Z  2022-11-26T02:15:09Z
7328725ec43859  started  leader  ewr     3 total, 3 passing              flyio/postgres:14.4 (v0.0.32)  2022-11-26T02:14:37Z  2022-11-26T02:14:52Z
Try selecting which instance to proxy to:
$ fly proxy 15432:5432 -a flxwebsites-db-staging -s
? Select instance: ewr.flxwebsites-db-staging.internal (fdaa:0:db85:a7b:95:6753:be9e:2)
Proxying local port 15432 to remote [fdaa:0:db85:a7b:95:6753:be9e:2]:5432
and then connect to it with psql:
psql: error: connection to server at "localhost" (127.0.0.1), port 15432 failed: FATAL: password authentication failed for user "postgres"
(It failed for me because I don’t know the password.)
If that works, delete instance 21781e60b06896 and clone 7328725ec43859 to add a new replica.
I went ahead and cloned the leader and stopped the offending instance:
~$ fly m clone -a flxwebsites-db-staging 7328725ec43859
Cloning machine 7328725ec43859 into region ewr
Provisioning a new machine with image flyio/postgres:14.4...
Machine 9185ee3c429583 has been created...
Waiting for machine 9185ee3c429583 to start...
Waiting for 9185ee3c429583 to become healthy (started, 3/3)
Machine has been successfully cloned!
~$ fly m stop -a flxwebsites-db-staging 21781e60b06896
Sending kill signal to machine 21781e60b06896...
21781e60b06896 has been successfully stopped
I didn’t remove it, just in case, but feel free to do so with fly machine remove at any time.
@dangra, is there a reason that deploying the configuration Production - Highly available, 2x shared CPUs, 4GB RAM, 40GB disk in EWR still results in only one healthy instance?
I tried again this morning to create this cluster (I’m working on repeatable scripts for my infra, so I want to make sure this is consistent), and I still have only one healthy instance:
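For anyone scripting this, the clone-then-stop sequence above could be wrapped in a small shell function. This is only a sketch: replace_replica and remove_machine are made-up helper names, and it assumes fly is on your PATH and already authenticated. It uses only the commands shown in this thread.

```shell
#!/bin/sh
# Sketch only: replace_replica and remove_machine are hypothetical helper names.

# Clone the healthy leader to get a fresh replica, then stop the broken machine.
# The body runs in a subshell (parentheses) so the variables do not leak.
replace_replica() (
  app=$1
  leader=$2
  broken=$3
  # Stop here if the clone fails, so the broken machine is left untouched.
  fly m clone -a "$app" "$leader" || exit 1
  fly m stop -a "$app" "$broken"
)

# Optional cleanup, once the new replica is confirmed healthy.
remove_machine() (
  app=$1
  machine=$2
  fly machine remove -a "$app" "$machine"
)
```

With the IDs from this thread, that would be replace_replica flxwebsites-db-staging 7328725ec43859 21781e60b06896, followed later by remove_machine flxwebsites-db-staging 21781e60b06896 if you decide you no longer need the stopped machine.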
❯ fly checks list -a flxwebsites-db
Health Checks for flxwebsites-db
NAME | STATUS | MACHINE | LAST UPDATED | OUTPUT
-------*----------*----------------*--------------*--------------------------------------------------------------------------
pg | passing | 0e286961b50867 | 4m36s ago | [✓] transactions: read/write (341.43µs)
| | | | [✓] connections: 9 used, 3 reserved, 300 max (4.39ms)
-------*----------*----------------*--------------*--------------------------------------------------------------------------
role | passing | 0e286961b50867 | 4m37s ago | leader
-------*----------*----------------*--------------*--------------------------------------------------------------------------
vm | passing | 0e286961b50867 | 5m3s ago | [✓] checkDisk: 37.06 GB (94.7%) free space on /data/ (86.66µs)
| | | | [✓] checkLoad: load averages: 0.00 0.00 0.00 (181.59µs)
| | | | [✓] memory: system spent 0s of the last 60s waiting on memory (96.84µs)
| | | | [✓] cpu: system spent 0s of the last 60s waiting on cpu (82.71µs)
| | | | [✓] io: system spent 0s of the last 60s waiting on io (73.62µs)
-------*----------*----------------*--------------*--------------------------------------------------------------------------
pg | critical | 21781e67a03896 | 4m43s ago | 500 Internal Server Error
| | | | failed to connect to proxy: context deadline exceeded
-------*----------*----------------*--------------*--------------------------------------------------------------------------
role | critical | 21781e67a03896 | 4m32s ago | 500 Internal Server Error
| | | | failed to connect to local node: context deadline exceeded
-------*----------*----------------*--------------*--------------------------------------------------------------------------
vm | passing | 21781e67a03896 | 3m57s ago | [✓] checkDisk: 37.06 GB (94.7%) free space on /data/ (67.84µs)
| | | | [✓] checkLoad: load averages: 0.00 0.00 0.00 (68.78µs)
| | | | [✓] memory: system spent 0s of the last 60s waiting on memory (79.78µs)
| | | | [✓] cpu: system spent 144ms of the last 60s waiting on cpu (31.2µs)
| | | | [✓] io: system spent 384ms of the last 60s waiting on io (22.64µs)
-------*----------*----------------*--------------*--------------------------------------------------------------------------
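Since the goal is a repeatable script, one option is to have it flag any machine that reports a critical check. A rough sketch, assuming the fly checks list output keeps the pipe-separated NAME | STATUS | MACHINE layout shown above (critical_machines is a made-up helper name, and the layout may differ between flyctl versions):

```shell
#!/bin/sh
# Sketch only: critical_machines is a hypothetical helper name.
# Reads `fly checks list` style output on stdin and prints each machine ID
# that has at least one check in "critical" status, once per machine.
critical_machines() {
  awk -F'|' '
    {
      status = $2
      machine = $3
      # Strip column padding around the values.
      gsub(/[[:space:]]/, "", status)
      gsub(/[[:space:]]/, "", machine)
      # Header, separator, and continuation rows never have status "critical".
      if (status == "critical" && machine != "" && !(machine in seen)) {
        seen[machine] = 1
        print machine
      }
    }'
}
```

Something like fly checks list -a flxwebsites-db | critical_machines could then run after cluster creation; empty output would mean no machine is reporting a critical check.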
True, a misconfigured network interface on one host in the ewr region was causing issues for apps whose machines were assigned to run there. It is fixed now. Thanks.