Error failed to launch VM: nats: no responders available for request

Not entirely sure what’s up. There are no issues listed on https://status.flyio.net/

❯ fly postgres create --name flxwebsites-db-staging
? Select Organization: FLX Websites (flx-websites)
? Select region: Secaucus, NJ (US) (ewr)
For pricing information visit: https://fly.io/docs/about/pricing/#postgresql-cl
? Select configuration: Production - Highly available, 2x shared CPUs, 4GB RAM, 40GB disk
Creating postgres cluster in organization flx-websites
Creating app...
Setting secrets on app flxwebsites-db-staging...
Provisioning 1 of 2 machines with image flyio/postgres:14.4
Waiting for machine to start...
Machine e148e056ae0892 is created
Provisioning 2 of 2 machines with image flyio/postgres:14.4
Error failed to launch VM: nats: no responders available for request

Hey @nicksergeant! We saw your tweet as well and have been looking into it. We'll let you know when things are fixed.

Ok, would you mind giving it another try?

It looks different than it did earlier, but it's now hanging on health checks:

==> Monitoring health checks
  Waiting for 7328725ec43859 to become healthy (started, 3/3)
  Waiting for 21781e60b06896 to become healthy (started, 1/3)

The logs in the dashboard also look suspect:

2022-11-26T02:19:34.060 app[21781e60b06896] ewr [info] exporter | ERRO[0264] Error opening connection to database (postgresql://flypgadmin:PASSWORD_REMOVED@[fdaa:0:db85:a7b:94:d127:926e:2]:5433/postgres?sslmode=disable): dial tcp [fdaa:0:db85:a7b:94:d127:926e:2]:5433: connect: connection refused source="postgres_exporter.go:1658"

2022-11-26T02:19:48.063 app[21781e60b06896] ewr [info] exporter | INFO[0278] Established new database connection to "fdaa:0:db85:a7b:94:d127:926e:2:5433". source="postgres_exporter.go:970"

2022-11-26T02:19:49.064 app[21781e60b06896] ewr [info] exporter | ERRO[0279] Error opening connection to database (postgresql://flypgadmin:PASSWORD_REMOVED@[fdaa:0:db85:a7b:94:d127:926e:2]:5433/postgres?sslmode=disable): dial tcp [fdaa:0:db85:a7b:94:d127:926e:2]:5433: connect: connection refused source="postgres_exporter.go:1658"

2022-11-26T02:19:57.151 app[21781e60b06896] ewr [info] keeper | pg_basebackup: error: connection to server at "fdaa:0:db85:a7b:95:6753:be9e:2", port 5433 failed: Connection timed out

2022-11-26T02:19:57.151 app[21781e60b06896] ewr [info] keeper | Is the server running on that host and accepting TCP/IP connections?

2022-11-26T02:19:57.152 app[21781e60b06896] ewr [info] keeper | 2022-11-26T02:19:57.151Z ERROR cmd/keeper.go:1364 failed to resync from followed instance {"error": "sync error: exit status 1"}

2022-11-26T02:20:02.287 app[21781e60b06896] ewr [info] keeper | 2022-11-26T02:20:02.286Z ERROR cmd/keeper.go:1109 db failed to initialize or resync

Actually looks like it just went through.

Nice! It can take some time for all of the health checks to pass.

Err, yeah, I think something’s still borked:

Proxying local port 15432 to remote [flxwebsites-db-staging.internal]:5432
psql: error: connection to server at "localhost" (127.0.0.1), port 15432 failed: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
make: *** [connect-db-local] Error 2

The connection error is also consistent with an issue I was seeing earlier when attempting to create a pg instance on the Development configuration. I was getting a pooling connection failure from a service that was attempting to connect (via another Fly app).

fly postgres connect -a flxwebsites-db-staging works fine, but if I run flyctl proxy 15432:5432 -a flxwebsites-db-staging and then psql postgres://postgres:<password>@localhost:15432, I get:

psql: error: connection to server at "localhost" (127.0.0.1), port 15432 failed: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

Hi Nick, what PostgreSQL version are you running locally? (That is, the output of psql --version.)

14.5

One of the nodes is still unhealthy:

ID              STATE   ROLE    REGION  HEALTH CHECKS                   IMAGE                           CREATED                 UPDATED  
21781e60b06896  started error   ewr     3 total, 1 passing, 2 critical  flyio/postgres:14.4 (v0.0.32)   2022-11-26T02:14:56Z    2022-11-26T02:15:09Z
7328725ec43859  started leader  ewr     3 total, 3 passing              flyio/postgres:14.4 (v0.0.32)   2022-11-26T02:14:37Z    2022-11-26T02:14:52Z
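
For anyone scripting this check, here is a minimal Python sketch that picks the unhealthy machine out of a status table like the one above. It assumes the column layout shown in this thread (ID, STATE, ROLE, REGION, then "N total, M passing"); the function name and the regex are illustrative, not part of flyctl, and the layout may differ in other flyctl versions.

```python
import re

# Matches a data row: machine id, state, role, region, then "N total, M passing".
# Assumes the table layout pasted in this thread.
ROW = re.compile(
    r"^(?P<id>[0-9a-f]+)\s+(?P<state>\S+)\s+(?P<role>\S+)\s+(?P<region>\S+)\s+"
    r"(?P<total>\d+) total, (?P<passing>\d+) passing"
)

def unhealthy_machines(status_text):
    """Return (id, role, passing, total) for machines with fewer passing checks than total."""
    bad = []
    for line in status_text.splitlines():
        m = ROW.match(line.strip())
        if m is None:
            continue  # header, blank line, or unrecognized row
        total, passing = int(m["total"]), int(m["passing"])
        if passing < total:
            bad.append((m["id"], m["role"], passing, total))
    return bad

SAMPLE = """\
ID              STATE   ROLE    REGION  HEALTH CHECKS                   IMAGE                           CREATED                 UPDATED
21781e60b06896  started error   ewr     3 total, 1 passing, 2 critical  flyio/postgres:14.4 (v0.0.32)   2022-11-26T02:14:56Z    2022-11-26T02:15:09Z
7328725ec43859  started leader  ewr     3 total, 3 passing              flyio/postgres:14.4 (v0.0.32)   2022-11-26T02:14:37Z    2022-11-26T02:14:52Z
"""

print(unhealthy_machines(SAMPLE))
```

A regex is used rather than splitting on runs of spaces because "started error" is separated by a single space in the pasted output.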

Try selecting which instance to proxy to:

$ fly proxy 15432:5432 -a flxwebsites-db-staging -s
? Select instance: ewr.flxwebsites-db-staging.internal (fdaa:0:db85:a7b:95:6753:be9e:2)
Proxying local port 15432 to remote [fdaa:0:db85:a7b:95:6753:be9e:2]:5432

and then connect to it with psql:

psql: error: connection to server at "localhost" (127.0.0.1), port 15432 failed: FATAL:  password authentication failed for user "postgres"

(It failed for me because I don’t know the password.)

If that works, delete instance 21781e60b06896 and clone 7328725ec43859 to add a new replica.

I went ahead and cloned the leader and stopped the offending instance:

~$ fly m clone -a flxwebsites-db-staging  7328725ec43859
Cloning machine 7328725ec43859 into region ewr
Provisioning a new machine with image flyio/postgres:14.4...
  Machine 9185ee3c429583 has been created...
  Waiting for machine 9185ee3c429583 to start...
  Waiting for 9185ee3c429583 to become healthy (started, 3/3)
Machine has been successfully cloned!

~$ fly m stop -a flxwebsites-db-staging 21781e60b06896
Sending kill signal to machine 21781e60b06896...
21781e60b06896 has been successfully stopped

I didn’t remove it just in case, but feel free to do so with fly machine remove anytime.
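
If you want to keep this recovery step in a repeatable script, the two commands above can be generated rather than typed. The sketch below only builds and prints the same fly m clone and fly m stop invocations used in this thread; it does not run them. The function name replica_swap_plan is hypothetical, and removing the stopped machine is deliberately left as a manual step.

```python
import shlex

def replica_swap_plan(app, leader_id, unhealthy_id):
    """Build the commands used above: clone the leader, then stop the bad machine."""
    return [
        ["fly", "m", "clone", "-a", app, leader_id],
        ["fly", "m", "stop", "-a", app, unhealthy_id],
    ]

# IDs and app name are the ones from this thread.
plan = replica_swap_plan("flxwebsites-db-staging", "7328725ec43859", "21781e60b06896")
for cmd in plan:
    # Print only (dry run); pass each list to subprocess.run to execute for real.
    print(shlex.join(cmd))
```

Cloning before stopping keeps a healthy replica coming up before the broken one goes away, which matches the order used above.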

Thanks! What’s the command you used to do the health check on the individual instances?

fly checks list

:+1::+1: thanks!

I have a (regular, non-db) machine that has entered this state and refuses to be removed: Machines error blocking deploys: No responders available for request

What gives? I want it gone too as it’s blocking our use of flyctl for deploys.

I guess everyone should tweet rather than create posts here :wink: It seems super effective: https://archive.is/TFGIo

@dangra is there a reason that deploying the configuration Production - Highly available, 2x shared CPUs, 4GB RAM, 40GB disk in EWR continues to only result in one healthy instance?

I tried again this morning to create this cluster (I’m working on creating repeatable scripts for my infra, so I want to make sure this is consistent), and I still only have one instance healthy:

❯ fly checks list -a flxwebsites-db
Health Checks for flxwebsites-db
  NAME | STATUS   | MACHINE        | LAST UPDATED | OUTPUT
-------*----------*----------------*--------------*--------------------------------------------------------------------------
  pg   | passing  | 0e286961b50867 | 4m36s ago    | [✓] transactions: read/write (341.43µs)
       |          |                |              | [✓] connections: 9 used, 3 reserved, 300 max (4.39ms)
-------*----------*----------------*--------------*--------------------------------------------------------------------------
  role | passing  | 0e286961b50867 | 4m37s ago    | leader
-------*----------*----------------*--------------*--------------------------------------------------------------------------
  vm   | passing  | 0e286961b50867 | 5m3s ago     | [✓] checkDisk: 37.06 GB (94.7%) free space on /data/ (86.66µs)
       |          |                |              | [✓] checkLoad: load averages: 0.00 0.00 0.00 (181.59µs)
       |          |                |              | [✓] memory: system spent 0s of the last 60s waiting on memory (96.84µs)
       |          |                |              | [✓] cpu: system spent 0s of the last 60s waiting on cpu (82.71µs)
       |          |                |              | [✓] io: system spent 0s of the last 60s waiting on io (73.62µs)
-------*----------*----------------*--------------*--------------------------------------------------------------------------
  pg   | critical | 21781e67a03896 | 4m43s ago    | 500 Internal Server Error
       |          |                |              | failed to connect to proxy: context deadline exceeded
-------*----------*----------------*--------------*--------------------------------------------------------------------------
  role | critical | 21781e67a03896 | 4m32s ago    | 500 Internal Server Error
       |          |                |              | failed to connect to local node: context deadline exceeded
-------*----------*----------------*--------------*--------------------------------------------------------------------------
  vm   | passing  | 21781e67a03896 | 3m57s ago    | [✓] checkDisk: 37.06 GB (94.7%) free space on /data/ (67.84µs)
       |          |                |              | [✓] checkLoad: load averages: 0.00 0.00 0.00 (68.78µs)
       |          |                |              | [✓] memory: system spent 0s of the last 60s waiting on memory (79.78µs)
       |          |                |              | [✓] cpu: system spent 144ms of the last 60s waiting on cpu (31.2µs)
       |          |                |              | [✓] io: system spent 384ms of the last 60s waiting on io (22.64µs)
-------*----------*----------------*--------------*--------------------------------------------------------------------------
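
Since the goal here is repeatable infra scripts, one option is to parse this output and fail the script when any check is not passing. This is a rough Python sketch that assumes the pipe-separated layout shown above (NAME | STATUS | MACHINE | LAST UPDATED | OUTPUT); the function name is made up for illustration, and the sample is trimmed from the output in this post.

```python
def failing_checks(checks_text):
    """Map machine id -> names of checks whose status is not 'passing'."""
    failing = {}
    for line in checks_text.splitlines():
        cells = [c.strip() for c in line.split("|")]
        # Skip the title, separator rows, the header, and continuation rows
        # (continuation rows have an empty NAME cell).
        if len(cells) < 5 or cells[0] in ("", "NAME"):
            continue
        name, status, machine = cells[0], cells[1], cells[2]
        if status != "passing":
            failing.setdefault(machine, []).append(name)
    return failing

SAMPLE = """\
Health Checks for flxwebsites-db
  NAME | STATUS   | MACHINE        | LAST UPDATED | OUTPUT
-------*----------*----------------*--------------*-----------------------------
  role | passing  | 0e286961b50867 | 4m37s ago    | leader
-------*----------*----------------*--------------*-----------------------------
  pg   | critical | 21781e67a03896 | 4m43s ago    | 500 Internal Server Error
       |          |                |              | failed to connect to proxy: context deadline exceeded
-------*----------*----------------*--------------*-----------------------------
  role | critical | 21781e67a03896 | 4m32s ago    | 500 Internal Server Error
       |          |                |              | failed to connect to local node: context deadline exceeded
-------*----------*----------------*--------------*-----------------------------
  vm   | passing  | 21781e67a03896 | 3m57s ago    | [ok] checkLoad: load averages: 0.00 0.00 0.00
"""

print(failing_checks(SAMPLE))
```

An empty result would mean every listed check is passing, so a provisioning script could poll until the dictionary is empty or give up after a timeout.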

I’m pretty sure the problem here is the ewr region: I switched to iad and everything works perfectly fine. See Launching a new Postgres 1.7 app with DB results in "Error release command failed, deployment aborted" - #8 by nicksergeant

True, a misconfigured network interface on one host in the ewr region was causing issues for apps whose machines were assigned to run there. It is fixed now. Thanks.

@dangra :pray: thank you!
