Error failed to launch VM: nats: no responders available for request

nicksergeant · November 26, 2022, 1:24am

Not entirely sure what’s up; there are no issues listed on https://status.flyio.net/

❯ fly postgres create --name flxwebsites-db-staging
? Select Organization: FLX Websites (flx-websites)
? Select region: Secaucus, NJ (US) (ewr)
For pricing information visit: https://fly.io/docs/about/pricing/#postgresql-cl
? Select configuration: Production - Highly available, 2x shared CPUs, 4GB RAM, 40GB disk
Creating postgres cluster in organization flx-websites
Creating app...
Setting secrets on app flxwebsites-db-staging...
Provisioning 1 of 2 machines with image flyio/postgres:14.4
Waiting for machine to start...
Machine e148e056ae0892 is created
Provisioning 2 of 2 machines with image flyio/postgres:14.4
Error failed to launch VM: nats: no responders available for request

JP_Phillips · November 26, 2022, 1:29am

Hey @nicksergeant! We saw your tweet as well and have been looking into it. We’ll let you know when things are fixed.

JP_Phillips · November 26, 2022, 1:34am

Ok, would you mind giving it another try?

nicksergeant · November 26, 2022, 2:20am

It looks different than it did earlier, but now it’s hanging on health checks:

==> Monitoring health checks
  Waiting for 7328725ec43859 to become healthy (started, 3/3)
  Waiting for 21781e60b06896 to become healthy (started, 1/3)

The logs in the dashboard also look suspect:

2022-11-26T02:19:34.060 app[21781e60b06896] ewr [info] exporter | ERRO[0264] Error opening connection to database (postgresql://flypgadmin:PASSWORD_REMOVED@[fdaa:0:db85:a7b:94:d127:926e:2]:5433/postgres?sslmode=disable): dial tcp [fdaa:0:db85:a7b:94:d127:926e:2]:5433: connect: connection refused source="postgres_exporter.go:1658"

2022-11-26T02:19:48.063 app[21781e60b06896] ewr [info] exporter | INFO[0278] Established new database connection to "fdaa:0:db85:a7b:94:d127:926e:2:5433". source="postgres_exporter.go:970"

2022-11-26T02:19:49.064 app[21781e60b06896] ewr [info] exporter | ERRO[0279] Error opening connection to database (postgresql://flypgadmin:PASSWORD_REMOVED@[fdaa:0:db85:a7b:94:d127:926e:2]:5433/postgres?sslmode=disable): dial tcp [fdaa:0:db85:a7b:94:d127:926e:2]:5433: connect: connection refused source="postgres_exporter.go:1658"

2022-11-26T02:19:57.151 app[21781e60b06896] ewr [info] keeper | pg_basebackup: error: connection to server at "fdaa:0:db85:a7b:95:6753:be9e:2", port 5433 failed: Connection timed out

2022-11-26T02:19:57.151 app[21781e60b06896] ewr [info] keeper | Is the server running on that host and accepting TCP/IP connections?

2022-11-26T02:19:57.152 app[21781e60b06896] ewr [info] keeper | 2022-11-26T02:19:57.151Z ERROR cmd/keeper.go:1364 failed to resync from followed instance {"error": "sync error: exit status 1"}

2022-11-26T02:20:02.287 app[21781e60b06896] ewr [info] keeper | 2022-11-26T02:20:02.286Z ERROR cmd/keeper.go:1109 db failed to initialize or resync

nicksergeant · November 26, 2022, 2:21am

Actually looks like it just went through.

JP_Phillips · November 26, 2022, 2:22am

Nice! It can take some time for all of the health checks to be passing.

nicksergeant · November 26, 2022, 2:23am

Err, yeah, I think something’s still borked:

Proxying local port 15432 to remote [flxwebsites-db-staging.internal]:5432
psql: error: connection to server at "localhost" (127.0.0.1), port 15432 failed: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
make: *** [connect-db-local] Error 2

nicksergeant · November 26, 2022, 2:27am

The connection error is also consistent with an issue I was seeing earlier when attempting to create a pg instance on the Development configuration. I was getting a pooling connection failure from a service that was attempting to connect (via another Fly app).

fly postgres connect -a flxwebsites-db-staging works fine, but if I run flyctl proxy 15432:5432 -a flxwebsites-db-staging and then psql postgres://postgres:<password>@localhost:15432, I get:

psql: error: connection to server at "localhost" (127.0.0.1), port 15432 failed: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

dangra · November 26, 2022, 2:36am

Hi Nick, what PostgreSQL version are you running locally? That is, what is the output of psql --version?

nicksergeant · November 26, 2022, 2:38am

14.5

dangra · November 26, 2022, 2:48am

One of the nodes is still unhealthy:

ID              STATE   ROLE    REGION  HEALTH CHECKS                   IMAGE                           CREATED                 UPDATED  
21781e60b06896  started error   ewr     3 total, 1 passing, 2 critical  flyio/postgres:14.4 (v0.0.32)   2022-11-26T02:14:56Z    2022-11-26T02:15:09Z
7328725ec43859  started leader  ewr     3 total, 3 passing              flyio/postgres:14.4 (v0.0.32)   2022-11-26T02:14:37Z    2022-11-26T02:14:52Z

Try selecting which instance to proxy to:

$ fly proxy 15432:5432 -a flxwebsites-db-staging -s
? Select instance: ewr.flxwebsites-db-staging.internal (fdaa:0:db85:a7b:95:6753:be9e:2)
Proxying local port 15432 to remote [fdaa:0:db85:a7b:95:6753:be9e:2]:5432

and then connect to it with psql:

psql: error: connection to server at "localhost" (127.0.0.1), port 15432 failed: FATAL:  password authentication failed for user "postgres"

(It failed for me because I don’t know the password.)

If that works, delete instance 21781e60b06896 and clone 7328725ec43859 to add a new replica.
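
For anyone scripting this triage, the machine listing above can be filtered by its ROLE column. This is only a minimal sketch: it assumes the listing is whitespace-separated with the machine ID in the first column and the role in the third, as in the output shown here, and machines_with_role is a hypothetical helper name, not a flyctl feature.

```shell
# machines_with_role ROLE
# Reads a machine listing (ID STATE ROLE REGION ...) on stdin and prints
# the IDs of machines whose ROLE column equals the given role.
# Hypothetical helper; assumes the column layout shown in this thread.
machines_with_role() {
  awk -v role="$1" '$1 != "ID" && $3 == role { print $1 }'
}
```

Fed the listing above, machines_with_role error would print 21781e60b06896 and machines_with_role leader would print 7328725ec43859, which are the machine to replace and the machine to clone.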

dangra · November 26, 2022, 3:04am

I went ahead and cloned the leader and stopped the offending instance:

~$ fly m clone -a flxwebsites-db-staging  7328725ec43859
Cloning machine 7328725ec43859 into region ewr
Provisioning a new machine with image flyio/postgres:14.4...
  Machine 9185ee3c429583 has been created...
  Waiting for machine 9185ee3c429583 to start...
  Waiting for 9185ee3c429583 to become healthy (started, 3/3)
Machine has been successfully cloned!

~$ fly m stop -a flxwebsites-db-staging 21781e60b06896
Sending kill signal to machine 21781e60b06896...
21781e60b06896 has been successfully stopped

I didn’t remove it just in case, but feel free to do so with fly machine remove anytime.
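
The two steps above could be wrapped for reuse roughly as follows. This is a sketch under assumptions: replace_replica and the FLY override variable are hypothetical names introduced here for illustration, and only the fly m clone and fly m stop invocations are taken from the session above.

```shell
# replace_replica APP LEADER_ID BAD_ID
# Clones the leader to create a fresh replica, then stops the unhealthy
# machine. Set FLY=echo to print the commands instead of running them.
# Hypothetical wrapper; only the fly subcommands come from this thread.
replace_replica() {
  app=$1 leader=$2 bad=$3
  fly_bin=${FLY:-fly}
  # Stop here if the clone fails, so the old machine is left untouched.
  "$fly_bin" m clone -a "$app" "$leader" || return 1
  "$fly_bin" m stop -a "$app" "$bad"
}
```

Running it once with FLY=echo shows the exact commands before anything is changed. The stopped machine is deliberately not removed, so that step stays manual.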

nicksergeant · November 26, 2022, 3:10am

Thanks! What’s the command you used to do the health check on the individual instances?

dangra · November 26, 2022, 3:14am

fly checks list

nicksergeant · November 26, 2022, 3:15am

thanks!

ignoramous · November 26, 2022, 8:43am

A (regular, non-DB) machine of mine has entered this state and refuses to be removed: Machines error blocking deploys: No responders available for request

What gives? I want it gone too as it’s blocking our use of flyctl for deploys.

I guess everyone should tweet rather than create posts here; that seems super effective: https://archive.is/TFGIo

nicksergeant · November 26, 2022, 9:57pm

@dangra is there a reason that deploying the configuration Production - Highly available, 2x shared CPUs, 4GB RAM, 40GB disk in EWR continues to only result in one healthy instance?

I tried again this morning to create this cluster (I’m working on creating repeatable scripts for my infra, so I want to make sure this is consistent), and I still only have one instance healthy:

❯ fly checks list -a flxwebsites-db
Health Checks for flxwebsites-db
  NAME | STATUS   | MACHINE        | LAST UPDATED | OUTPUT
-------*----------*----------------*--------------*--------------------------------------------------------------------------
  pg   | passing  | 0e286961b50867 | 4m36s ago    | [✓] transactions: read/write (341.43µs)
       |          |                |              | [✓] connections: 9 used, 3 reserved, 300 max (4.39ms)
-------*----------*----------------*--------------*--------------------------------------------------------------------------
  role | passing  | 0e286961b50867 | 4m37s ago    | leader
-------*----------*----------------*--------------*--------------------------------------------------------------------------
  vm   | passing  | 0e286961b50867 | 5m3s ago     | [✓] checkDisk: 37.06 GB (94.7%) free space on /data/ (86.66µs)
       |          |                |              | [✓] checkLoad: load averages: 0.00 0.00 0.00 (181.59µs)
       |          |                |              | [✓] memory: system spent 0s of the last 60s waiting on memory (96.84µs)
       |          |                |              | [✓] cpu: system spent 0s of the last 60s waiting on cpu (82.71µs)
       |          |                |              | [✓] io: system spent 0s of the last 60s waiting on io (73.62µs)
-------*----------*----------------*--------------*--------------------------------------------------------------------------
  pg   | critical | 21781e67a03896 | 4m43s ago    | 500 Internal Server Error
       |          |                |              | failed to connect to proxy: context deadline exceeded
-------*----------*----------------*--------------*--------------------------------------------------------------------------
  role | critical | 21781e67a03896 | 4m32s ago    | 500 Internal Server Error
       |          |                |              | failed to connect to local node: context deadline exceeded
-------*----------*----------------*--------------*--------------------------------------------------------------------------
  vm   | passing  | 21781e67a03896 | 3m57s ago    | [✓] checkDisk: 37.06 GB (94.7%) free space on /data/ (67.84µs)
       |          |                |              | [✓] checkLoad: load averages: 0.00 0.00 0.00 (68.78µs)
       |          |                |              | [✓] memory: system spent 0s of the last 60s waiting on memory (79.78µs)
       |          |                |              | [✓] cpu: system spent 144ms of the last 60s waiting on cpu (31.2µs)
       |          |                |              | [✓] io: system spent 384ms of the last 60s waiting on io (22.64µs)
-------*----------*----------------*--------------*--------------------------------------------------------------------------
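
For repeatable scripts, output like the above can be reduced to the machines that have at least one non-passing check. This is a minimal sketch: it assumes the pipe-delimited layout shown here (NAME, STATUS, MACHINE, ...), treats any status other than passing as failing, and failing_machines is a hypothetical helper name.

```shell
# failing_machines
# Reads `fly checks list` output on stdin and prints each machine ID that
# has at least one check whose STATUS is not "passing" (one ID per line).
# Hypothetical helper; assumes the table layout shown in this thread.
failing_machines() {
  awk -F'|' '
    NF >= 5 {
      status = $2; machine = $3
      gsub(/[[:space:]]/, "", status)
      gsub(/[[:space:]]/, "", machine)
      # Skip the header row and continuation rows (empty MACHINE column).
      if (machine != "" && machine != "MACHINE" &&
          status != "passing" && !seen[machine]++)
        print machine
    }
  '
}
```

With the output above piped in, for example via fly checks list -a flxwebsites-db | failing_machines, it would print 21781e67a03896. Empty output means every listed check is passing.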

nicksergeant · November 27, 2022, 9:04pm

I’m pretty sure the problem here is the ewr region. I switched to iad and everything works perfectly fine. See: Launching a new Postgres 1.7 app with DB results in "Error release command failed, deployment aborted" - #8 by nicksergeant

dangra · November 28, 2022, 8:09pm

True: a misconfigured network interface on one host in the ewr region was causing issues for apps whose machines were assigned to run there. It is fixed now. Thanks.

nicksergeant · November 28, 2022, 9:09pm

@dangra thank you!
