Hi there, we have 2 clustered Phoenix application servers (app name is beacon).
About an hour ago we started seeing errors because one of them appears to have fallen over (instance ID a512e995-caf8-f6ba-3700-92927a1cf1d8). For users, this means roughly every other request returns a 500.
Scaling the VM count up or down for this application doesn’t seem to have any effect, and we can’t SSH into the instance with fly ssh console -a beacon to restart it manually. All of our app and DB instances are in the dfw region.
According to https://status.flyio.net/, there aren’t any reported incidents, but it seems like something outside our control is happening. Do you have any advice?
Some of our other error logs:
[libcluster:fly6pn] unable to connect to :"beacon@fdaa:0:4b40:a7b:12de:0:c08b:2"
Postgrex.Protocol (#PID<0.3496.0>) failed to connect: ** (DBConnection.ConnectionError) tcp connect (top2.nearest.of.beacon-db.internal:5432): timeout
Update: We tried to restart the VM with fly vm restart a512e995 -a beacon but got:
Error failed to restart allocation: You hit a Fly API error with request ID: 01GFBVRYVDQC2T73EAX5VYK2QE-dfw
Update 2: We tried to stop it with fly vm stop a512e995 -a beacon and can see that it’s lost:
ID            = a512e995
Process       = app
Version       = 518
Region        = dfw
Desired       = stop
Status        = lost
Health Checks = 1 total, 1 passing
Restarts      = 0
Created       = 19h56m ago

Events
TIMESTAMP              TYPE        MESSAGE
2022-10-13T22:27:32Z   Received    Task received by client
2022-10-13T22:28:00Z   Task Setup  Building Task Directory
2022-10-13T22:28:03Z   Started     Task started by client

Checks
ID                                SERVICE   STATE    OUTPUT
3df2415693844068640885b45074b954  tcp-8080  passing  TCP connect 172.19.9.130:8080: Success
Update 3: We tried to delete the volume this lost VM was attached to, but got an error:
$ fly volumes delete vol_ke628r63261rwmnp
Update available 0.0.399 -> v0.0.413.
Run "fly version update" to upgrade.
Deleting a volume is not reversible.
? Are you sure you want to delete this volume? Yes
Error failed deleting volume: upstream service is unavailable