Zombie/orphaned application server? Network problems?

Hi there, we have 2 clustered Phoenix application servers (app name is beacon).

About an hour ago we began to see errors because one of them seemed to fall over (instance ID a512e995-caf8-f6ba-3700-92927a1cf1d8). What this means for users is that about every other request results in a 500.

Our attempts to scale the VM count up or down for this application don’t seem to have any effect, and we cannot seem to ssh into the instance with fly ssh console -a beacon to try to restart it manually. All of our app/DB instances are in the dfw region.

According to https://status.flyio.net/ there aren’t any issues, but it seems like something outside our control is happening. Do you have any advice?

Some of our other error logs:

  • [libcluster:fly6pn] unable to connect to :"beacon@fdaa:0:4b40:a7b:12de:0:c08b:2"
  • Postgrex.Protocol (#PID<0.3496.0>) failed to connect: ** (DBConnection.ConnectionError) tcp connect (top2.nearest.of.beacon-db.internal:5432): timeout

Update: We attempted to restart the VM with fly vm restart a512e995 -a beacon but got:
Error failed to restart allocation: You hit a Fly API error with request ID: 01GFBVRYVDQC2T73EAX5VYK2QE-dfw

Update 2: We tried to stop it with fly vm stop a512e995 -a beacon, and can see that it’s lost:

  ID            = a512e995            
  Process       = app                 
  Version       = 518                 
  Region        = dfw                 
  Desired       = stop                
  Status        = lost                
  Health Checks = 1 total, 1 passing  
  Restarts      = 0                   
  Created       = 19h56m ago          

Events
TIMESTAMP           	TYPE      	MESSAGE                 
2022-10-13T22:27:32Z	Received  	Task received by client	
2022-10-13T22:28:00Z	Task Setup	Building Task Directory	
2022-10-13T22:28:03Z	Started   	Task started by client 	

Checks
ID                              	SERVICE 	STATE  	OUTPUT                                 
3df2415693844068640885b45074b954	tcp-8080	passing	TCP connect 172.19.9.130:8080: Success	

Update 3: We attempted to delete the volume this lost VM was originally attached to, but got an error:

$ fly volumes delete vol_ke628r63261rwmnp          
Update available 0.0.399 -> v0.0.413.
Run "fly version update" to upgrade.
Deleting a volume is not reversible.
? Are you sure you want to delete this volume? Yes
Error failed deleting volume: upstream service is unavailable

Hi @jmill,

There was a networking-related failure affecting a single server in dfw, and we’re currently working on recovery. Most apps were automatically restarted on other hosts, but since your app instance had an attached volume created on this particular host, it will be unavailable (status lost) until the host is recovered. You should be able to create a new volume and scale new instances, but the existing VM might be in limbo until the host comes back online.

Ok, thank you for the update!

We were originally using volumes to “pin” our app instances to the DFW region. If we deploy a change to remove the volumes and use flyctl regions to do the same thing, would the fly proxy skip this lost instance when routing requests?

The server is back online now, so the VM should be responsive again. Let me know if you’re still having any issues.

To answer your question, in most cases the proxy routes around VMs that have lost connectivity or are failing health checks, so it should skip any ‘lost’ instances when routing requests. (There’s a small possibility that a running VM could become unreachable from the scheduler but somehow still be reachable by the proxy. I’m not sure what would happen in a rare edge case like that.)
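
For anyone trying to picture the "routes around" behaviour, here is a toy sketch of the idea in Python. It is not how the Fly proxy is actually implemented; the Instance fields, the round-robin policy, and the second instance ID are all assumptions made up for illustration. It only shows a router filtering out instances that are not running or are failing checks before picking a target.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class Instance:
    id: str
    status: str           # e.g. "running" or "lost" (illustrative values)
    checks_passing: bool  # result of the most recent health check


def routable(instances):
    """Keep only instances that are running and passing health checks."""
    return [i for i in instances if i.status == "running" and i.checks_passing]


def route(instances, n_requests):
    """Round-robin n_requests over the routable instances; return chosen IDs."""
    pool = routable(instances)
    if not pool:
        raise RuntimeError("no healthy instances to route to")
    picker = cycle(pool)
    return [next(picker).id for _ in range(n_requests)]


instances = [
    # The lost VM from this thread; its status output still listed a passing check.
    Instance("a512e995", status="lost", checks_passing=True),
    # Hypothetical ID standing in for the second, healthy app server.
    Instance("hypothetical-2", status="running", checks_passing=True),
]
print(route(instances, 4))
```

With the filter in place every request lands on the healthy instance. Without it, a plain round-robin over both instances would send roughly every other request to the lost VM, which matches the pattern of 500s described at the top of the thread.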
