Persistent "unhealthy allocations" and "error registering 6pn service" errors on deploy or secret set

When either (re)deploying or setting a new secret value for one of my apps, I’ve been consistently getting an unhealthy allocations failure like this one:

$ flyctl secrets set AUTH_TOKENS=REDACTED
Release v35 created
Monitoring Deployment

2 desired, 1 placed, 0 healthy, 1 unhealthy
v35 failed - Failed due to unhealthy allocations - not rolling back to stable job version 35 as current job has same specification
***v35 failed - Failed due to unhealthy allocations - not rolling back to stable job version 35 as current job has same specification and deploying as v36

This happens whether I deploy a new commit or the same commit that has been running without issue for many days, so I don’t think this is actually an issue with health checks. I also cannot see any problems in the app’s logs.

If I look at the flyctl vm status output for one of the failed instances, I see a Driver Failure event with the error rpc error: code = Unknown desc = error registering 6pn service Put "https://127.0.0.1:8501/v1/agent/service/register": unexpected EOF, like so:

$ flyctl vm status 9e7275c6
Instance
  ID            = 9e7275c6   
  Task          =            
  Version       = 35         
  Region        = ewr        
  Desired       = stop       
  Status        = failed     
  Health Checks =            
  Restarts      = 0          
  Created       = 8m59s ago  

Recent Events
TIMESTAMP            TYPE            MESSAGE                                                                                                                               
2021-08-29T01:35:40Z Received        Task received by client                                                                                                               
2021-08-29T01:35:40Z Task Setup      Building Task Directory                                                                                                               
2021-08-29T01:35:41Z Driver Failure  rpc error: code = Unknown desc = error registering 6pn service Put "https://127.0.0.1:8501/v1/agent/service/register": unexpected EOF 
2021-08-29T01:35:41Z Not Restarting  Error was unrecoverable                                                                                                               
2021-08-29T01:35:41Z Alloc Unhealthy Unhealthy because of failed task                                                                                                      
2021-08-29T01:35:42Z Killing         Sent interrupt. Waiting 5s before force killing                                                                                       

Checks
ID SERVICE STATE OUTPUT 

Recent Logs
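
As a rough sketch (not a flyctl feature), the Recent Events table above could be scanned for driver failures with a few lines of Python. It assumes the fixed-width layout shown in the output, with column positions taken from the header row. The helper names parse_events and first_driver_failure are hypothetical, and SAMPLE just repeats a few rows from the output above.

```python
# Rows copied from the "Recent Events" table above (trailing padding removed).
SAMPLE = """\
TIMESTAMP            TYPE            MESSAGE
2021-08-29T01:35:40Z Received        Task received by client
2021-08-29T01:35:41Z Driver Failure  rpc error: code = Unknown desc = error registering 6pn service Put "https://127.0.0.1:8501/v1/agent/service/register": unexpected EOF
2021-08-29T01:35:41Z Not Restarting  Error was unrecoverable
2021-08-29T01:35:41Z Alloc Unhealthy Unhealthy because of failed task
"""


def parse_events(table):
    """Split a fixed-width events table into dicts, using the header for column offsets."""
    lines = [ln for ln in table.splitlines() if ln.strip()]
    header = lines[0]
    # Assumption: the header labels mark where each column starts.
    type_col = header.index("TYPE")
    msg_col = header.index("MESSAGE")
    events = []
    for ln in lines[1:]:
        events.append({
            "timestamp": ln[:type_col].strip(),
            "type": ln[type_col:msg_col].strip(),
            "message": ln[msg_col:].strip(),
        })
    return events


def first_driver_failure(events):
    """Return the first event whose type is 'Driver Failure', or None."""
    for ev in events:
        if ev["type"] == "Driver Failure":
            return ev
    return None


failure = first_driver_failure(parse_events(SAMPLE))
if failure:
    print(failure["timestamp"], failure["message"])
```

Slicing by column offset rather than splitting on whitespace matters here because some event types ("Alloc Unhealthy") are separated from the message by only a single space.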

FWIW, I tried flyctl vm stop on each of the healthy instances I had running when I ran into this issue tonight, and new instances were able to start successfully and stay healthy.

The app in question, if it helps, is urlresolverapi-production.

I’m out of ideas at the moment, and hoping I haven’t just done something silly on my side here! I’m happy to provide any more context that would be useful in tracking this down.

Thanks,
Will.

A bit more oddness:

I’m now seeing an unexplained spike in 404 responses from my app and wondering whether this is related to my ongoing deploy issues:

This is a deviation from my app’s normal traffic patterns over the last couple of months, and looks like it started around 11am Eastern today:

I’d chalk it up to bots or something, but I do not see evidence of these 404 responses in my app’s logs or other instrumentation, so I’m wondering if this is somehow related to my other issues.
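
One way to double-check this is to tally status codes from the app’s own logs and compare the totals against the metrics view. The sketch below assumes each request produces one log line containing a status=NNN field; that format is hypothetical and not necessarily what this app emits, and tally_statuses is an illustrative helper name.

```python
import re
from collections import Counter

# Assumption: log lines carry the HTTP status as "status=NNN".
STATUS = re.compile(r"\bstatus=(\d{3})\b")


def tally_statuses(log_lines):
    """Count HTTP status codes found in an iterable of log lines."""
    counts = Counter()
    for line in log_lines:
        match = STATUS.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Illustrative lines only; real log content will differ.
example = [
    "GET /resolve status=200",
    "GET /missing status=404",
    "startup complete",
    "GET /resolve status=200",
]
print(tally_statuses(example))
```

If the tally shows no 404s for the period where the metrics view does, the responses were most likely not generated by the app itself.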

We’re looking into the first issue. It seems like there’s a problem communicating with our state store. Looks like this is specifically happening in EWR.

Concerning the 404 status codes: our proxy does not return a 404 error for any reason. The only status codes it might return are 502 or 503, and you’d see something in the logs with a brief error message and a code.

The EWR issue was transient and may be fixed now. We’re working on overhauling this bit, and it should get much better soon.

Thanks, @jerome, it does seem to have recovered yesterday. The new “issue” I’m seeing is that now the two instances of my app are persistently placed in separate regions, though I suppose this probably warrants a separate thread!

Concerning the 404 status codes: our proxy does not return a 404 error for any reason. The only status codes it might return are 502 or 503, and you’d see something in the logs with a brief error message and a code.

This continues to mystify me. The only place I can see any evidence of those 404 status codes is in fly.io’s own metrics view, and they only appeared while the ewr deployment was struggling (the gaps here are from when I failed over to a secondary set of instances deployed as a separate fly app in a different region):

Anyway, thanks for your help sorting this out!