Any issue with the LHR region at the moment?

I have a few apps. The ones only in LHR don’t seem to be responding.

I have one in LHR and IAD (2 instances). All requests seem to be handled by IAD.
I have one in LHR and CDG (2 instances). All requests seem to be handled by CDG.

e.g. looking at the logs, this stands out:

2022-01-05T22:07:44.838 proxy[13e0beb0] lhr [error] Error 2: Internal problem

… and since then, the app (a one-instance app, in LHR) seems dead. Logs stopped about 90 minutes ago. Requests now return a 525 error, because Cloudflare is in front and can’t contact the origin, so it returns a 525 (the SSL part of that error is misleading).
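
For anyone hitting something similar: one quick way to confirm it’s the app rather than Cloudflare is to hit the app’s fly.dev hostname directly, which bypasses Cloudflare entirely (NAME is a placeholder):

curl -sv --connect-timeout 10 -o /dev/null https://NAME.fly.dev/

If that hangs or fails while the Cloudflare route returns a 525, the origin itself is down.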

No volumes are attached to any app, and no databases. They are simple NodeJS apps.

I looked at the metrics and they all look fine: minimal load, no spikes.

I looked at the activity tab, and every app seems to return a 500 error. But that happens even with the apps that are working, so I assume it’s not connected:

[Screenshot of the activity tab showing 500 errors]

Hmm.

Yeah, LHR appears to be the common factor in the errors. I haven’t deployed for days and there have been no code changes, so nothing on my end should have broken them.

fly --app NAME status

The two-instance app:

Instances
ID      	PROCESS	VERSION	REGION	DESIRED	STATUS 	HEALTH CHECKS      	RESTARTS	CREATED              
2e15c859	app    	218    	lhr   	run    	pending	1 total, 1 critical	0       	2022-01-01T21:20:23Z	
5328b10e	app    	218    	cdg(B)	run    	running	1 total, 1 passing 	0       	2022-01-01T21:20:23Z

Its sibling one-instance app:

Instances
ID      	PROCESS	VERSION	REGION	DESIRED	STATUS 	HEALTH CHECKS      	RESTARTS	CREATED              
13e0beb0	app    	395    	lhr   	run    	running	1 total, 1 critical	0       	2022-01-01T21:06:38Z

It’s not clear why the health check is failing. The app just returns a simple HTTP response, and I’ve not changed anything. Maybe it’s the host it’s running on? Weird.

Ah, yes, there was an issue.

e.g. I saw this line, and I’ve not changed anything for days at my end:

2022-01-05T22:09:26.515 proxy[13e0beb0] lhr [error]Health check status changed 'passing' => 'critical'

So I restarted it manually to see what would happen, and sure enough a new log line appeared; it resumed and is now working again.

...
2022-01-05T22:10:00.989 app[13e0beb0] lhr [info]POST /TEST 200 96 - 23.033 ms
2022-01-05T23:41:58.328 runner[13e0beb0] lhr [info]Shutting down virtual machine
...

So it was down for 90 minutes or so, based on that.
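
For reference, I believe a manual restart can be done either app-wide or per VM with flyctl, something like this (NAME is a placeholder):

fly restart NAME          # restart all of an app's VM instances
fly vm restart 13e0beb0   # restart a single instance by its ID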

If I did have a backup region set, would it have been moved to another region based on the failure? Or what would trigger that? (I removed the backup region.)
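
For context, backup regions are managed with the flyctl regions commands, something like this (NAME is a placeholder):

fly regions backup cdg --app NAME   # set CDG as a backup region
fly regions list --app NAME         # show the region pool and backup regions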

We’re doing emergency maintenance on a couple of hosts in London. Are you using volumes on these apps? If you’re not using volumes, your VMs should get rescheduled. The health check failures probably aren’t caused by this, though.

If you run fly vm status 13e0beb0 on the one-instance app, what does it say about health checks?

Ah … that could be it then.

Not using any volumes, on any apps.
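
For anyone following along, that can be confirmed per app with something like this (NAME is a placeholder):

fly volumes list --app NAME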

The output for that command is interesting. It does point to an issue: a “Failed Restoring Task / failed to restore task” event at that time.

I did a manual restart 90 minutes later and that worked fine, without changing any code. The health checks passed and are fine now, which suggests it was the host.

Instance
  ID            = 13e0beb0              
  Process       =                       
  Version       = 395                   
  Region        = lhr                   
  Desired       = run                   
  Status        = running               
  Health Checks = 1 total, 1 passing    
  Restarts      = 1                     
  Created       = 2022-01-01T21:06:38Z  

Recent Events
TIMESTAMP            TYPE                  MESSAGE                                                        
2022-01-01T21:06:31Z Received              Task received by client                                        
2022-01-01T21:06:31Z Task Setup            Building Task Directory                                        
2022-01-01T21:06:35Z Started               Task started by client                                         
2022-01-05T22:07:38Z Failed Restoring Task failed to restore task; will not run until server is contacted 
2022-01-05T22:08:20Z Started               Task started by client                                         
2022-01-05T23:41:58Z Restart Signaled      User requested restart                                         
2022-01-05T23:41:59Z Terminated            Exit Code: 0                                                   
2022-01-05T23:41:59Z Restarting            Task restarting in 0s                                          
2022-01-05T23:42:07Z Started               Task started by client                                         

Checks
ID                               SERVICE  STATE   OUTPUT                                
50f1daa51d97c995477d21c1927c1a24 tcp-8080 passing TCP connect 172.19.0.34:8080: Success 

Oh, that’s a problem: those instances should have been rescheduled. It looks like it just didn’t reschedule them.

We’ll look into this, it should not have done that.

Ah …

That explains it then. No problem. Thanks for investigating.

No problem! Thank you for letting us know; I’m not sure we would have noticed this otherwise. It’s a pretty quick fix on our end: we just need to ensure that VMs are rescheduled before we take host hardware down.
