Repeated deploy failures - context deadline exceeded

I currently deploy my app to Fly using GitHub Actions. I’ve been doing this for months with no problems, but today, all of a sudden, I’m getting repeated failed deploys due to unhealthy allocations.

My app is in 3 different regions, but it’s only the Singapore instance that has a critical health check.
Running flyctl vm status <VM_ID> gives me the following:

Checks
ID                                      SERVICE         STATE           OUTPUT
d6dd6a7392c47a522d5161aff2bffadd        tcp-8080        passing         TCP connect 172.19.11.98:8080: Success
a2e7afa2d12c0201d50537ca27e5bf21        tcp-8080        critical        Get "http://172.19.11.98:8080/healthcheck": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

But then when I look at the logs this health check appears to be passing:

  2022-12-03T20:39:21Z   [info]GET /healthcheck - - - - ms
  2022-12-03T20:39:21Z   [info]HEAD / 200 - - 2056.722 ms
  2022-12-03T20:39:33Z   [info]GET /healthcheck - - - - ms
  2022-12-03T20:39:33Z   [info]HEAD / 200 - - 2037.984 ms
  2022-12-03T20:39:45Z   [info]GET /healthcheck - - - - ms
  2022-12-03T20:39:45Z   [info]HEAD / 200 - - 2034.704 ms
  2022-12-03T20:39:57Z   [info]GET /healthcheck - - - - ms
  2022-12-03T20:39:57Z   [info]HEAD / 200 - - 2045.498 ms
  2022-12-03T20:40:09Z   [info]GET /healthcheck - - - - ms
  2022-12-03T20:40:09Z   [info]HEAD / 200 - - 2032.098 ms
  2022-12-03T20:40:21Z   [info]GET /healthcheck - - - - ms
  2022-12-03T20:40:21Z   [info]HEAD / 200 - - 2035.128 ms
  2022-12-03T20:40:33Z   [info]GET /healthcheck - - - - ms
  2022-12-03T20:40:33Z   [info]HEAD / 200 - - 2040.146 ms
  2022-12-03T20:40:45Z   [info]GET /healthcheck - - - - ms
  2022-12-03T20:40:45Z   [info]HEAD / 200 - - 2037.384 ms
  2022-12-03T20:40:57Z   [info]GET /healthcheck - - - - ms
  2022-12-03T20:40:57Z   [info]HEAD / 200 - - 2045.686 ms

I’ve tried restarting the VM and increasing the grace period. I’m struggling to get to the bottom of this one, so any help would be appreciated. Thanks in advance.

Hi @Fulfilled, this is totally a guess, but is a response time of more than 2 seconds allowed for your health check?
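One way to test this guess against the log excerpt in the question is to pull out the logged status and duration of each request and compare the duration with the check timeout. The following is a minimal sketch, assuming the log line format shown above and a 2-second timeout. The real timeout for this app's check is not stated in the thread, and `TIMEOUT_MS`, `parse_line`, and `classify` are names made up for illustration.

```python
import re

# Assumed timeout for illustration only; the real value lives in the app's check config.
TIMEOUT_MS = 2000.0

# Matches request lines such as:
#   [info]HEAD / 200 - - 2056.722 ms
#   [info]GET /healthcheck - - - - ms
LINE_RE = re.compile(
    r"\[info\](?P<method>[A-Z]+) (?P<path>\S+) (?P<status>\d{3}|-) .*?(?P<ms>[\d.]+|-) ms$"
)


def parse_line(line):
    """Return (method, path, status, duration_ms), or None if the line does not match.

    status and duration_ms are None when the log shows '-' in that position.
    """
    match = LINE_RE.search(line.strip())
    if match is None:
        return None
    status = None if match["status"] == "-" else int(match["status"])
    duration = None if match["ms"] == "-" else float(match["ms"])
    return match["method"], match["path"], status, duration


def classify(line, timeout_ms=TIMEOUT_MS):
    """Label a log line relative to the assumed timeout."""
    parsed = parse_line(line)
    if parsed is None:
        return "unparsed"
    _, _, status, duration = parsed
    if status is None or duration is None:
        return "no response logged"
    return "over timeout" if duration > timeout_ms else "within timeout"


if __name__ == "__main__":
    # Two lines copied from the log excerpt in the question.
    sample = [
        "  2022-12-03T20:39:21Z   [info]GET /healthcheck - - - - ms",
        "  2022-12-03T20:39:21Z   [info]HEAD / 200 - - 2056.722 ms",
    ]
    for entry in sample:
        print(classify(entry), "|", entry.strip())
```

On those two lines, the `HEAD /` request is labelled as over the assumed 2-second timeout, while the `GET /healthcheck` line has neither a status code nor a duration. That may mean the response had not completed when the line was written, so the log alone may not show the check passing.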

What kind of app is it? Where do you see the context deadline exceeded?

Thanks for your response, @mwills. I’ve set the grace period to 30 just to rule that out.

It’s a blog. I see the context deadline exceeded when I run flyctl checks list, and it causes my GitHub Action to fail because it’s unable to deploy. I have 3 instances in total. The ones in lhr and den are fine; it’s the sin one that is causing issues for some reason.

Are you able to see the logs from the corresponding health checks on that specific VM around that time? You might be able to get some more history by looking at that instance directly.
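Alongside the logs, it can help to time the endpoint yourself with a client-side timeout, which is roughly what the failing check reports ("Client.Timeout exceeded while awaiting headers"). This is a hedged sketch and not how Fly runs its checks: `probe` is a hypothetical helper, the 2-second default is an assumption, and the URL has to be one that is reachable from wherever the script runs.

```python
import sys
import time
import urllib.error
import urllib.request


def probe(url, timeout_s=2.0):
    """Send one GET to url and return (outcome, elapsed_seconds).

    outcome is "HTTP <code>" when a response arrives, or "failed: <reason>"
    when the request times out or the connection cannot be made.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            outcome = f"HTTP {response.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx/5xx status.
        outcome = f"HTTP {exc.code}"
    except OSError as exc:
        # Covers URLError, socket timeouts, and refused connections.
        outcome = f"failed: {exc}"
    return outcome, time.monotonic() - start


if __name__ == "__main__":
    # Usage: python probe.py <url-of-the-healthcheck-endpoint>
    target = sys.argv[1]
    for _ in range(5):
        result, seconds = probe(target)
        print(f"{result} after {seconds:.3f}s")
        time.sleep(1)
```

Running it a few times against the healthy regions and the failing one would show whether the slow responses are specific to the sin instance.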

Which blog app is it? There are quite a few. Are you using anything like Prometheus?