Hi everybody, we are hosting a couple of Shopify apps on Fly.io.
Due to the nature of the apps, there's no need for a machine to be always up (the app just lets you tweak some settings, and once that's done the user might open it once a week or so).
So when we ran into 502 errors caused by the machine not waking up fast enough, we decided to try something other than keeping a single machine always up: we implemented a health check with a route that returns 200, a grace period of 20s, and a retry of 10s.
This seems to work: we got no more 502 errors, but after 48 hours we noticed that the app sometimes returns a generic 500.
I'd like to ask what the best practices are here. Maybe the grace period is too long?
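
For reference, here is a minimal sketch of the kind of health route described above, in case it helps to discuss the setup. It is not our actual code: the `/healthz` path, port 8080, and the Python standard-library server are all assumptions made for illustration. The route returns 200 without touching any other part of the app.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

HEALTH_PATH = "/healthz"  # hypothetical path, chosen only for this example


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == HEALTH_PATH:
            # Answer immediately with 200 and a tiny body; no database
            # or external calls are made on this route.
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        # Silence per-request logging so health-check polling
        # does not flood the logs.
        pass


def make_server(port=8080):
    # Port 8080 is an assumed default; bind on all interfaces so a
    # proxy in front of the process can reach it.
    return ThreadingHTTPServer(("0.0.0.0", port), HealthHandler)


if __name__ == "__main__":
    make_server().serve_forever()
```

A route like this only confirms that the process is accepting HTTP requests, so an intermittent 500 coming from another route would not show up in it.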