fly.io site is currently inaccessible...

I’m now thinking that the problems we’ve experienced aren’t related the downtime, but that CPU quotas have been turned on:

We have an HTTP cache stored in a volume, and run a cleanup process on startup to prune old entries (cacache’s verify). This was known to be a bit intensive, but only lasted for a minute or two.

Hidden in that thread is:

So I think that on the deploy the CPU quota balance is reset to 0, the intensive process started and was immediately throttled (which caused the HTTP server running on the instance to crawl to a halt). Once the throttled task eventually completed, the throttling was lifted allow the HTTP server to run as expected.

I’ve not been able to recreate this with a machine restart as the balance is kept (it consumes some of it, but doesn’t get near 0). I’ll have to confirm by turning off the intensive task (or at least delay it starting), and see how that deploy goes…

Edit: testing with a 10 minute delay

Edit 2: the deploy happened, the CPU quota balance dropped but the HTTP server handles requests fine. The balance is now replenishing:

Edit 3: the intensive task ran successfully with enough balance in place, though there was some ‘stealing’: