Do they not have any staff in the GMT/CET timezone anymore?
It’s been hours, we’ve sent emails, and we’re still having the same issues: constantly trying to spin up new machines, waiting until they slowly die, spinning up new ones…
Same situation here: one app that’s down because I deployed during the incident, and one app that’s working because I left it alone during the incident. I’m probably going to hold off on attempting any deploys to that working app for a while!
Going to host on a new platform in the meantime, as this is pretty unreliable; there isn’t even an ETA for a fix here.
It shouldn’t be too hard to keep the status page properly updated when you’re serving this many customers.
We’ve just deployed to an app that was working, and it is now broken (it’s running, but the health check fails). I think the problem is its connection to Upstash Redis.
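For anyone seeing the same symptom, here’s a minimal sketch of the kind of health check that would fail when the Redis connection is broken (assuming ioredis and an Express server; the endpoint path and the `REDIS_URL` variable are just illustrative, not anything Fly-specific):

```ts
import express from "express";
import Redis from "ioredis";

// REDIS_URL is assumed to point at the Upstash instance.
const redis = new Redis(process.env.REDIS_URL!);
const app = express();

// The platform's HTTP health check hits this endpoint; if the PING to Redis
// fails or times out, we report 503 and the check fails.
app.get("/healthz", async (_req, res) => {
  try {
    await redis.ping();
    res.status(200).send("ok");
  } catch (err) {
    res.status(503).send("redis unreachable");
  }
});

app.listen(8080);
```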
I get this as well: it causes downtime for about 15 minutes and then comes back to “sort-of” normal.
Yeah, I’m still having issues.
Error: release command failed - aborting deployment. error running release_command machine: error updating release_command machine: failed to update VM 3d8d503b97e568: invalid_argument: unable to update machine configured for auto destroy (Request ID: 01JDM8DT8X7SRAWAV0VS515Q8M-iad) (Trace ID: 9792fa700168d7617cfe71c310005e60)
For anyone seeing something similar: I found a machine in the dashboard that was not in a started state, destroyed it, and my deploy went through.
We’ve just seen our two instances recover after about 1 hour and 15 minutes. The two other instances I’d tried creating some time after (one in an existing region, one in a new one) didn’t.
Confirmed that this worked for me! Thanks.
I’m now thinking that the problems we’ve experienced aren’t related to the downtime, but that CPU quotas have been turned on:
We have an HTTP cache stored in a volume, and run a cleanup process on startup to prune old entries (cacache’s verify). This was known to be a bit intensive, but only lasted for a minute or two.
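For context, the cleanup is essentially cacache’s verify run against the cache directory on the volume. Roughly like this (a sketch; the path, cutoff, and logging are illustrative, not our exact code):

```ts
import cacache from "cacache";

// Illustrative values — not our real path or cutoff.
const CACHE_PATH = "/data/http-cache";
const MAX_AGE_MS = 1000 * 60 * 60 * 24 * 30; // 30 days

// cacache.verify() rewrites the index, drops entries the filter rejects,
// and garbage-collects orphaned content, which is why it's CPU/IO heavy
// for a minute or two when the cache is large.
async function pruneCache(): Promise<void> {
  const stats = await cacache.verify(CACHE_PATH, {
    filter: (entry) => Date.now() - entry.time < MAX_AGE_MS,
  });
  console.log("cache verify finished", stats);
}

pruneCache().catch((err) => console.error("cache verify failed", err));
```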
Hidden in that thread is:
So I think that on deploy the CPU quota balance is reset to 0, the intensive process started and was immediately throttled (which caused the HTTP server running on the instance to grind to a halt). Once the throttled task eventually completed, the throttling was lifted, allowing the HTTP server to run as expected.
I’ve not been able to recreate this with a machine restart, as the balance is kept (it consumes some of it, but doesn’t get near 0). I’ll have to confirm by turning off the intensive task (or at least delaying its start) and seeing how that deploy goes…
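If delaying it is enough, the change would be something along these lines (a sketch, assuming an Express server and that the verify doesn’t need to finish before traffic is served; the path and delay are illustrative):

```ts
import cacache from "cacache";
import express from "express";

const CACHE_PATH = "/data/http-cache"; // illustrative path
const VERIFY_DELAY_MS = 5 * 60 * 1000; // give the fresh machine time to build up CPU balance

const app = express();
// ... routes ...

app.listen(8080, () => {
  // Start serving immediately; kick off the heavy cache verify later so a
  // freshly-deployed machine with an empty quota balance isn't throttled
  // while it's also trying to answer health checks and requests.
  setTimeout(() => {
    cacache
      .verify(CACHE_PATH)
      .then((stats) => console.log("cache verify finished", stats))
      .catch((err) => console.error("cache verify failed", err));
  }, VERIFY_DELAY_MS);
});
```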
My app’s now back online. Haven’t done anything myself so it seems like either they’ve done something in the past few hours or whatever they did during the night took a while to fully take effect.