A machine on one of my apps is messed up and I can no longer deploy the app. The app is also no longer functioning properly (it cannot be loaded at its URL). I tried to destroy the machine in the dashboard, but that fails.
Can someone help me with this?
Deploy attempt
Failed to update machines: failed to update machine e28603db77e938: failed to destroy VM e28603db77e938: not_found: machine not found Retrying…
Failed to update machines: failed to update machine e28603db77e938: failed to destroy VM e28603db77e938: not_found: machine not found Retrying…
Failed to update machines: failed to update machine e28603db77e938: failed to destroy VM e28603db77e938: not_found: machine not found Retrying…
Failed to update machines: failed to update machine e28603db77e938: failed to destroy VM e28603db77e938: not_found: machine not found Retrying…
Failed to update machines: failed to update machine e28603db77e938: failed to destroy VM e28603db77e938: not_found: machine not found Retrying…
Failed to update machines: failed to update machine e28603db77e938: failed to destroy VM e28603db77e938: not_found: machine not found Retrying…
Failed to update machines: failed to update machine e28603db77e938: failed to destroy VM e28603db77e938: not_found: machine not found-
Cleared lease for e28603db77e938
Error: failed to update machine e28603db77e938: failed to destroy VM e28603db77e938: not_found: machine not found (Request ID: 01JAPMED4PZD03Z4ASZ6GTG885-sea) (Trace ID: 616120d3607da2fe72d396c8dfa629f7)
CLI destroy attempt
fly machine destroy e28603db77e938 --force
Error: machine e28603db77e938 was not found in app ‘hyperserve-docs-dev’
It would also be helpful if fly deploy warned you when running it will break your app. The app was running fine before the deploy. I could have waited to release instead of having an outage.
I have no idea how Fly has such bad observability of their own infra. I could beat their ops team to every single outage by simply watching a script that monitors the forum activity. In 2024, there’s no excuse for this other than bad culture around product and deployments.
The last one, regarding the Postgres connections, was similar. It was over 12 hours between when it was first posted in the forums (by me) and when their status page was updated.
They actually always do so now, in the Infrastructure Log. For example, the September 1 global outage was explained down to the level of individual lines of Rust code.
Today’s probably won’t be covered there until next Tuesday (October 29), though, since it’s a weekly update tempo…
Feels pretty disingenuous to say things like ‘parts of the API are down’ when for a non-zero percentage of us what that actually means is ‘production server is down and the mechanism to recover is also down’. I guess it just doesn’t roll off the tongue the same… I mean, I get it, but my customer (whose app is down) doesn’t.
Exactly. Their last update really made me mad. What parts of the API are down, which are up? Because from my perspective the whole thing is broken, and has been for hours now.
I think my original problem was unrelated to the current discussion. When I paid for a support plan to fix the issue I was facing (an unresponsive machine), they fixed that and then the APIs went down. Painful.
I had been experiencing the original issue since Friday last week, but now I’m still unable to move forward and it’s Tuesday.
I’m wondering why Fly didn’t know that a machine had gone down like it did, for days, until I had to pay for support to tell them.