I’m using Fly through the terminal these days, and the situation is frustrating. Some commands succeed only around 30% of the time. When I run a fly deploy, it very often stalls partway through without returning any error message. When I run a fly status, it often only returns output after several attempts. I don’t even have an error message to guide me; it just stops. Only by retrying persistently do I manage to complete the operations.
Is something wrong with my setup, or is flyctl just buggy?
Is there a possibility that the change on Friday broke other stuff? As I and others have reported, our apps are stuck in the suspended state and we’re not able to bring them back:
This really sucks because I have demos scheduled for tomorrow and my app is down for no obvious reason. There have been no changes and no deployments on my side in the past 3 weeks, and it was running without any issues until very recently.
I literally had to delete the instance that was left in a broken state after the broken deployment. Nothing in flyctl worked; only deleting and recreating the instance did. In the process I lost all my secrets and wasted over two hours that I didn’t have in the first place.
Yeah. I hate that this problem slipped through. This issue only came up under a high volume of traffic, essentially meaning it could only have been discovered in production. Extremely frustrating for us, and surely for you as well.
I’d really appreciate, if you feel comfortable sharing it, a little context about the command you ran on the affected version of flyctl that caused things to break: potentially your fly.toml file, and/or the exact command. Internally we’ve been tracking this as just “flyctl hanging,” but if it’s moving instances between regions there’s something else afoot, and additional context would definitely help narrow down what else was happening.
Same for you @spodercat. I’d love some context, if you can provide it, so I can try to dig into this and prevent it from happening again.
On a semi-related note, I’m experiencing another weird deployment issue.
My app had two instances: a primary in lax, and one in syd.
To test the global infra, I added some instances in Europe (tried ams, cdg, and lhr). For some reason, these new EU instances often fail to deploy successfully. From the logs I can see that the “/” path has been returning 200s on these instances, but somehow the health check was not reporting correctly?
I must’ve tried almost 10 deployments now, and only 1 went through… The EU instances often get stuck, like this:
Waiting for all green machines to start
Machine 4d89172dae6687 [app] - started
Machine 568360eb757e98 [app] - created
Machine 6e82d377ad0438 [app] - started
Machine 91857614c23683 [app] - started
Machine 9185956b1d3383 [app] - started
Machine e784e241a63d78 [app] - started
In this particular case 568360eb757e98 is lhr (so is 6e82d377ad0438, which started fine).
In one deployment it failed again and somehow left me with all the blue instances plus the successful green instances running… smh :\
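Since the logs show “/” returning 200s while the checks still fail, one thing worth double-checking is the health-check config in fly.toml. A minimal sketch of what mine looks like (port and timing values here are assumptions for illustration, not from my actual file) — a too-short grace_period can make an instance that boots slowly in a far region look unhealthy even though the app itself is fine:

```toml
[http_service]
  internal_port = 8080   # assumed port; match your app's listener
  force_https = true

  # HTTP check against "/" — the same path that is returning 200s in the logs
  [[http_service.checks]]
    grace_period = "30s"  # time allowed before the first check counts; try raising this for slow-booting regions
    interval = "15s"
    timeout = "5s"
    method = "GET"
    path = "/"
```

If the EU machines pass checks once they’re up but fail during rollout, a longer grace_period is the usual first thing to try; if they never pass, it’s more likely the proxy can’t reach internal_port at all.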
Haven’t heard back on the email I sent yet. Just making a note that this is still happening; it just took me five tries to get a successful deployment. There’s no clear indication in the logs of why instances fail to report healthy.