I’ve been using Fly through the terminal these days, and the situation is pretty frustrating. Some commands only succeed around 30% of the time. If I run fly deploy, it very often stalls partway through without returning any error message. If I run fly status, it often only returns something after a few attempts. I don’t even have an error message to guide me. It just stops. With a lot of persistence I manage to get the operations through.
Is there something wrong with my setup or is flyctl quite buggy?
Is there a possibility that the change on Friday broke other stuff? As I and others have reported, our apps are stuck in the suspended state and we’re not able to bring them back.
This really sucks because I have demos scheduled for tomorrow and my app is down for no obvious reason. There has been no change and no deployment on my side in the past 3 weeks, and it was running without any issues until very recently.
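In case it helps anyone in the same state, this is roughly what I’ve been trying from the CLI to bring the app back. A minimal sketch, assuming a Machines-based app; the app name and machine ID are placeholders:

fly status -a my-app
fly machines list -a my-app
fly machines start <machine-id> -a my-app

So far none of it gets the app out of the suspended state.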
Good to know about this option… I was having issues last night (~8 hours ago) too. The deployments hung after displaying the message “Creating green machines”.
Somehow it ended up destroying the instance in my primary region and creating two new instances in my second region.
I’m just glad that my app isn’t in production yet, but this is very worrying… There’s been a number of hiccups since I started trying out Fly only a few weeks ago.
I literally had to delete the instance that was left in a broken state after the failed deployment. No flyctl command worked; only deleting and recreating the instance did. In the process I lost all my secrets and wasted over two hours that I didn’t have in the first place.
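One painful detail for anyone else in this spot: secrets can’t be read back out of Fly. fly secrets list only shows digests of the values, not the values themselves, so after recreating things they have to be re-set from wherever you keep them. A rough sketch, with a placeholder app name and keys:

fly secrets list -a my-app
fly secrets set DATABASE_URL=<value> API_KEY=<value> -a my-app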
Yeah. I hate that this problem slipped through. This issue only came up under a high volume of traffic, essentially meaning that it could’ve only been discovered in production. Extremely frustrating for us, and surely for you as well.
I’d really appreciate, if you feel comfortable sharing it, a little bit of context about the command you ran on the affected version of flyctl that caused things to break: your fly.toml file, for example, and/or the exact command. Internally, we’ve been tracking this as just “flyctl hanging,” but if it’s moving instances between regions there’s something else afoot, and additional context would definitely help narrow down what else was happening.
Same for you @spodercat. I’d love some context, if you can provide it, so I can try to dig into this and prevent it from happening again.
When the deployments hung, there had been no changes to my fly.toml file. Unfortunately I didn’t know about the LOG_LEVEL option at the time, so I couldn’t really tell what caused the hang…
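For reference, the option mentioned above is flyctl’s debug logging, enabled through the LOG_LEVEL environment variable. Running the deploy like this makes hangs much easier to diagnose:

LOG_LEVEL=debug fly deploy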
On a semi-related note, I’m experiencing another weird deployment issue.
My app had two instances: the primary in lax and one in syd.
To test the global infra, I added some instances in Europe (I tried ams, cdg, and lhr). For some reason, these new EU instances often fail to deploy successfully. From the logs I can see that the “/” path has been returning 200s on these instances, but somehow the health check was not reporting correctly?
I must’ve tried almost 10 deployments by now, and only one went through… The EU instances often get stuck, like this:
Waiting for all green machines to start
Machine 4d89172dae6687 [app] - started
Machine 568360eb757e98 [app] - created
Machine 6e82d377ad0438 [app] - started
Machine 91857614c23683 [app] - started
Machine 9185956b1d3383 [app] - started
Machine e784e241a63d78 [app] - started
In this particular case 568360eb757e98 is lhr (so is 6e82d377ad0438, which started fine).
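For anyone debugging similar stuck checks, these are the commands I’ve been using to compare what the health checker reports against what the app actually logs (the app name is a placeholder):

fly checks list -a my-app
fly logs -a my-app -r lhr

One thing worth ruling out: if machines in a new region take longer to boot, a short grace_period on the check in fly.toml can mark them unhealthy before the app is actually up.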
In one deployment, it failed again and somehow left me with all the blue instances plus the successful green instances running… smh :\
You can use fly machines destroy to remove the green machines that were never cleaned up because the deployment failed.
Ideally it should clean up automatically if the deployment fails. I suspect the cleanup failed for the same reasons this thread was created. We are looking into it.
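In the meantime, manual cleanup looks roughly like this; pick the IDs of the leftover green machines out of the list (the app name is a placeholder):

fly machines list -a my-app
fly machines destroy <machine-id> -a my-app

If a machine is still running, the --force flag can be added to destroy it anyway.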
Haven’t heard back on the email I sent yet. Just making a note that this is still happening: it just took me five tries to get a successful deployment, and there’s no clear indication in the logs as to why instances fail to report healthy.