What is happening with flyctl?

I’m using Fly through the terminal these days, and the situation is a bit annoying. The success rate of some command is around 30%. If I do a fly deploy, it is very common for it to stop in the middle of nowhere and not return any error messages. If I do a fly status, it is very common for it to only return something after a few attempts. I don’t even have an error message to guide me. He just stops. With a lot of insistence I manage to carry out the operations.

Is there something wrong with my setup or is flyctl quite buggy?

1 Like

Hey there,

Can you try LOG_LEVEL=debug fly ... w/ your commands? It should help figure out where the issue is happening.

If that doesn’t help, you can also run the agent in debug mode (in another terminal session):

fly agent stop
LOG_LEVEL=debug fly agent run

Doing both should give a ton of information the next time the issue happens.

We shipped a feature that seemingly worked, and only showed issues in production. Deploying on Friday is always a bad idea.

Sorry this slipped through! The latest release (0.1.53) should have this bug fixed.

3 Likes

Is there a possibility that the change on friday broke other stuff? As I and others reported our apps are stuck in the suspended state and we’re not able to bring it back:

This really sucks because I have demos scheduled for tomorrow and my app is down for no obvious reasons. There has been no change, no deployment on my side past 3 weeks and it was running without any issues until very recently.

Good to know about this option… I was having issues last night (~8 hours ago) too. The deployments hung after displaying the message “Creating green machines”.

Somehow it ended up destroying my instance in the primary region, and created two new instances in my second region.

I’m just glad that my app isn’t in production yet, but this is very worrying… There’s been a number of hiccups since I started trying out Fly only a few weeks ago.

Same. 2 weeks without deploying, added a console.log. Server went down, deploys don’t work.

I should be going live TOMORROW. I can’t believe the official response was something along the lines of “OOPSIE WOOPSIE WE DID A FUCKY WUCKY”.

Guess I’ll be migrating to Google Cloud.

I literally had to delete the instance that was in a broken state after the broken deployment. Nothing flyctl worked, only deleting and recreating the instance did. In the process I lost all my secrets and I wasted over 2 hours, that I didn’t have in the first place.

This whole situtation is downright unacceptable.

Yeah. I hate that this problem slipped through. This issue only came up under a high volume of traffic, essentially meaning that it could’ve only been discovered in production. Extremely frustrating for us, and surely for you as well.

I’d really appreciate, if you feel comfortable doing so, a little bit of context about the command you ran on the affected version of flyctl that caused things to break. Things like, potentially, a fly.toml file, and/or what command you ran. Internally, we’ve been tracking this as just “flyctl hanging,” but if it’s moving instances between regions there’s something else afoot and additional context would definitely help narrow down what else was happening.

Same for you @spodercat. I’d love some context, if you can provide it, so I can try to dig into this and prevent it from happening again.

Once again: sorry about this.

1 Like

I simply ran fly deploy, my toml file:

app = "REDACTED"
primary_region = "lax"
kill_signal = "SIGTERM"
kill_timeout = 300

[deploy]
  strategy = "bluegreen"

[processes]
  app = "/app/bin/entrypoint"

[env]
  PHX_HOST = "REDACTED"
  PORT = "8080"
  RELEASE_COOKIE = "REDACTED"
  PRIMARY_REGION = "lax"

[http_service]
  internal_port = 8080
  force_https = true
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 0
  processes = ["app"]
  [http_service.concurrency]
    type = "requests"
    hard_limit = 1000
    soft_limit = 800
  [[http_service.checks]]
    grace_period = "5s"
    interval = "2s"
    method = "GET"
    timeout = "1s"
    path = "/"

When the deployments hung there had been no changes to my toml file. Unfortunately I didn’t know about the LOG_LEVEL at the time so I couldn’t really tell what caused the hang…

1 Like

On a semi related note, I’m experiencing another weird deployment issue.

My app had two instances, primary in lax, and one in syd.

To test the global infra, I added some instances from Europe (tried ams, cdg and lhr) - for some reason, these new EU instances often fail to deploy successfully. From the logs I can see that the “/” path has been getting 200s on the these instances, but somehow the health check was not reporting correctly?

I must’ve tried almost 10 deployments now, only 1 deployment went through… The EU instances often get stuck, like this:

Waiting for all green machines to start
  Machine 4d89172dae6687 [app] - started
  Machine 568360eb757e98 [app] - created
  Machine 6e82d377ad0438 [app] - started
  Machine 91857614c23683 [app] - started
  Machine 9185956b1d3383 [app] - started
  Machine e784e241a63d78 [app] - started

In this particular case 568360eb757e98 is lhr (so is 6e82d377ad0438, which started fine).

In one deployment, it failed again and somehow left me with all the blue instances plus the successful green instances running… smh :\

I have saved the output of a debug deployment, please let me know where to send it.

My scaling was set to 6 instances, but because it failed again, adding green without destroying blue, I got this message during the deployment:

"error": "To create more than 20 machines per organization please contact billing@fly.io"

:see_no_evil:

You can use fly machines destroy to destroy the green machines that never got destroyed because the deployment failed.

Ideally it should cleanup if the deployment fails. I suspect it failed to cleanup because of the same reasons why this thread was created. We are looking into it.

I’ll send a message and you can dump the debug logs there!

Thanks @kwaw .

Haven’t heard back from the email sent yet. Just making a note that this is still happening. Just took me five tries to get a successful deployment. There’s no clear indication in the logs to suggest why instances fail to report health.

@fredwu do you have more success with other deployment strategies? I’m trying to figure out if this is a bluegreen thing or a platform thing.

Interestingly, I just tried both bluegreen and rolling a bunch of times - bluegreen failed a few times whereas rolling never failed.

So perhaps this is indeed something to do with the bluegreen strategy. :thinking:

2 Likes

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.