After New Deploy - Site Is No Longer Available

After deploying an app (with no changes to the app itself) as part of a CI/CD pipeline setup, the app is no longer accepting requests.

I can see from the logs that everything is up with no errors, but the site also produces no log output when I hit it.

It seems the site no longer has any [[services]] attached to it, as it is no longer accessible from the outside world.

Any ideas? This has completely taken down our app, so this is pretty concerning as we continue moving everything to Fly.

Thanks!

Sounds similar to my issue from here: Elixir: Infinite Loop when running Config.Provider? - #14 by ksluszniak. There, too, @mikehostetler related the issue to deploying via GitHub Actions, as did I. Then the app “fixed itself”, either because I ran a local fly deploy or because it simply healed on its own for some other reason.

I’ve done many deploys since then via the same GitHub Actions workflow and the issue hasn’t returned, although in the last two days that workflow often errored out or hung for hours even though the deployment itself succeeded. Given that I faced no such issues with a local fly deploy, this may be CI-specific, but there’s no hard proof of that…

Yes, it sounds very similar to this, but I have no idea how you fixed it?

This is VERY concerning to see: a deployment with zero reported issues can take down our entire app?

The only difference was that I added --detach to the fly deploy command in CI, since it would otherwise waste 10 minutes showing propagation status changes, which burns a lot of CI credits.

What’s the actual solution to this, other than manually deploying from a local machine?

Agreed, I’d also like to see this solved. For now, I’d remove the --detach option, even if just for debugging purposes - IMO there’s no point saving CI time if the whole app is shaky.

I agree on uptime vs. saved CI credits - but at the same time, this seems like a major issue that should not be affected by --detach.

This takes the build/deployment from 2 minutes to 12 minutes, as this app deploys to multiple regions…

Like I said, I don’t care about the credits at the moment; I’m more concerned with the issue itself. It almost seems as if these VMs are not being exposed to the outside world for some reason, with no errors in the deployment, the VMs, the logs, etc.

Can someone at Fly take a look at this for us? This is very concerning. (I currently have a bricked app that you can inspect; we have not yet sent traffic to it, so hopefully it can help you track down the issue.) Feel free to DM me for more details and app names.

--detach shouldn’t do anything differently (apart from not waiting for the deployment to complete).

That would be a different issue, e.g. if there was no healthy previous version to fall back to, or the previous version was identical.

To be honest, Nomad is quite opaque in these cases. It’s hard to know what happened exactly. That’s one of the reasons we’re moving away.

I’m going to look into your app. DMing you.

So my theory was correct in the sense that there were no [[services]] attached to the VM, making the app inaccessible from the outside world.
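For anyone unfamiliar, [[services]] is the section of fly.toml that tells Fly's edge proxy how to route public traffic to the VM. A minimal sketch (the app name and ports below are illustrative assumptions, not the affected app's real config) looks roughly like:

```toml
# Illustrative fly.toml fragment; names and ports are assumptions.
app = "example-app"

[[services]]
  internal_port = 8080
  protocol = "tcp"

  [[services.ports]]
    handlers = ["http"]
    port = 80

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443
```

If fly deploy runs with an empty config because fly.toml was never found, no such section gets attached, and the VM can boot cleanly while receiving no public traffic - which matches the symptoms above.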

This happened because the CircleCI fly deploy was not passing the --config flag pointing at the fly.toml file. This is not normally required, but it was in my case because fly.toml was not located in the directory where the deploy command was run.

By simply adding this flag and pointing it at the correct path (one directory down), it worked like a charm.
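As a sketch of the fix in CI terms: a small guard like the one below (the helper name and paths are hypothetical choices of mine, not from this thread) fails the step early when fly.toml isn't where the deploy command expects it, instead of silently deploying an empty config. The flyctl call is left commented so the guard itself runs anywhere:

```shell
# check_config: verify a fly.toml exists at the given path before deploying.
# Helper name and paths are illustrative assumptions.
check_config() {
  if [ -f "$1" ]; then
    echo "deploying with config: $1"
    # flyctl deploy --config "$1" --detach
  else
    echo "fly.toml not found at $1" >&2
    return 1
  fi
}

# Example: the config lives one directory down from where CI runs the step.
# check_config ./app/fly.toml
```

With this in place, a wrong working directory shows up as a loud CI failure rather than a deploy that succeeds but detaches all services.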

I want to confirm this issue was 100% caused by user error - nothing related to the Fly service.

@ksluszniak - I would bet you are having the same issue; in your CI, make sure fly deploy can find the fly.toml file.

Thanks @jerome


We’re actually shipping a change that will use the last good config file if you deploy with no config specified. Sometimes it’s useful to deploy just an image without knowing about the config; it’s rare that people would really want to deploy an empty config when the file is missing.


It’s funny you mention that; I was going to suggest something along those lines.

no config - no changes

👍

Thanks, but surely my case was different, as CI was working fine both before and after the temporary downtime, and fly.toml was always in place.

Looking at yesterday’s OVH incident, I’m hoping mine was also some one-off incident that won’t return often, or ever again (here’s to wishful thinking ;)).

Bear in mind that the following still holds true in my case, showing that either GitHub Actions or Fly infra does suffer some issues:

  • The GH Action hung for hours two days ago even though a local deploy didn’t
  • The GH Action threw “no deployment to monitor” multiple times even though a local deploy didn’t
  • (maybe related, maybe not) both deploy and postgres create randomly returned “unknown error” over the last couple of days

That’s the complete list of concerns from setting up a new clustered cross-region Elixir app over the last couple of days. (Not much, IMO, considering how incredible Fly is for Elixir apps. 🚀)

GitHub Action hangs are something of a known issue. For whatever reason, GitHub Actions hang while trying to show deployment progress. The deploy continues in the background, but the action just spins, showing no further output. We’re working on this, though we aren’t completely certain why it happens.

The “no deployment to monitor” and “unknown error” messages yesterday were the result of the outage.


@kurt Thanks for the info! For the record, I saw the unknown error repeatedly several days ago when creating postgres and doing a local non-GH deploy. Both just stopped happening eventually, both (luckily 🙂) just before I got to execute my last-resort plan, which was to create another organization and start over.

Ah! If that happens again, try running:

LOG_LEVEL=debug fly deploy

And paste the output here. There are some bugs that can cause that; the logs should tell us what’s up.


@jerome

To be honest, Nomad is quite opaque in these cases. It’s hard to know what happened exactly. That’s one of the reasons we’re moving away.

Sounds interesting. Can you share more info about that transition?


I also saw the hanging deployment in CircleCI - solved by using --detach