Outage caused by Fly infrastructure?

Hi, we’ve been running an app for ~80h, 4 instances in 2 regions (iad, fra).

We had an incident where the app started responding with 401 to customers.

The outage appears to have lasted from Dec 19 8:41pm to Dec 20 8:19am UTC.

There was no new deployment from our side.

The solution was to redeploy the app (exact same code).

This happened to two of our apps on Fly at the same time. The other app is just for receiving webhooks and doesn’t even have the same auth.

We’re quite confused and unable to point to a mistake on our end so far.

Can you please advise whether anything happened on your end, and how we can further debug this issue? Any tips are appreciated.

Thanks!


Here’s something really confusing. Why does concurrency during the outage go to zero? We didn’t touch that…

Nothing in our infrastructure serves 401s, so if people were getting 401s it was almost definitely from the app process.

Concurrency can go to zero if requests get really fast. 401s seem like they might happen very fast? If that’s true, you might’ve seen a change in response times during that interval as well.

Deploying will often put VMs on new hosts with new IPs. If you’re relying on an upstream API or service, it’s possible it rate limited your existing VMs and the deploy just worked around the rate limit.

Not sure it's related, but a deployment failed a few hours earlier:

Maybe related to this on your status page:

That sounds reasonable and is what we thought too. Our tracing in Honeycomb has an outage that perfectly coincides with this though, hence the confusion.

Yes, the response times went down:

Are you saying that concurrency 0 is a misleading metric and that there were still 4 app instances running?

Concurrency is exactly the number of requests that are in progress when we scrape the metric (every 15s). When requests are coming back very fast, that gauge usually hangs out at 0. It's a little counterintuitive.
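
As a rough illustration with made-up numbers: by Little's law the average number of in-flight requests is arrival rate times response time, so fast 401s barely register on a 15-second scrape.

# Rough illustration (hypothetical numbers): average in-flight requests
# = arrival rate x response time (Little's law).
# At 10 req/s and 5 ms per response:
awk 'BEGIN { print 10 * 0.005 }'   # 0.05 in flight on average, so a 15s scrape almost always reads 0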

Thanks @kurt, makes sense!

@kurt We found an issue that suggests a bad deployment on our side actually.

But to confirm that we need to access the config that was used for that release.

It seems that flyctl config only allows access to the config used for the latest release?

Is there another way?
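
For context, the kind of thing we have been trying looks roughly like this (a sketch; we are assuming flyctl releases and flyctl config display are the relevant subcommands, and we may be missing a flag):

# List the release history for the app.
flyctl releases --app "$FLY_APP_NAME"

# Show the config the platform holds for the app; as far as we can tell this
# only reflects the latest release.
flyctl config display --app "$FLY_APP_NAME"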

@kurt We dug deeper and we still think there is an issue with Fly, but it happened earlier in our pipeline. We think that the app name we sent via the CLI to deploy was somehow not honoured.

Take a look at two job outputs for the same workflow in our GH CI. The app names in the env are preview ones, and we've been using this bash to script flyctl for months without issue until now. We don't think there is a bug in our CI code. We can see in the flyctl output that it pushes to our production app's image registry, not the preview one.

Given that this coincided with the deployment issues Fly was having on its platform, we are inclined to believe that issue affected our deployment pipeline. Could you confirm this is possible?

Here is the bash we use. Again we’re confident about it, use it all the time, etc.:

    - name: Deploy
      id: deploy
      env:
        FLY_API_TOKEN: ${{inputs.flyApiToken}}
        FLY_APP_NAME: ${{ (github.event_name == 'pull_request' || inputs.stage == 'preview') && format('pdp-{0}-{1}', github.event.number, inputs.app) || format('pdp-{0}', inputs.app) }}
        FLY_ORG: prisma${{ (github.event_name == 'pull_request' || inputs.stage == 'preview') && '-preview' || '' }}
        FLY_REGION: iad
        DEPLOYMENT_STAGE: ${{(github.event_name == 'pull_request' || inputs.stage == 'preview') && 'preview' || 'production'}}
      working-directory: apps/${{inputs.app}}/build-deployment
      shell: bash
      run: |
        # Create app first if needed, otherwise regular deploy.
        if ! flyctl status --app "$FLY_APP_NAME"; then
          flyctl launch \
            --no-deploy \
            --copy-config \
            --name "$FLY_APP_NAME" \
            --region "$FLY_REGION" \
            --org "$FLY_ORG"
          if [ '${{inputs.secretsPassword}}' != '' ]; then
            flyctl secrets set --app "$FLY_APP_NAME" 'SECRETS_PASSWORD=${{inputs.secretsPassword}}'
          fi
          if [ '${{inputs.database-connection-string}}' != '' ]; then
            flyctl secrets set --app "$FLY_APP_NAME" 'SERVICES_DB_URL=${{inputs.database-connection-string}}'
          fi
          # To see sizes: flyctl platform vm-sizes
          flyctl scale vm shared-cpu-1x --app "$FLY_APP_NAME" --memory 1024
        fi

        # Below we expose git environment variables to Fly apps.
        # See: https://docs.github.com/en/actions/learn-github-actions/contexts#github-context

        BRANCH=${{ github.event_name == 'pull_request' && format('{0}', github.head_ref) || format('{0}', github.ref_name) }}
        # See call outs about SHA here https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#pull_request
        SHA=${{ github.event_name == 'pull_request' && format('{0}', github.event.pull_request.head.sha) || format('{0}', github.sha) }}

        if [ "$DEPLOYMENT_STAGE" == 'preview' ]; then
          # 
          flyctl deploy \
            --app "$FLY_APP_NAME" \
            --region "$FLY_REGION" \
            --strategy immediate \
            --env DEPLOYMENT_STAGE="$DEPLOYMENT_STAGE" \
            ${{ github.event_name == 'pull_request' && format('--env GIT_COMMIT_PR={0}', github.event.number) || format('') }} \
            --env GIT_COMMIT_BRANCH=$BRANCH \
            --env GIT_COMMIT_SHA=$SHA \
            --env GIT_COMMIT_AUTHOR=${{github.actor}}
        else
          flyctl deploy \
            --remote-only \
            --env DEPLOYMENT_STAGE="$DEPLOYMENT_STAGE" \
            ${{ github.event_name == 'pull_request' && format('--env GIT_COMMIT_PR={0}', github.event.number) || format('') }} \
            --env GIT_COMMIT_BRANCH=$BRANCH \
            --env GIT_COMMIT_SHA=$SHA \
            --env GIT_COMMIT_AUTHOR=${{github.actor}}
        fi

        flyctl status --app "$FLY_APP_NAME" --json >status.json
        cat status.json

        # Make some info available to the GitHub workflow.
        hostName=$(jq -r .Hostname status.json)
        appId=$(jq -r .ID status.json)
        echo "::set-output name=host-name::$hostName"
        echo "::set-output name=url::https://$hostName"
        echo "::set-output name=id::$appId"

I am also confused by this output from flyctl. It too looks like some kind of bug or platform issue:

The “No deployment available to monitor” error is the issue we were having. Deploys are working fine in the background, but the data we sync to update flyctl is lagging.

Those mismatched app names make me think there’s an app = "pdp-zebra" in the fly.toml. The env var is pdp-3118-zebra, but the deploy is definitely happening against pdp-zebra, note the line about the config with fly.toml.

We can look up previous release changes for you. You all are paying, I think, so if you just choose the appropriate plan here, you’ll get a paid support email. Those plans are basically a minimum commitment, they won’t cost you anything else: Plan Pricing · Fly

Those mismatched app names make me think there’s an app = "pdp-zebra" in the fly.toml. The env var is pdp-3118-zebra, but the deploy is definitely happening against pdp-zebra, note the line about the config with fly.toml.

Yep, that's true; that's what you see on line 238, for example, where it says:

“An existing fly.toml file was found for app pdp-mammoth”.

But whenever we run flyctl we always pass the --app flag (or, in the create case with flyctl launch, the --name flag).

We are wondering whether it's possible that the flags we passed were not respected, and thus did not override the fly.toml file.

We only had problems with flyctl during the noted Fly platform issues. It's bash/CI code we haven't touched, and we've had no issues like this with it, for around 90+ days now.

Re-running these jobs in CI now produces different results than they emitted during the problem period on Dec 19. :thinking:
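
To make this easier to spot in future runs, we are thinking of logging the app name declared in fly.toml next to the value we pass via --app before deploying (a sketch; it assumes a plain app = "..." line in fly.toml):

# Pre-deploy sanity check: print the app name the local fly.toml declares
# versus what we pass via --app, so any mismatch shows up in the CI logs.
tomlApp=$(sed -nE 's/^app *= *"(.*)"/\1/p' fly.toml)
echo "fly.toml app: $tomlApp"
echo "--app flag:   $FLY_APP_NAME"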

We can look up previous release changes for you. You all are paying, I think, so if you just choose the appropriate plan here, you’ll get a paid support email. Those plans are basically a minimum commitment, they won’t cost you anything else: Plan Pricing · Fly

Yep we’re paying so we can go through that channel.

Here is an example of re-running the CI job on GH (left is the old, bad run; right is the new re-run, working). It shows the output diff we appear to have gotten from Fly:

This is a CI job re-run, so it's the same code, env vars, etc.


Debugging in public is more fun! (:


From our point of view we're still pretty sure the fault wasn't on our side. That said, we've since moved on and no longer use the same token/account for production and preview, making this kind of cross-stage deployment impossible for us going forward.
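
Concretely, the token selection now looks something like this (illustrative only; the secret names are made up and the exact wiring in our workflow differs):

env:
  # A preview job never holds a token that is valid for the production org, so
  # it cannot deploy across stages even if an app name is wrong.
  FLY_API_TOKEN: ${{ (github.event_name == 'pull_request' || inputs.stage == 'preview') && secrets.FLY_PREVIEW_TOKEN || secrets.FLY_PRODUCTION_TOKEN }}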


I vaguely remember a regression in flyctl that changed how -a works. Based on what you saw, that seems like a good suspect.

Thanks for the insight Kurt!