Outage caused by Fly infrastructure?

Hi, we’ve been running an app for ~80h, 4 instances in 2 regions (iad, fra).

We had an incident where the app started responding with 401 to customers.

The outage appears to have lasted from Dec 19 8:41pm to Dec 20 8:19am UTC.

There was no new deployment from our side.

The solution was to redeploy the app (exact same code).

This happened to two of our apps on Fly at the same time. The other app is just for receiving webhooks and doesn’t even have the same auth.

We’re quite confused and unable to point to a mistake on our end so far.

Can you please advise whether anything happened on your end, and how we can further debug this issue? Any tips are appreciated.

Thanks!


Here’s something really confusing. Why does concurrency during the outage go to zero? We didn’t touch that…

Nothing in our infrastructure serves 401s, so if people were getting 401s it was almost definitely from the app process.

Concurrency can go to zero if requests get really fast. 401s seem like they might happen very fast? If that’s true, you might’ve seen a change in response times during that interval as well.

Deploying will often put VMs on new hosts with new IPs. If you’re relying on an upstream API or service, it’s possible it rate limited your existing VMs and the deploy just worked around the rate limit.

Not sure it's related, but a deployment failed a few hours earlier:

Maybe related to this on your status page:

That sounds reasonable and is what we thought too. Our tracing in Honeycomb has an outage that perfectly coincides with this though, hence the confusion.

Yes, the response times went down:

Are you saying that concurrency 0 is a misleading metric and that there were still 4 app instances running?

Concurrency is exactly the number of requests that are in progress when we scrape the metric (every 15s). When requests are coming back very fast, that gauge usually hangs out at 0. It's a little counterintuitive.
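
As a rough illustration with made-up numbers: by Little's law the average number of in-flight requests is arrival rate times response time, so fast 401s barely register on a 15-second scrape.

# Rough illustration (hypothetical numbers): average in-flight requests
# = arrival rate x response time (Little's law).
# At 10 req/s and 5 ms per response:
awk 'BEGIN { print 10 * 0.005 }'   # 0.05 in flight on average, so a 15s scrape almost always reads 0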

Thanks @kurt, makes sense!

@kurt We found an issue that suggests a bad deployment on our side actually.

But to confirm that we need to access the config that was used for that release.

It seems that flyctl config only allows access to the config used for the latest release?

Is there another way?
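
For context, the kind of thing we have been trying looks roughly like this (a sketch; we are assuming flyctl releases and flyctl config display are the relevant subcommands, and we may be missing a flag):

# List the release history for the app.
flyctl releases --app "$FLY_APP_NAME"

# Show the config the platform holds for the app; as far as we can tell this
# only reflects the latest release.
flyctl config display --app "$FLY_APP_NAME"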

@kurt We dug deeper and we still think there is an issue with Fly, but it happened earlier in our pipeline. We think that the app name we sent via the CLI to deploy was somehow not honoured.

Take a look at two job outputs for the same workflow in our GH CI. The app names in the env are preview ones, and we've been using this bash to script flyctl for months without issue until now. We don't think there is a bug in our CI code. We can see in the flyctl output that it pushes to our production app's image registry, not the preview one.

Given that this coincided with the deployment issues Fly was having on its platform, we are inclined to believe that issue affected our deployment pipeline. Could you confirm this is possible?

Here is the bash we use. Again we’re confident about it, use it all the time, etc.:

    - name: Deploy
      id: deploy
      env:
        FLY_API_TOKEN: ${{inputs.flyApiToken}}
        FLY_APP_NAME: ${{ (github.event_name == 'pull_request' || inputs.stage == 'preview') && format('pdp-{0}-{1}', github.event.number, inputs.app) || format('pdp-{0}', inputs.app) }}
        FLY_ORG: prisma${{ (github.event_name == 'pull_request' || inputs.stage == 'preview') && '-preview' || '' }}
        FLY_REGION: iad
        DEPLOYMENT_STAGE: ${{(github.event_name == 'pull_request' || inputs.stage == 'preview') && 'preview' || 'production'}}
      working-directory: apps/${{inputs.app}}/build-deployment
      shell: bash
      run: |
        # Create app first if needed, otherwise regular deploy.
        if ! flyctl status --app "$FLY_APP_NAME"; then
          flyctl launch \
            --no-deploy \
            --copy-config \
            --name "$FLY_APP_NAME" \
            --region "$FLY_REGION" \
            --org "$FLY_ORG"
          if [ '${{inputs.secretsPassword}}' != '' ]; then
            flyctl secrets set --app "$FLY_APP_NAME" 'SECRETS_PASSWORD=${{inputs.secretsPassword}}'
          fi
          if [ '${{inputs.database-connection-string}}' != '' ]; then
            flyctl secrets set --app "$FLY_APP_NAME" 'SERVICES_DB_URL=${{inputs.database-connection-string}}'
          fi
          # To see sizes: flyctl platform vm-sizes
          flyctl scale vm shared-cpu-1x --app "$FLY_APP_NAME" --memory 1024
        fi

        # Below we expose git environment variables to Fly apps.
        # See: https://docs.github.com/en/actions/learn-github-actions/contexts#github-context

        BRANCH=${{ github.event_name == 'pull_request' && format('{0}', github.head_ref) || format('{0}', github.ref_name) }}
        # See call outs about SHA here https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#pull_request
        SHA=${{ github.event_name == 'pull_request' && format('{0}', github.event.pull_request.head.sha) || format('{0}', github.sha) }}

        if [ "$DEPLOYMENT_STAGE" == 'preview' ]; then
          # 
          flyctl deploy \
            --app "$FLY_APP_NAME" \
            --region "$FLY_REGION" \
            --strategy immediate \
            --env DEPLOYMENT_STAGE="$DEPLOYMENT_STAGE" \
            ${{ github.event_name == 'pull_request' && format('--env GIT_COMMIT_PR={0}', github.event.number) || format('') }} \
            --env GIT_COMMIT_BRANCH=$BRANCH \
            --env GIT_COMMIT_SHA=$SHA \
            --env GIT_COMMIT_AUTHOR=${{github.actor}}
        else
          flyctl deploy \
            --remote-only \
            --env DEPLOYMENT_STAGE="$DEPLOYMENT_STAGE" \
            ${{ github.event_name == 'pull_request' && format('--env GIT_COMMIT_PR={0}', github.event.number) || format('') }} \
            --env GIT_COMMIT_BRANCH=$BRANCH \
            --env GIT_COMMIT_SHA=$SHA \
            --env GIT_COMMIT_AUTHOR=${{github.actor}}
        fi

        flyctl status --app "$FLY_APP_NAME" --json >status.json
        cat status.json

        # Make some info available to the GitHub workflow.
        hostName=$(jq -r .Hostname status.json)
        appId=$(jq -r .ID status.json)
        echo "::set-output name=host-name::$hostName"
        echo "::set-output name=url::https://$hostName"
        echo "::set-output name=id::$appId"

I am also confused by this output from flyctl. It too looks like some kind of bug or platform issue:

The “No deployment available to monitor” error is the issue we were having. Deploys are working fine in the background, but the data we sync to update flyctl is lagging.

Those mismatched app names make me think there’s an app = "pdp-zebra" in the fly.toml. The env var is pdp-3118-zebra, but the deploy is definitely happening against pdp-zebra, note the line about the config with fly.toml.

We can look up previous release changes for you. You all are paying, I think, so if you just choose the appropriate plan here, you’ll get a paid support email. Those plans are basically a minimum commitment, they won’t cost you anything else: Plan Pricing · Fly

Those mismatched app names make me think there’s an app = "pdp-zebra" in the fly.toml. The env var is pdp-3118-zebra, but the deploy is definitely happening against pdp-zebra, note the line about the config with fly.toml.

Yep, that's true; that's what you see on line 238, for example, where it says:

“An existing fly.toml file was found for app pdp-mammoth”.

But whenever we run flyctl we always pass the --app flag (or, in the create case with flyctl launch, the --name flag).

We are wondering whether it's possible that the flags we passed were not respected, and thus did not override the fly.toml file.

We only had problems with flyctl during the noted Fly platform issues. It's bash/CI code we haven't touched, and we've had no issues like this with it, for around 90+ days now.

Re-running these jobs in CI now produces different results than they emitted during the problem period on Dec 19. :thinking:
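
To make this easier to spot in future runs, we are thinking of logging the app name declared in fly.toml next to the value we pass via --app before deploying (a sketch; it assumes a plain app = "..." line in fly.toml):

# Pre-deploy sanity check: print the app name the local fly.toml declares
# versus what we pass via --app, so any mismatch shows up in the CI logs.
tomlApp=$(sed -nE 's/^app *= *"(.*)"/\1/p' fly.toml)
echo "fly.toml app: $tomlApp"
echo "--app flag:   $FLY_APP_NAME"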

We can look up previous release changes for you. You all are paying, I think, so if you just choose the appropriate plan here, you’ll get a paid support email. Those plans are basically a minimum commitment, they won’t cost you anything else: Plan Pricing · Fly

Yep we’re paying so we can go through that channel.

Here is an example of re-running the CI job on GH (left is the old, bad run; right is the new re-run, working). It shows the output diff we appear to have gotten from Fly:

This is a CI job re-run, so it's the same code, env vars, etc.


Debugging in public is more fun! (:


From our point of view we're still pretty sure the fault wasn't on our side. That said, we've since moved on and no longer use the same token/account for production and preview, making this kind of cross-stage deployment impossible for us going forward.
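
Concretely, the token selection now looks something like this (illustrative only; the secret names are made up and the exact wiring in our workflow differs):

env:
  # A preview job never holds a token that is valid for the production org, so
  # it cannot deploy across stages even if an app name is wrong.
  FLY_API_TOKEN: ${{ (github.event_name == 'pull_request' || inputs.stage == 'preview') && secrets.FLY_PREVIEW_TOKEN || secrets.FLY_PRODUCTION_TOKEN }}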


I vaguely remember a regression in flyctl that changed how -a works. Based on what you saw, that seems like a good suspect.

Thanks for the insight Kurt!