Deploys are failing

What’s up :grimacing: Is it a problem with the fra region again? This region seems to be problematic :cry: I would move to a different region, but our DB runs on AWS in fra, so our app would become much slower.

It seems like the deploy went through, but the fly deploy command still failed. Confusing.

The deploy after that also failed, and that one never went through.

Just tried another deploy. 2 VMs get created and go to the running state, but no other VMs are created. We have 3 process groups (scheduler, worker, and web) with 1, 3, and 2 VMs allocated respectively.

Maybe still an issue related to:

Yeah, possibly, but this has been a problem for a while now and has been marked as “resolved” since yesterday…

Now deploys are finally done:

[image]

This message is shown under activity though:

Odd

Our deployments are still failing regularly. In one instance, there was even an automatic revert after a deployment that initially seemed successful. Is there any update as to what might be going on here? I’m running a deployment now with debug logging enabled to hopefully get some more insight. Will post results ASAP.

(I’m a colleague of @Yaeger's)

This is the last bit of output of a deployment with debug logs enabled. It looks like everything is fine, yet somewhere it is decided that there are unhealthy allocations, and the deployment is aborted.

🌏 DEBUG --> POST https://api.fly.io/graphql

{
  "query": "query ($appName: String!, $deploymentId: ID!, $evaluationId: String!) { app(name: $appName) { deploymentStatus(id: $deploymentId, evaluationId: $evaluationId) { id inProgress status successful description version desiredCount placedCount healthyCount unhealthyCount allocations { id idShort status region desiredStatus version healthy failed canary restarts checks { status serviceName } } } } }",
  "variables": {
    "appName": "staxcloud-prod",
    "deploymentId": "edf2935e-79e4-8595-8104-c192ccd6e8b2",
    "evaluationId": "dbfc18e3-6dcf-05d5-99b1-962b6761b4bd"
  }
}

DEBUG {}
🌍 DEBUG <-- 200 https://api.fly.io/graphql (438.7ms)

{
  "data": {
    "app": {
      "deploymentStatus": {
        "id": "edf2935e-79e4-8595-8104-c192ccd6e8b2",
        "inProgress": true,
        "status": "running",
        "successful": false,
        "description": "Deployment is running pending automatic promotion",
        "version": 205,
        "desiredCount": 6,
        "placedCount": 3,
        "healthyCount": 2,
        "unhealthyCount": 1,
        "allocations": [
          {
            "id": "64b1fcba-a4a5-6440-f73b-db21cb331f7f",
            "idShort": "64b1fcba",
            "status": "running",
            "region": "fra",
            "desiredStatus": "run",
            "version": 205,
            "healthy": true,
            "failed": false,
            "canary": false,
            "restarts": 0,
            "checks": [
              {
                "status": "passing",
                "serviceName": "tcp-8080"
              }
            ]
          },
          {
            "id": "634af7b6-8f5c-a63a-6bad-c1e06fc60c75",
            "idShort": "634af7b6",
            "status": "running",
            "region": "fra",
            "desiredStatus": "run",
            "version": 205,
            "healthy": true,
            "failed": false,
            "canary": false,
            "restarts": 0,
            "checks": []
          },
          {
            "id": "be01895c-2a11-0d3d-87ae-edb38a4a5baf",
            "idShort": "be01895c",
            "status": "pending",
            "region": "fra",
            "desiredStatus": "run",
            "version": 205,
            "healthy": true,
            "failed": false,
            "canary": false,
            "restarts": 0,
            "checks": []
          }
        ]
      }
    }
  }
 6 desired, 3 placed, 2 healthy, 1 unhealthy [health checks: 1 total, 1 passing]
🌍 DEBUG --> POST https://api.fly.io/graphql

{
  "query": "query ($appName: String!, $deploymentId: ID!, $evaluationId: String!) { app(name: $appName) { deploymentStatus(id: $deploymentId, evaluationId: $evaluationId) { id inProgress status successful description version desiredCount placedCount healthyCount unhealthyCount allocations { id idShort status region desiredStatus version healthy failed canary restarts checks { status serviceName } } } } }",
  "variables": {
    "appName": "staxcloud-prod",
    "deploymentId": "edf2935e-79e4-8595-8104-c192ccd6e8b2",
    "evaluationId": "dbfc18e3-6dcf-05d5-99b1-962b6761b4bd"
  }
}

DEBUG {}
🌍 DEBUG <-- 200 https://api.fly.io/graphql (254.64ms)

{
  "data": {
    "app": {
      "deploymentStatus": {
        "id": "edf2935e-79e4-8595-8104-c192ccd6e8b2",
        "inProgress": false,
        "status": "failed",
        "successful": false,
        "description": "Failed due to unhealthy allocations - rolling back to job version 204",
        "version": 205,
        "desiredCount": 6,
        "placedCount": 3,
        "healthyCount": 2,
        "unhealthyCount": 1,
        "allocations": [
          {
            "id": "64b1fcba-a4a5-6440-f73b-db21cb331f7f",
            "idShort": "64b1fcba",
            "status": "running",
            "region": "fra",
            "desiredStatus": "run",
            "version": 205,
            "healthy": true,
            "failed": false,
            "canary": false,
            "restarts": 0,
            "checks": [
              {
                "status": "passing",
                "serviceName": "tcp-8080"
              }
            ]
          },
          {
            "id": "634af7b6-8f5c-a63a-6bad-c1e06fc60c75",
            "idShort": "634af7b6",
            "status": "running",
            "region": "fra",
            "desiredStatus": "run",
            "version": 205,
            "healthy": true,
            "failed": false,
            "canary": false,
            "restarts": 0,
            "checks": []
          },
          {
            "id": "be01895c-2a11-0d3d-87ae-edb38a4a5baf",
            "idShort": "be01895c",
            "status": "pending",
            "region": "fra",
            "desiredStatus": "run",
            "version": 205,
            "healthy": true,
            "failed": false,
            "canary": false,
            "restarts": 0,
            "checks": []
          }
        ]
      }
    }
  }
 6 desired, 3 placed, 2 healthy, 1 unhealthy [health checks: 1 total, 1 passing]
--> v205 failed - Failed due to unhealthy allocations - rolling back to job version 204 and deploying as v206

--> Troubleshooting guide at https://fly.io/docs/getting-started/troubleshooting/
Error abort
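To make the mismatch in that last response easier to see, here is a minimal Python sketch that cross-checks the top-level counters against the per-allocation fields. The helper name `summarize` is hypothetical and not part of flyctl; the field names and values are copied from the response above, trimmed to the fields that matter.

```python
# Sketch only: `summarize` is a made-up helper, not a flyctl or Fly.io API.
# The data below is a trimmed copy of the final deploymentStatus response.

def summarize(status: dict) -> dict:
    """Compare the deployment counters with what each allocation reports."""
    allocs = status["allocations"]
    return {
        # Allocations the scheduler wanted but never placed.
        "missing": status["desiredCount"] - status["placedCount"],
        # What the API counts as unhealthy.
        "reported_unhealthy": status["unhealthyCount"],
        # Allocations that themselves claim to be unhealthy or failed.
        "flagged_unhealthy": [
            a["idShort"] for a in allocs if not a["healthy"] or a["failed"]
        ],
        # Allocations that never left the pending state.
        "pending": [a["idShort"] for a in allocs if a["status"] == "pending"],
        # Allocations with no health checks attached at all.
        "without_checks": [a["idShort"] for a in allocs if not a["checks"]],
    }


status = {
    "desiredCount": 6,
    "placedCount": 3,
    "healthyCount": 2,
    "unhealthyCount": 1,
    "allocations": [
        {
            "idShort": "64b1fcba",
            "status": "running",
            "healthy": True,
            "failed": False,
            "checks": [{"status": "passing", "serviceName": "tcp-8080"}],
        },
        {
            "idShort": "634af7b6",
            "status": "running",
            "healthy": True,
            "failed": False,
            "checks": [],
        },
        {
            "idShort": "be01895c",
            "status": "pending",
            "healthy": True,
            "failed": False,
            "checks": [],
        },
    ],
}

summary = summarize(status)
for key, value in summary.items():
    print(f"{key}: {value}")
```

For this response the sketch reports three placements missing, one allocation counted as unhealthy, and no allocation that is itself flagged unhealthy or failed. The only candidate left is be01895c, which is stuck in pending. That suggests, but does not prove, that the rollback was triggered by an allocation that could not be started rather than by a failing health check.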



@fideloper-fly you have been of help in the past, is there any insight you could provide here?

P.S.: The specific app we are talking about is called staxcloud-prod

I believe fra had some capacity issues recently but to my knowledge we increased capacity there - I can double check.

What are your health checks set to? Is it possible some fail randomly if their grace period is too short?

@fideloper-fly we just have the default TCP checks that are created when initialising a fly.toml for Laravel, so I doubt that’s the cause. We scaled the worker group down from 3 VMs to 1, and after that the deployments went through. Perhaps it is a capacity problem for large VMs… They are dedicated-cpu-2x machines.

I have to say we’ve been having lots of difficulties with Fly.io since we migrated, and I’m not sure it has gotten any better. We reduced our hosting costs a lot, but the hours spent on DevOps went up a lot, mostly due to broken deployments. Often we need to get new code out fast, and when it doesn’t work, we have to resort to desperate measures to make deployments work: disabling migrations to maybe make it easier for Fly, reducing the number of VMs, trying different VM sizes, moving regions, etc.