Something not right on Fly.io

I’m also having issues with my WireGuard peer, and I can’t SSH into my application at all now.

I have the same issue. I haven’t been able to deploy any changes for the past few hours.

Me too (Django app).

Yeah, I seem to be stuck too with a Rails app.

This is with debug logging enabled:

Running release task (pending)... 🌍DEBUG --> POST https://api.fly.io/graphql

{
  "query": "query ($id: ID!) { releaseCommandNode: node(id: $id) { id ... on ReleaseCommand { id instanceId command status exitCode inProgress succeeded failed } } }",
  "variables": {
    "id": "rcmd_v0or2w9dg18y9gxk"
  }
}

DEBUG {}
Running release task (pending)... 🌎DEBUG <-- 200 https://api.fly.io/graphql (214.83ms)

{
  "data": {
    "releaseCommandNode": {
      "id": "rcmd_v0or2w9dg18y9gxk",
      "instanceId": null,
      "command": "bin/rails fly:release",
      "status": "pending",
      "exitCode": null,
      "inProgress": true,
      "succeeded": false,
      "failed": false
    }
  }
}

and just keeps going on and on like this.
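For anyone scripting around this, here is a minimal sketch of how the response above can be read. It relies only on the field names visible in the debug output; the helper name `describe_release` is hypothetical, and treating a null `instanceId` as "no VM has picked up the command yet" is my assumption, not documented Fly behaviour.

```python
import json

# Response body copied from the debug log above.
RESPONSE = """
{
  "data": {
    "releaseCommandNode": {
      "id": "rcmd_v0or2w9dg18y9gxk",
      "instanceId": null,
      "command": "bin/rails fly:release",
      "status": "pending",
      "exitCode": null,
      "inProgress": true,
      "succeeded": false,
      "failed": false
    }
  }
}
"""


def describe_release(body: str) -> str:
    """Summarise a releaseCommandNode response (hypothetical helper)."""
    node = json.loads(body)["data"]["releaseCommandNode"]
    if node["succeeded"]:
        return "succeeded"
    if node["failed"]:
        return f"failed (exit code {node['exitCode']})"
    if node["instanceId"] is None:
        # Assumption: no instance ID means nothing has picked up the command yet.
        return f"{node['status']}, no instance assigned yet"
    return f"{node['status']}, running on instance {node['instanceId']}"


print(describe_release(RESPONSE))  # pending, no instance assigned yet
```

In the stuck deploys reported here, every poll returns this same "pending" state with no instance assigned.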

Everything is green on https://status.flyio.net/, so I’m not sure if anyone is looking into this. cc @michael

I am also not able to deploy at the moment. Stuck on: Running release task (pending)

Also unable to deploy. Started last night for me.

App is still up, so that’s good :slight_smile:

yeah, same for me. Started around 3 Feb 19:00:00 GMT+1

Yep same thing here. Pretty frustrating but thanks for keeping an eye out

Update on my situation: turns out my app was actually fine but had a different underlying problem.

My app relies on Litestream (whose author also works for Fly, I think), and it seems that the backup process was somehow corrupted by an earlier Fly outage (proxy issue or otherwise). The backup was consequently poisoned and unrecoverable. I had to dig into my other backups in order to recover it fully, losing about 5 days of database activity in the process.

I’m gonna look deeper into the original cause, but it happened in the same outage window as the earlier reports in this thread. If you’re still experiencing this issue, make sure that your data source isn’t corrupted and leaving your application unrecoverable.

Hello,

I’m also seeing intermittent errors, with long downtimes, in several apps:

3rd of February

  • Rails and Elixir apps down at 20:10h UTC. Downtime of ~15 minutes. Came back by themselves.

4th of February

  • Rails app went down at 06:05h UTC. Downtime of 38 minutes.

5th of February

  • Elixir app went down at 22:28 UTC. Downtime of 13 minutes.
  • Elixir app went down again at 22:49 UTC. Downtime of 3 minutes.

This weekend we accumulated a total of 1 hour and 24 minutes of downtime across two apps. And of course we cannot really get a response (we couldn’t get one the last time we wrote about problems here).
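For reference, the total works out as below. This is a small sketch that assumes the first outage is counted once per affected app (my reading of how the figure was reached) and that takes the approximate 15 minutes at face value.

```python
# Outages as listed above: (description, minutes down, number of apps affected).
outages = [
    ("3 Feb 20:10 UTC, Rails and Elixir", 15, 2),
    ("4 Feb 06:05 UTC, Rails", 38, 1),
    ("5 Feb 22:28 UTC, Elixir", 13, 1),
    ("5 Feb 22:49 UTC, Elixir", 3, 1),
]

# Count each outage once per affected app.
total_minutes = sum(minutes * apps for _, minutes, apps in outages)
hours, minutes = divmod(total_minutes, 60)
print(f"{hours} h {minutes} min")  # 1 h 24 min
```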

The errors vary: sometimes the PostgreSQL instance is down, sometimes I cannot even connect to the HTTP service.

What can we do to improve this? My apps are in the paid tier but not generating expenses (yet). Is this the same for paying apps?

The Elixir site is quite popular and receives 1k+ visits per day, so it would be nice to find a solution.

Update

The error I’m seeing as I write this is:

(DBConnection.ConnectionError) tcp connect (top2.nearest.of.MY-DB-NAME.internal:5432): non-existing domain - :nxdomain

so it looks like a DNS resolution issue on the instance :sweat_smile:

Also, it is intermittent: some requests go through and others don’t. The ones going through are probably using persistent connections in the pool.
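To confirm that resolution really is flapping rather than failing outright, something like the sketch below could be run from inside the app instance. It is a rough check and not an official diagnostic: the function name `check_resolution` and its parameters are hypothetical, the port 5432 comes from the error message, and the hostname has to be replaced with your own.

```python
import socket
import time
from typing import Callable


def _lookup(host: str) -> None:
    # getaddrinfo handles both IPv4 and IPv6 answers.
    socket.getaddrinfo(host, 5432)


def check_resolution(
    host: str,
    attempts: int = 5,
    delay: float = 1.0,
    resolve: Callable[[str], None] = _lookup,
) -> list[bool]:
    """Try to resolve `host` several times and record which attempts succeed."""
    results = []
    for i in range(attempts):
        try:
            resolve(host)
            results.append(True)
        except socket.gaierror:
            # Raised when the name cannot be resolved (the :nxdomain case).
            results.append(False)
        if i < attempts - 1:
            time.sleep(delay)
    return results


# Example (placeholder hostname, substitute your own):
# outcome = check_resolution("top2.nearest.of.MY-DB-NAME.internal")
# print(f"{sum(outcome)}/{len(outcome)} lookups succeeded")
```

A mix of successes and failures would point to intermittent DNS, while all failures would suggest the name is simply gone.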

Same issue here :innocent:

This topic was primarily about the outage on Friday. If you’re having issues since then, it’s likely unrelated.

If your app is having reliability issues, please ensure you’re running 2+ database nodes and 2+ application instances.

Also run fly status --all -a <pg-app> and fly status --all -a <app>, and make sure you’re not getting unexpected restarts or VM failures.

If you’re getting delays deploying, this is likely due to intermittent capacity issues in European regions. We’re prioritizing deploys on paid plans. If you are having issues and are on a paid plan, please email the premium support address in your profile and we’ll look into it.

Hi Kurt,

I have a paid account (albeit a pay as you go hobby plan pending launch) but I do not see an email support address in my profile. Where should I look to find that?

Thanks

I don’t know whether it is related, because I’ve been having issues since Friday and my apps have had no new releases for weeks. The latest and most painful one (the DB :nxdomain error) is especially hard, as it is not related to the app.

No restarts at all. A third app of mine (also an Elixir one) has been hitting the DNS issue when trying to connect to the database for the past ~10h. I tried scaling down to zero and starting the DB again, restarting the app… Nothing fixes it.

I’m on the pay-as-you-go plan, so I suppose I have no support available. Is there anything I can do aside from backing it up and starting from scratch?

Seeing this again today.

Also seeing this today: Could not proxy HTTP request. Retrying in 1000 ms

This is in the sea region, but I’m guessing it’s a broader issue. The status page shows an outage around state propagation ("An ongoing upgrade is causing delayed app instances state propagation"), but it’s unclear whether that’s the source of the errors I’m seeing, which seem to have to do with edge routing.