Unreachable Servers

I noticed over the weekend that my servers kept getting scaled down to 0, despite having the following in my configuration under [[services]]:

  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 1
  primary_region = "SJC"

I decided to remove the auto scaling from my configuration and redeploy, and now my servers are unreachable despite every build succeeding and the dashboard saying everything is up. I also attempted to manually scale down to 0 and back to 1 to try to clear anything that might have been left behind from an earlier build, and it is still unreachable. I’m at a loss as to what to do since everything looks fine from the outside but clearly is not. Everything was working fine for months before this weekend.

Hi bpancost,
We’re having issues with routing, as a result of some internal components going down. We’re updating our status page to more accurately reflect the issue.

Any updates on that update? My app hasn’t been accessible since a deploy about 30 minutes ago. Basic health check fails, restarting does nothing, destroying and recreating the machine does nothing.

ffs this is absurd. I did a deploy and now I’m down… IF YOU’RE DOWN, DON’T LET ME DEPLOY, AND SAY WHY.

I’m migrating to ECS tomorrow, this is absurd that it’s taken literally HOURS to get your service back online.

I’m replying to my own comment here because I’m frankly unsure what’s going on. The status page says that app logs are delayed, but my entire app is down. Can you please clarify whether these events are related?

This is not how you communicate an outage in 2024.

I believe they are related. I’m experiencing the same thing as you: app down ever since a deploy that occurred during the incident. I’m similarly frustrated that 1) I was able to deploy when it’s impossible to bring the new machine back up, and 2) that the status page is so unclear about what’s going on, whether progress is being made, etc.

Yeah, this is really unfortunate. I had some people raise an eyebrow at me when I said I was going to use Fly, and now I’m getting it.

Well at least my servers are back up at this point. What is left I suppose is that my last machine kept stopping despite the configuration clearly indicating that one machine should always be left online.

I redeployed to get rid of the auto scaling since it kept turning off all of my machines, but now my servers are unreachable. It is truly unbearable to have major outages for hours 2 days in a row.

Hello everyone,

Most of the issues raised in this thread seem attributable to the DDoS which we were suffering this week; more details here. I don’t, however, want to overlook the issue raised in the initial post, where the autoscaling feature was not working.

Autoscaling should be reliable and it sounds like you were having issues before the DDoS began. If you would like to talk more about what you’re facing I’d be happy to investigate if you’re still having autoscaling trouble.
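
For anyone comparing their own setup against the configuration in the original post, here is a minimal sketch of how these keys are usually laid out in fly.toml. It is an assumption about the intended setup, not a diagnosis of the problem reported above: the app name and internal_port are placeholders, and the service is shown as [http_service] for brevity. As far as the Fly.io documentation describes it, primary_region is a top-level key rather than a service-level one, region codes are written in lowercase, and min_machines_running only applies to Machines in the primary region.

```toml
# Hypothetical fly.toml; the app name and port are placeholders.
app = "my-app"
primary_region = "sjc"          # top-level key, lowercase region code

[http_service]
  internal_port = 8080          # placeholder: the port your app listens on
  auto_stop_machines = true     # let the proxy stop idle Machines
  auto_start_machines = true    # start Machines again on incoming requests
  min_machines_running = 1      # keep at least one Machine up in primary_region
```

If the last Machine still stops with a layout like this, that would be worth reporting along with the app's region and Machine list.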

