Fly down?

Yaeger · January 20, 2023, 11:52pm

I can’t access the Fly sign-in page, and our app is also offline. Am I the only one experiencing this?

crossworth · January 20, 2023, 11:52pm

Yeah, same problem, all the apps are down and the Fly website as well.

ian1 · January 20, 2023, 11:53pm

Same. My app is down, Fly API is also down.

$ fly status
Error failed to get app: Post "https://api.fly.io/graphql": dial tcp 77.83.143.220:443: connect: connection refused

lillian · January 20, 2023, 11:53pm

Hey, we did a bad deploy and are reverting it now. Sorry about that.

Yaeger · January 20, 2023, 11:53pm

Yea you’re right. It’s all apps.

markthethomas · January 20, 2023, 11:54pm

thought it was just me for a hot second, but yeah - fully down it seems. happy Friday!

markthethomas · January 20, 2023, 11:55pm

small nit/request: it would be cool to see https://status.flyio.net/ get updated asap w/ stuff like this. I personally don’t host anything that crucial on here (yet!) but that’s where I check first

tvdfly · January 20, 2023, 11:58pm

We’re investigating: Fly.io Status - fly-proxy outage

michael3 · January 21, 2023, 12:00am

Fly guys, this never ceases to amaze me, as there is not a single week without one (or now all) of my apps being down at least once or even twice.

Your service is measurably getting worse with time not better.

I would think twice before hosting something crucial - I do …

Here is an example of what I am talking about: reoccurring error - could not find an instance to route to - #4 by michael3

I am paying for HA postgres cluster and a couple of days ago my app was crashing due to being unable to connect to database.

I am currently in the process of migrating away from Fly, at least until the platform stabilizes and their staff start looking into the problems of paying customers instead of ignoring them after a basic reply.

It is a pity since I really like the Fly hosting model and the simplicity of the platform.

Yaeger · January 21, 2023, 12:01am

All seems fine on my end again

themaxsmith · January 21, 2023, 12:06am

This is not acceptable if Fly.io wants to be taken seriously. No other provider I have ever dealt with has had all regions go down. Roll out your proxy by region, or have smaller test environments in each region. Have a health check trigger a revert if an upgrade goes bad. In 2023, there is no reason for an outage like this. It makes me concerned that there are no checks in place to prevent this. I have switched one of seasoncast.com's API services (my app) to Fly from DigitalOcean… Friday is the biggest day of the week for my service. I am considering moving it back after this.
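To make the suggestion concrete, here is a minimal sketch of a region-by-region rollout that reverts when a health check fails. It is not how Fly deploys its proxy; `staged_rollout`, `deploy`, and `healthy` are hypothetical names, and the two callbacks stand in for whatever deploy tooling and health probe a platform actually has.

```python
from typing import Callable, Dict, List, Tuple


def staged_rollout(
    regions: List[str],
    deploy: Callable[[str, str], None],   # hypothetical: deploy(region, version)
    healthy: Callable[[str], bool],       # hypothetical: health probe for one region
    new_version: str,
    current: Dict[str, str],              # version running in each region before the rollout
) -> bool:
    """Deploy one region at a time; revert every touched region on the first failure."""
    updated: List[Tuple[str, str]] = []
    for region in regions:
        previous = current[region]
        deploy(region, new_version)
        updated.append((region, previous))
        if not healthy(region):
            # Undo in reverse order so the most recently changed region recovers first.
            for done_region, old_version in reversed(updated):
                deploy(done_region, old_version)
            return False
    return True
```

With this shape, a bad build stops at the first region that fails its check, so the remaining regions never receive it.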

containerops · January 21, 2023, 1:30am

Not sure. Incidents happen to every cloud platform, I have experienced a few on Heroku and AWS.

Some months ago, Heroku had an issue with DNS resolution, which brought down dozens of our apps for more than an hour.

I would argue DigitalOcean and Fly have different purposes.

Why not both? I found Fly Machines exciting and flexible for so many use cases.

kurt · January 21, 2023, 2:02am

@michael3 @themaxsmith I totally understand your frustration here, outages suck. We’re a pretty new company, and it’s not fun to have to deal with someone else’s growing pains. It makes total sense if you decide to migrate off our platform, please let us know if there’s any way we can help if that’s what you end up doing.

@michael3 if you’re interested in us helping configure your app for HA postgres resiliency, we do have paid email support available now. Our infrastructure is somewhat unique, and it’s not always intuitive how to configure apps/dbs for HA.

There’s no reason an app should go down when a single Postgres instance fails. This is a somewhat common issue we help people with every day. But I get if you’re too frustrated to work with us, too.
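As a rough illustration of the idea that one failed instance should not take the app down, the sketch below tries a list of database hosts in order and returns the first connection that succeeds. It is not Fly's recommended configuration: `connect_with_failover` and the `connect` callback are hypothetical, and a real app would more likely rely on its database driver or a proxy for this.

```python
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def connect_with_failover(
    hosts: Sequence[str],
    connect: Callable[[str], T],      # hypothetical: opens a connection to one host
    attempts_per_host: int = 2,
    delay_seconds: float = 0.0,
) -> T:
    """Try each host in order, retrying a few times, and return the first connection."""
    last_error: Optional[Exception] = None
    for host in hosts:
        for _ in range(attempts_per_host):
            try:
                return connect(host)
            except ConnectionError as error:
                last_error = error
                time.sleep(delay_seconds)
    # Every host failed every attempt; surface the last underlying error.
    raise ConnectionError(f"no database host reachable: {last_error}")
```

This only covers opening a connection. Queries that fail mid-flight after an instance dies still need retry handling in the application or its connection pool.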

ignoramous · January 21, 2023, 10:32am

All valid points. Things shouldn’t break as often as they do. There isn’t much clarity at all sometimes (ex). And from observation, Fly seems to have a culture of releasing quickly and not really over-engineering stuff. I’ve called on them for more deliberation and the need to respect the scale of their operations once before, but it isn’t all that bad either.

…when the book hits the real world sometimes you find new failure modes, software has bugs, or humans find creative mistakes. It’s also very hard to build global scale systems with zero possibility of global failure. But every time a crack is found, you learn something and do what you can to eliminate the whole class of related failure modes.

That’s a comment from a Googler on GCP’s global outage: Obviously not authorized to release more details than have already been made pub... | Hacker News (2019).

Speaking from personal experience, I was on the team when DynamoDB (2015) and Elasticsearch Service (2018) went down nearly globally (it was just IAD for both, which is also the “primary” region, which meant a lot of other unexpected things also happened)… CloudFront also faced its own share of terrible outages over the years, and the learnings from them were distilled into an internal-only talk at the time, which was so popular within AWS that it was eventually presented at re:Invent 2016: https://youtube.com/watch?v=n8qQGLJeUYA

This stuff isn’t simple to accomplish for a team as small as Fly. I mean, the CEO is still replying to customer support emails and forum posts.

I’m confident once they staff up, things will considerably improve.

tobias · January 21, 2023, 6:49pm

Still problems with the FRA region. Our apps have been unreachable for 2 days now.

kurt · January 21, 2023, 7:05pm

@tobias which apps are you having issues with?

tobias · January 21, 2023, 7:19pm

Emailed support email for org: vinden

michael3 · January 22, 2023, 8:13pm

Hello Kurt,

I am having trouble understanding how I am supposed to set up a MANAGED Postgres cluster for HA. Shouldn’t that be handled on your side? Furthermore, I am running a Phoenix application that has been set up to use the Dockerfile and config generated by flyctl, along with the configuration options. I have also gone through your docs and I do not see any database-related steps that I have ignored.

Exactly my thinking. Where we differ is that I do not believe I should need to buy paid support in order to get someone to have a look at something that is crashing multiple times a week.

My latest database issue was that every database query ran into a 2sec timeout for two days straight, and it was magically fixed by manually restarting the database cluster - please don’t tell me that this is an error in my application…

Lastly, if there is indeed something that needs extra configuration, e.g. some Ecto options that mitigate these issues, then it should be in the docs and not provided by paid email support. I have followed the deployment guide for Elixir apps that is in your docs to the letter, and yet this keeps happening.

I am frustrated, that is true, and I am moving away from Fly, yes. But that does not mean I am not willing to work with you guys - I would be glad to get in touch with one of your staff, provide logs and help to identify the problem. I am just not willing to pay for the ‘privilege’ of helping you. I pay more than enough right now for a very sub-par service.

Best regards
Michael

sanswork · January 22, 2023, 10:31pm

You can’t because Fly doesn’t offer a managed Postgres solution.