Yeah, same problem, all the apps down and fly website as well.
Same. My app is down, Fly API is also down.
$ fly status
Error failed to get app: Post "https://api.fly.io/graphql": dial tcp 220.127.116.11:443: connect: connection refused
Hey, we did a bad deploy and are reverting it now. Sorry about that.
Yea you’re right. It’s all apps.
thought it was just me for a hot second, but yeah - fully down it seems. happy Friday!
small nit/request: it would be cool to see https://status.flyio.net/ get updated asap w/ stuff like this. I personally don’t host anything that crucial on here (yet!) but that’s where I check first
We’re investigating: Fly.io Status - fly-proxy outage
Fly guys, this never ceases to amaze me: there is not a single week without one (or now all) of my apps being down at least once or even twice.
Your service is measurably getting worse with time, not better.
I would think twice before hosting something crucial - I do …
Here is an example of what I am talking about: reoccurring error - could not find an instance to route to - #4 by michael3
I am paying for an HA Postgres cluster, and a couple of days ago my app was crashing because it could not connect to the database.
I am currently in the process of migrating away from Fly, and I would stay away at least until the platform stabilizes and the staff start looking into the problems of paying customers instead of ignoring them after a basic reply.
It is a pity since I really like the Fly hosting model and the simplicity of the platform.
All seems fine on my end again
This is not acceptable if Fly.io wants to be taken seriously. No other provider I have ever dealt with has had all regions go down. Roll out your proxy by region, or have smaller test environments in each region. Have a health check that reverts if an upgrade goes bad. In 2023, there is no reason for an outage like this. It concerns me if there are no checks in place to prevent this. I have switched one of the seasoncast.com API services (my app) to Fly from DigitalOcean… Friday is the biggest day of the week for my service. I am considering moving it back after this.
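A minimal sketch of the region-by-region rollout with automatic revert suggested above, in Python. Everything in it is hypothetical: `deploy`, `health_check`, and `rollback` are placeholder callables and the region names are only examples. It is not a description of how Fly actually ships fly-proxy.

```python
from typing import Callable, Iterable, List


def staged_rollout(
    regions: Iterable[str],
    deploy: Callable[[str], None],        # hypothetical: push the new version to one region
    health_check: Callable[[str], bool],  # hypothetical: True if the region looks healthy
    rollback: Callable[[str], None],      # hypothetical: restore the previous version
) -> List[str]:
    """Deploy one region at a time; revert everything on the first failed check.

    Returns the regions left running the new version
    (an empty list if the rollout was reverted).
    """
    updated: List[str] = []
    for region in regions:
        deploy(region)
        updated.append(region)
        if not health_check(region):
            # Revert in reverse order so the most recently touched region recovers first.
            for done in reversed(updated):
                rollback(done)
            return []
    return updated


if __name__ == "__main__":
    # Simulate a bad upgrade that only shows up in the second region.
    result = staged_rollout(
        ["fra", "iad", "syd"],
        deploy=lambda r: print(f"deploy {r}"),
        health_check=lambda r: r != "iad",
        rollback=lambda r: print(f"rollback {r}"),
    )
    print("regions on new version:", result)
```

The point of the design is that a bad build stops at the first unhealthy region instead of reaching every region at once.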
Not sure. Incidents happen to every cloud platform, I have experienced a few on Heroku and AWS.
Some months ago, Heroku had an issue with DNS resolution, which brought down dozens of our apps for more than an hour.
I would argue Digital Ocean and Fly have different purposes
Why not both? I found Fly Machines exciting and flexible for so many use cases.
@michael3 @themaxsmith I totally understand your frustration here, outages suck. We’re a pretty new company, and it’s not fun to have to deal with someone else’s growing pains. It makes total sense if you decide to migrate off our platform, please let us know if there’s any way we can help if that’s what you end up doing.
@michael3 if you’re interested in us helping configure your app for HA postgres resiliency, we do have paid email support available now. Our infrastructure is somewhat unique, and it’s not always intuitive how to configure apps/dbs for HA.
There’s no reason an app should go down when a single Postgres instance fails. This is a somewhat common issue we help people with every day. But I get if you’re too frustrated to work with us, too.
All valid points. Things shouldn’t break as often as they do. There isn’t much clarity at all sometimes (ex). And from observation, Fly seems to have a culture of releasing quickly and not really over-engineering stuff. I’ve called on them for more deliberation and the need to respect the scale of their operations once before, but it isn’t all that bad either.
…when the book hits the real world sometimes you find new failure modes, software has bugs, or humans find creative mistakes. It’s also very hard to build global scale systems with zero possibility of global failure. But every time a crack is found, you learn something and do what you can to eliminate the whole class of related failure modes.
That’s a comment from a Googler on GCP’s global outage: Obviously not authorized to release more details than have already been made pub... | Hacker News (2019).
Speaking from personal experience, I was on the team when DynamoDB (2015) and Elasticsearch Service (2018) went down nearly globally (it was just IAD for both, which is also the “primary” region, which meant a lot of other unexpected things also happened)… CloudFront also faced its own share of terrible outages over the years, and the learnings from them were distilled into an internal-only talk at the time, which was so popular within AWS that it was eventually presented at re:Invent 2016: AWS re:Invent 2016: Design Patterns for High Availability: Lessons from Amazon CloudFront (CTD303) - YouTube
This stuff isn’t simple to accomplish for a team as small as Fly. I mean, the CEO is still replying to customer support emails and forum posts.
I’m confident once they staff up, things will considerably improve.
Still problems with the FRA region. Our apps have been unreachable for 2 days now.
@tobias which apps are you having issues with?
Emailed the support address for org: vinden
I am having trouble understanding how I am supposed to set up a MANAGED Postgres cluster for HA - shouldn’t that be handled on your side? Furthermore, I am running a Phoenix application that has been set up to use the Dockerfile and config generated by flyctl along with the configuration options. I have also gone through your docs and I do not see any steps related to the database that I have ignored.
Exactly my thinking. Where we differ is that I do not believe I should need to buy paid support in order to get someone to have a look at something that is crashing multiple times a week.
My latest database issue was that every database query ran into a 2 sec timeout for two days straight, and it was magically fixed by manually restarting the database cluster - please don’t tell me that this is an error in my application…
Lastly, if there is indeed something that needs extra configuration, e.g. some Ecto options that mitigate these issues, then it should be in the docs and not provided by paid email support. I have followed the deployment guide for Elixir apps in your docs to the letter, and yet this keeps happening.
I am frustrated, that is true, and yes, I am moving away from Fly. But that does not mean I am not willing to work with you guys - I would be glad to get in touch with one of your staff, provide logs and help to identify the problem. I am just not willing to pay for the ‘privilege’ of helping you. I pay more than enough right now for a very sub-par service.
You can’t because Fly doesn’t offer a managed Postgres solution.
Fair point, thank you for pointing that out. I had not seen that article before and just assumed that since it is provisioned by Fly, it is managed.
Although my issues with the service are somehow not mentioned on that page.
- My cluster has not run out of memory - the application serves a very low number of users atm.
- My cluster has not run out of space - the database is currently under 500MB in size, so that is out of the question.
The cluster itself was created less than two months ago, and I have not installed any plugins, nor have I messed with updating an image or similar tasks - I have been using it for less than two months in the vanilla HA configuration set up by Fly, and I have still experienced outages (network issues IMO) multiple times.
Although it is not a managed service, it is still something provided by Fly on their own infrastructure and proclaimed to be high availability - I am probably naive, but I expected it to work (at least until I messed with it by e.g. updating the image)