What’s weird is that the fly status command reported everything as healthy, but the app was not responding to requests. So if the Fly dashboard goes down, will all the apps go down?
Everything seems to be back up and running.
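For anyone who wants to catch this kind of mismatch (platform status says healthy, app not answering), here is a minimal sketch of an external probe that you could run from outside Fly. It only assumes the app serves HTTP. The function names and the example URL are hypothetical and not part of any Fly tooling.

```python
import urllib.error
import urllib.request


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as exceptions; keep the status code.
        return err.code


def is_responding(url: str, fetch=fetch_status) -> bool:
    """True only if the app answers with a non-5xx status."""
    try:
        return fetch(url) < 500
    except OSError:
        # URLError, timeouts, and connection resets are all OSError subclasses.
        return False


# Example call (hypothetical hostname, replace with your own app):
#   is_responding("https://example-app.fly.dev/")
```

Running something like this on a schedule from a different network gives a signal that does not depend on the platform's own status reporting.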
API outages don’t normally affect all apps. In this instance, something else broke apps and also took our API down. We’re looking into it, but this was not one of the scaling issues we’ve been fighting with the API recently.
It almost seems like a nomad issue, like scheduling. Thanks for the updates.
I like blaming Nomad for things, but this was something network related. Entirely our problem, in other words!
Was this issue across all networks or would our apps not go down if I was in multiple data centers? I just noticed we were in only one.
This was specifically in IAD and LHR. We’re still trying to figure out the full impact of the network issue. It seems like VMs could connect to some places and not others, and our proxy could connect to some apps and not others.
I don’t think multiregion would have completely protected you from this. It would have helped, but requests to IAD/LHR still would have had issues.
OK, thanks, good to know. I was just wondering since that seems like the beauty of the platform, being able to deploy to multiple datacenters instantly. Props to your team for responding so quickly to the incident. I’m trying to see if the platform would be good for production for my small startup.
Offtopic, but I have a question:
I really like how easy things are for deployment. I’m not sure how, but our Kubernetes cluster is running the Docker image at 1 GB RAM and Fly is running it at about 192 MB RAM. Is Fly just very efficient with RAM?
Having been impacted both by last month’s UDP-kiss-of-death (UDP, outage & DNS) and this problem (LHR), both of which I believe were due to failed network changes/implementations, I’m not entirely convinced either root cause can be traced to the recent influx of Heroku users and their traffic.
The experience is a little frustrating, to say the least.
This was not the result of a Heroku influx. The ongoing API stability issues are.
This is a definite growing pain problem, however. We are a small company and we’re working out processes (and finding new ways things break) every day. We beat ourselves up over these, for what it’s worth, and it’s important we get better.
Without excusing them: I think it’s worth noting that we are actually small and don’t have 10 years of Fly.io infrastructure maturity under our belt. It’s understandable that this is frustrating. I would love to say we won’t frustrate you again, but I don’t think this is the last time we’re going to have a problem like this.
I accepted long ago that I can’t even use company size or revenue as a proxy for reliability. Only this morning I was faced with Cloudflare Worker issues due to “Error: Network connection lost” (Cloudflare Status: Increased HTTP 500 Errors in several locations). It happens to the best of ’em; in this, Fly is not alone.
If AWS’ eu-west-2 falls over in the next ~20 minutes I’ll have a (cloud poker) Full House for the day (BST)…
From my past experiences spending a substantial amount of time setting up and deploying to various platforms, I have found that the fundamentals of the technology coupled with technical capabilities of the development team supporting it matter more than the capacity for failure, which is obviously not unique to Fly.
I haven’t been a Fly customer very long (since the end of August), and yet in that very brief amount of time I’ve been blown away by extremely impressive tech and somehow an even more impressive team behind it—enough so to take the risk to transition all of our production apps over from their most recent home on AWS ECS. (Shout out to Amos who I’ve known from early ooc days on irc and Ben Johnson who I only started following recently after discovering Litestream—which we’re using btw to build the v2 of our platform in Blazor Server + EF Core on Sqlite).
Long story short, as difficult and frustrating as it was for me today to go to my CEO and explain why we went down and how it was out of our control, Fly is promising enough for us to still be worth it. YMMV.
Disclaimer: Our main production app (that also provides apis for our native apps) went down for 25 minutes in the middle of the day today as a result of the networking outage affecting iad, so I have some skin in the game here. As an early-stage startup founder myself, the growing pains are palpable.
A couple of years ago we had 13 hours of downtime on our main production database with Google. The call center turned to flames and customers got real angry.
All hosting and cloud providers have downtime but personally Fly has been one of the most reliable providers I’ve used in 20 years.
It’s quite amusing because all the incident responses have actually improved my view of fly as a provider.
Doing network changes is notoriously difficult even with good processes, but the good news is that it was a partial outage. Having no view of the architecture, it seems to me that their networks are not tightly coupled, which is great. I am kind of scared that someone is going to flub a router or BGP update at some point though, which would be really bad. From my experience, most outages are either human error or unknown unknowns which you wouldn’t be able to imagine anyway (like the UDP one).
And other providers, well, suck. Google takes down their global load balancer once every two years, I think. Scary as bleep and no one to call. AWS is very isolated, but everything seems to have a dependency on us-east-1. I knew the guy who was responsible for running it, and it’s not fun. And I totally forgot, Canada just had a nationwide outage with Rogers…
So having partial degradation and someone actually responding on a forum, well, it’s a breath of fresh air as long as it doesn’t happen too frequently.
Holy shit! Fly didn’t announce it on their blog, did they?
a16z on board… I guess we can expect more crypto workloads on Fly
GPUs must surely be up next!
I moved my app from iad to dfw to sjc. I get 503 in all three.
I don’t think so. I read it when doing my “due diligence” let’s say on the legitimacy of the product. Would love a blogpost explaining where the money is going, out of curiosity
I don’t think there’s a whole blog post here, but it’s going to:
It’s also going into our bank account to sit and buy us time while we build the right business model. Pricing is hard and we need time to do it right. VC money lets us be more deliberate about what we’re selling, instead of spamming you all about $5k/mo enterprise plans you don’t actually need.
The only thing I see that could get out of control with fly pricing is bandwidth usage. It seems like it goes out quickly. I’ve also never watched bandwidth usage this closely.
Interesting! Thank you. The ~$32 price for a dedicated CPU instance is very inexpensive for a PaaS, so I’ve been wondering (hoping) that it’s sustainable.
For comparison:
Heroku: $250
Render: $100 or so, but it’s not guaranteed.
Linode: $30 (but not PaaS, IaaS — I’m currently using this with Dokku.)