I will be moving away from Fly soon. Our service was hugely affected. We run a small Rails app that was performing well and gets barely 200 requests per day, yet even that now has trouble working properly. Honestly disappointed!! The cloud business is becoming a loot business.
+1
Fly was really nice before,
but now it's going downhill.
A very simple app crashes once it has to process more than 3 requests.
An email would have been nice for such a critical change, instead of posting here on the forum threads. I had 2 production incidents over the past 2 days, and after this I don't know why I'm even trying to stay on this platform. It is really not production-ready beyond hobby projects…
People are bitter about the change, which is understandable, but as a counterargument I want to share a screenshot of the Apdex (0.5s) metric from one of our services:
As you can see, the Apdex metric is now stable at 0.95, whereas before Fly's throttling caps there were days when it dropped to 0.89.
I think the majority aren’t mad about the change itself; it’s more the lack of communication. The OP posted the update 2 days ago and has been radio silent since.
For anyone else with a Django app that is going down on deploy even though things were working fine before this, I was able to resolve it by running two web-facing instances and adding health checks with a rolling deploy strategy (see the sketch below). The deployment now starts each machine one at a time and runs your health checks to verify things are good before continuing to the next machine. I believe that if your app has really long start-up times, the health checks might time out and fail the deploy, but a failing deploy is better than having an outage. Hopefully this helps someone else.
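For reference, roughly what the relevant fly.toml bits could look like. The port and /healthz path are just placeholders for a typical Django app, and the second instance itself comes from `fly scale count 2`:

```toml
[deploy]
  strategy = "rolling"        # replace machines one at a time instead of all at once

[http_service]
  internal_port = 8000        # assumed Django/gunicorn port
  force_https = true

  [[http_service.checks]]
    method = "GET"
    path = "/healthz"         # hypothetical health-check endpoint
    interval = "15s"
    timeout = "5s"
    grace_period = "30s"      # give slow boots some time before checks start counting
```

With this in place, a machine that fails its checks stops the rollout before the remaining instance is touched.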
The shared CPU cores are effectively useless for any app that does work across cores, and the performance-core pricing is high enough that we will be moving off Fly to a more traditional VPS setup.
Death by a thousand cuts at Fly; this was the last one.
Is this 5 seconds at 100% for each CPU? Just wanted to confirm if the initial balances scale based on the size of the machine.
According to the docs, quotas are shared:
Quotas are shared between a Machine’s vCPUs. For example, a shared-cpu-2x Machine is allowed to run for 10ms per 80ms period, regardless of which vCPU is using that time.
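A quick sanity check of those numbers (10 ms per 80 ms for 2 vCPUs works out to a 6.25% baseline per vCPU; the non-2x sizes below are extrapolated on the assumption that the budget scales with vCPU count):

```python
# Sanity-check the shared CPU quota figures quoted above.
period_ms = 80               # accounting period from the docs quote
baseline_per_vcpu = 0.0625   # 6.25% baseline per vCPU

for vcpus in (1, 2, 4, 8):
    budget_ms = period_ms * baseline_per_vcpu * vcpus
    print(f"shared-cpu-{vcpus}x: {budget_ms:.0f} ms of CPU time per {period_ms} ms period")

# shared-cpu-2x -> 10 ms per 80 ms period, matching the quote,
# and that budget can be spent by any combination of the vCPUs.
```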
I originally misunderstood that as well, but I recently downgraded some single-threaded Node servers to shared and can confirm that even though one core sits at 20% utilization, my quotas are fine.
Yes, like the baseline quota and max balance, the initial balance also scales with the number of vCPUs, so you can run a shared machine full-throttle (all vCPUs at 100%) for X (=5) seconds on start regardless of size. (Actually it can run for X/0.9375 seconds; since the shared machine continues to accrue quota at the 6.25%-per-vCPU baseline, it only draws down 93.75%…)
After some internal discussion earlier today and based on feedback in this thread, we shipped a small update to increase this initial balance from 5 to 50 seconds. This quick change should immediately make deploys a bit smoother for apps with a heavier CPU load during boot, while we work on a longer-term solution to preserve balances between machine versions (or some other way to provide and account for burst when needed).
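To put numbers on both balances (a quick sketch using the 6.25% baseline from the post above):

```python
# How long a shared machine can run all vCPUs at 100% on its initial balance.
baseline = 0.0625            # quota keeps accruing at 6.25% per vCPU while bursting

for initial_balance_s in (5, 50):   # old and new initial balances
    burst_s = initial_balance_s / (1 - baseline)
    print(f"initial balance {initial_balance_s} s -> ~{burst_s:.1f} s at full throttle")

# 5 s  -> ~5.3 s of full-throttle boot
# 50 s -> ~53.3 s of full-throttle boot
```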
I voiced the same concern with support. For me, it was the balance reset when you trigger a scale memory change that caused an outage. Apparently the balance resets to 1%, which of course gets blown through immediately at startup.
This was changed to 10% as far as I know, so that’s much better.
I do suggest figuring out how to delay certain startup tasks to flatten out the CPU usage (a rough sketch of the idea is below). I still have some work to do myself to accomplish this.
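Something along these lines, not Fly-specific and with placeholder task names: start serving first, then run the heavy warm-up work in a background thread with pauses, so the balance can recover instead of being drained in one burst.

```python
import threading
import time

def load_caches():               # placeholder for an expensive startup task
    time.sleep(0.1)

def prefetch_reference_data():   # another placeholder task
    time.sleep(0.1)

def warm_up():
    # Run the heavy tasks sequentially, with pauses between them, so the
    # CPU quota balance keeps accruing instead of being spent all at once.
    for task in (load_caches, prefetch_reference_data):
        task()
        time.sleep(1)

# Kick off warm-up in the background once the server is accepting requests.
threading.Thread(target=warm_up, daemon=True).start()
```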
This sounds great. I’ll reconfigure my dev app back down to the smaller size and see what happens!
Agreed. I hadn’t been following the forums all that closely and wasn’t aware this change was in the works. Luckily I’m not running in production yet, as we’re still finishing product development (which is also why we didn’t notice the change while it was being tested late last year). But had I been in production, the first deploy after the throttling changes went live would have caused an outage. An email or other form of communication would have given us a heads-up and enabled us to temporarily scale up in anticipation of any potential problems.