Hey @fly, I’ve noticed a sudden decrease in throughput on my app. Screenshots below:
(Please ignore the tooltip on this one, it isn’t relevant)
I’m concerned because this app has been running for a long time without any changes. I’ve not deployed an update for quite a while, I haven’t changed any of the scaling settings, and I believe that my account is paid in full.
It would be great if someone at Fly.io could take a look and let me know if anything changed internally that requires some changes from my side to compensate.
This could be a new bug in our proxy. I’m restarting it everywhere. I think this will help if the bug is what I think it is.
Looks like all traffic is going to 2 instances (out of 20).
Yep, it looks like throughput is back up to expected levels.
Is this something that I can expect to experience again in the future, or more of a one-time issue related to a deploy of new proxy code, @jerome?
It might happen again before we fix it (once we’ve found the root cause), but of course you shouldn’t expect issues like these to happen. This was not intended.
Cool, I’ll reply on this thread and @ you (and anyone else you think should be alerted) if it happens again.
Thanks for the speedy fix!
Hey @jerome it has happened again. My app is down to around 15% of normal throughput. Speedy assistance would be appreciated!
Looking into a more permanent fix this time. We have to keep the current issue ongoing while we investigate though!
Thanks for the time you’re spending looking into this! Much appreciated.
I’ve looked at this today and I deployed a change that I hope will resolve the issue. Right now things are looking nice and healthy again, but since this bug takes a couple days to show up, don’t hesitate to ping us again if/when it does!
Hey @amos, @jerome, looks like the fix hasn’t worked, throughput has plummeted again!
Looking again, thanks for the ping!
We pushed out another fix earlier and are monitoring the situation. Do ping us if we doze off!
Hey @amos & @jerome, looks like we’re getting some odd behaviour again.
And the last 6 hours:
To clarify, there have not been any new deploys in the last month, at least, and we’re running a good number of instances.
Any insight you can provide would be wonderful!
The gaps in metrics are explained by recent incidents with the metrics cluster: