App resiliency feature round-up

Over the last few months, we’ve added some features that can help your apps be more resistant to hardware failures and outages.

We know that it can be difficult to get the big picture of what these features are and how they work, so we added a new reference doc:

Let us know if we missed anything or if we can make a topic easier to understand!


PS - Here are the Fresh Produce posts about the features that we cover in the doc:

In our case, a scaled-down machine comes right back up (this has been the case for months now). Do we get a discount till the bug is fixed? :slight_smile: I may be wrong but I think our bills could be 40% or so less than what we have been paying.

When readiness health checks fail, are requests routed to machines in other regions or in the same region? What if there aren’t any machines in the same region, but some exist in other regions?

Thanks.

They will be routed to any healthy machine, even one in another region.
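For context, here is a rough sketch of how an HTTP health check might be declared in fly.toml. The path, port, and timing values are placeholder assumptions for illustration, not recommendations, so check the reference doc for the exact options your app needs.

```toml
# Hypothetical example: internal_port, path, and all durations are assumed values.
[http_service]
  internal_port = 8080

  [[http_service.checks]]
    grace_period = "10s"   # time allowed for the app to boot before checks count
    interval = "30s"       # how often the check runs
    timeout = "5s"         # how long to wait for a response
    method = "GET"
    path = "/"
```

A machine that fails a check like this should stop receiving requests, which then go to other healthy machines as described above.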

In our case, a scaled-down machine comes right back up (this has been the case for months now). Do we get a discount till the bug is fixed? :slight_smile: I may be wrong but I think our bills could be 40% or so less than what we have been paying.

I don’t have a timeline or any info about this issue at the moment, but we haven’t forgotten about it.

While we’re looking into the reasons why auto start and stop might not be working in some cases, you can set both auto_stop_machines and auto_start_machines to false in the fly.toml file.
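As a minimal sketch, assuming your app uses an [http_service] section (the internal_port value here is just a placeholder), the change might look like this:

```toml
# Hypothetical fly.toml fragment: internal_port is an assumed value.
[http_service]
  internal_port = 8080
  auto_stop_machines = false    # don't stop idle machines automatically
  auto_start_machines = false   # don't start stopped machines on incoming requests
```

If your app defines [[services]] sections instead, the same two settings would go in each of those. The change should take effect on your next deploy.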

If you’re concerned that your usage was affected before turning the feature off, then you can contact billing@fly.io and let them know what happened.

It isn’t just auto-start / auto-stop. Machines have been waking up even without traffic, or sometimes just to serve a single connection. They then get taken down by our code for being idle, and are immediately spun back up for a single connection or even no connection at all. This keeps repeating, and has been happening ever since we began using it (Oct 2022; ref).

That’s actually auto_start_machines; it just defaults to true. The auto_stop_machines logic now attempts to stop as many machines as it can and concentrate requests/connections on as few as possible.