Increasing Apps V2 availability

New v2 apps created by fly launch (v0.0.528+) now launch enough machines to provide increased availability in case of hardware failure, all while keeping costs down (even lower than before).

The key features we are putting together are automatically starting and stopping machines as load goes up or down (more in this post), and something brand new we are introducing today: standby machines.

What are Standby Machines?

In short, standby machines are stopped machines that will be started only if the machines they are watching become unavailable due to a hardware failure.

That is, the standby machine will be dormant, not consuming resources, not adding costs, watching and waiting until its primary has a serious host problem like a disk failure or power outage. Only then will it wake up and take over.

UPDATE (after comments):

  1. Standby machines and autostart/autostop features are not related.
  2. They are used together to improve availability, but the latter doesn’t imply the former and vice versa.
  3. In the presence of “services”, fly launch will never create standby machines.
  4. Services always use normal machines, and Fly Proxy will only control their state (stopped/started) if you enable the autostart/autostop flags on the service.
  5. All of this is per process group (the section named [processes] in fly.toml). If it is missing, it is implied that you have only one process group named app.

Why now?

This comes in response to recent reliability issues; the goal is to minimize the impact on applications when hardware failures take down nodes across our fleet.

We believe that by combining Fly Proxy’s new ability to automatically start and stop machines where services are involved with standby machines where machines are out of Fly Proxy’s reach, the general resilience of apps to hardware failure will improve substantially.

How does it work?

When deploying your application for the first time, we take the following actions:

  1. Start 2 machines for process groups with services, and also enable auto start/stop to scale them automatically and save costs
  2. Start one always-on machine and one standby machine for process groups without services. Stopped machines don’t add to the bill.
  3. No matter what, start only 1 machine if the process group has mounts

Confused? Let’s see an example:

app = "myapp"

[processes]
  app = ""
  disk = "sleep inf"
  task = "sleep inf"

[[mounts]]
  source = "disk"
  destination = "/data"
  processes = ["disk"]

[http_service]
  internal_port = 80
  force_https = true
  auto_stop_machines = true
  auto_start_machines = true
  processes = ["app"]

This application has 3 process groups:

  • The “app” group serves an HTTP service
  • The “task” group has no mounts or services
  • The “disk” group has mounts (volumes attached)

Once deployed, it will create 5 machines: 2 for “app”, 2 for “task”, and 1 for “disk”.

See the output of fly status a few minutes after launching

$ fly status
App
  Name     = myapp
  Owner    = personal
  Hostname = myapp.fly.dev
  Image    = library/nginx:latest
  Platform = machines

Machines
PROCESS ID              VERSION REGION  STATE   CHECKS  LAST UPDATED
app     1781329f13e289  1       iad     stopped 1 total 2023-04-19T18:11:39Z
app     e784ee77c47e68  1       iad     stopped 1 total 2023-04-19T18:11:05Z
disk    32874572ae2e38  1       iad     started         2023-04-19T18:10:56Z
task    3d8d501f724289  1       iad     started         2023-04-19T18:11:01Z
task†   e784ee79f41378  1       iad     stopped         2023-04-19T18:11:05Z

  † Standby machine (it will take over only in case of host hardware failure)

Note how all machines in the “app” group were stopped by Fly Proxy due to the lack of requests going into the http service.

Similarly, the “task” machine with id e784ee79f41378 is in stopped state, and it was never started, because it is a standby machine for 3d8d501f724289, which is started and running healthy. In case the host of the latter has a hardware failure, the former will take its place.

Oh, worth pointing out that the “disk” machine is on its own. It is not safe to run two machines for a stateful group, so we don’t do it. flyctl won’t create more than one machine by default, but you can with fly machine clone :wink:.
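
If you do want a second machine for a stateful group anyway, here is a minimal sketch reusing the machine ID from the fly status output above (keep in mind that Fly volumes attach to a single machine, so the clone gets its own volume and keeping data in sync is up to your app):

# Hypothetical: manually add a second machine to the stateful “disk” group
$ fly machine clone 32874572ae2e38 --region iad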

That’s all folks.

Happy HA setup to everyone!

18 Likes

So, Fly would create these standby machines (for service and non-service tasks) only if both auto_stop_machines and auto_start_machines are set to true, or by default for all new v2 apps (if so, I think, folks might freak out seeing +1 machines they didn’t provision)?

1 Like

Nope. fly launch creates them and only on first deploy.

2 Likes

Would you be able to provide some resources to migrate an existing V2 app to this configuration?

2 Likes

The autostart/stop settings are easy to migrate: first scale up your app with fly scale count 2 (assuming it has only one machine now), then add the following lines to the services defined in your fly.toml:

auto_start_machines = true
auto_stop_machines = true

Finally, run fly deploy to make it real.
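
To make that concrete, here is a rough sketch of where those lines would go for an app with a single TCP service (the port and protocol here are placeholders; adjust them to match your own service definition):

[[services]]
  internal_port = 8080
  protocol = "tcp"
  auto_stop_machines = true
  auto_start_machines = true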

I’m afraid adding standby machines with flyctl outside of the first deploy is not possible yet, but stay tuned over the coming days.

3 Likes

Hmm, I am a bit confused: is fly scale count 2 valid for v2 apps using machines? I thought the only way to add additional machines for HA was to use the fly machines clone command?

As for the flags, what behavior would be gained by adding these two flags? The reason I ask is that it seems these flags could either be adding an “autoscaling” feature, or they could be adding the sit-around-and-wait-for-outages standby machine behavior?

I am trying to gain the standby machine HA that will take over if the primary machines are no longer accessible by the proxy.

Also, can we place standby machines in other regions in the case that an entire region is down? This all might also relate to my latest post: https://community.fly.io/t/better-understanding-best-practices-for-ha-for-both-web-apps-and-pg-apps

Thanks!

1 Like

Oh! Sorry for the confusion. That is on me too, because I implemented v2 support for fly scale count and didn’t post here about it. It has one caveat though: it will fail if you try to scale a process group with mounts; it can only scale mount-less groups.

The auto_start_machines and auto_stop_machines flags belong to services and are only for groups with services; they don’t relate to standbys at all.

Standby machines are only meant for groups without services. The point is that machines in groups without services can’t be watched by Fly Proxy, so the alternative way to provide increased availability is to create a dormant machine that can take over when the primary is not available due to a catastrophic event :bomb:

If your app has services (and no mounts), use fly scale count --region REGION app=N; if your app has mounts, you still have to use fly m clone --region REGION ID. Fly Proxy will take care of routing in case a whole region goes down.
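
With the example app from the original post, that would look roughly like this (region and count are just illustrative; the mounted “disk” group would still use the fly machine clone command shown earlier):

# Scale the mount-less “app” group to 2 machines in iad
$ fly scale count --region iad app=2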

Remember, standbys are only for process groups without services. I will come back later to answer this.

I don’t mind giving you some recommendations if you share your fly.toml with us.

Yes, definitely, it took far too long for me to process this with all the groups, etc. :slight_smile:

If we manually set auto_start/auto_stop, will the machine be marked as a standby? Or is “standby” a special state? Do we even need to know about this concept if we set auto_start/auto_stop appropriately?

1 Like

Sounds cool.

How well does this handle the thundering herd? Are resources reserved?
Same question for the scale-to-zero behavior mentioned using auto start/stop.

I can imagine a case where a host goes down and then a large number of standby machines are triggered; if resources aren’t reserved, then there’s the prospect of your app going down because a standby machine is unable to start.

Standbys and autostart/stop are not related. The former has nothing to do with the latter and vice versa.

Standby is not a special state either; it is a machine property that instructs flyd (the backend component that runs on every host and supervises machines) to keep the machine stopped until some other machine in the same process group is unavailable due to a host failure.
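
Purely as an illustration (not something shown in this thread), you could peek at a standby machine’s config through the Machines API; I’m assuming here that the property shows up as a standbys list naming the watched machine, which isn’t confirmed above:

# Hypothetical check using the machine IDs from the fly status output earlier
# (field location in the config is an assumption on my part)
$ curl -s -H "Authorization: Bearer $(fly auth token)" \
    "https://api.machines.dev/v1/apps/myapp/machines/e784ee79f41378" | jq '.config.standbys'
["3d8d501f724289"]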

Standbys are not used for machines in process groups that serve services; it doesn’t make sense to do it. It’s much better to launch 2+ non-standby machines and let Fly Proxy start and stop them based on load.

STANDBYS are not created for groups that serve services; it doesn’t matter whether AUTO_START or AUTO_STOP are enabled or disabled for those services. From the moment a machine has a service, standbys are out of the picture.

1 Like

Ah, ok.

I don’t really understand why this was built though. The standbys only work on non-stateful, non-service groups, which means they could be re-created on any host anywhere with a docker pull. So why not do that instead? That would re-create the behaviour of v1 Nomad. I’m sure there are technical reasons (flyd only knows about local machines, etc., and this was easy), but it seems very interim.

The true win, however, is that you’ll never have to explain or write docs about it because people will already know what to expect.

7 Likes

It’s maybe-interim. The advantage of cold standby machines is that they don’t require any kind of shared-state orchestration. So far, we’ve not introduced any server-side orchestration for machines, except for some basic proxy logic that can shut them down based on load.

This is a reliability win. Orchestration is complicated, and extra complicated in a global environment. If you set up a standby machine in Sydney that monitors another machine in Sydney, it will do what you expect close to 100% of the time.

The real question is whether anyone else will use this machines feature. It’s pretty old school. People have been doing cold standbys for ~25 years now. I like it because it’s easy to understand, has two moving pieces, and I can “see” what happens with it. But people still might want more magic than me. :slight_smile:

@charsleysa this doesn’t create any extra thundering herd issues. It might actually reduce them. When you create machines over time, we spread them around. If a single host in a region goes down, I would expect standby machines to come up on a bunch of different hosts. And for bonus points, we know standby machines exist, so we can account for them when we manage capacity.

That said, if you create very large standby machines, you risk them not being able to start when they need to. We didn’t really design this for very large machines, though. This feature is intended for smaller, single instance apps people tend to run as hobby projects. They expect their background machines to keep working even when hardware fails.

5 Likes

One thing I am still confused by is what type of apps can utilize this: web servers? These do not have shared state, but they have services, and it seems they would not be able to utilize this feature?

They can! But I can’t think of why you’d want to? If you have services already, the proxy will happily start machines for you. The proxy already routes traffic to another machine if a host dies. It’ll also start machines to “scale”.

There’s no reason a web facing machine can’t technically use the standby feature, though.

Ah! That makes perfect sense. So if a machine is failing, the proxy will not only route to a healthy machine, but it will spin up a new machine if there is no alternative healthy machine available, therefore there is no need for a standby machine?

Unless I’ve misunderstood something, I don’t believe anything in the v2 platform is created at runtime. You have to create it yourself where you want it, and then the proxy will turn it on/off. For non-web groups, you can create a second machine which flyd will turn on/off.

Totally get it, and it’s definitely more reliable than the alternative. Erring on the side of reliability is always a good thing. And it also avoids any registry issues, since the image is already there, etc.

However, as you can see from the questions in this thread, it’s about expectations. On a VPS / dedicated server, people expect local volumes. On a cloud provider, people expect EBS-like behaviour. The hardware paradigm is there, but people have really forgotten about it.

On the one hand, keeping things consistent is super-important. Always create machines wherever you want and Fly will just start / shut them down. And since everyone understands what they’re getting into (local disks, etc.), there is no magic, and the user can make everything super-reliable because they understand what they’re being provided. I mean, I actually helped build a platform in 2012 on ephemeral volumes based on Redis because EBS would go down too often. So I’m cool with it.

On the other hand, at what point do you say user expectations have evolved and since ChatGPT/AI will be the end of civilization as we know it, just bite the bullet and do the standby machine in the background in the same region and don’t make the user think about it. Or do a pull.

After writing this all out, I’m actually more on the side of keeping things consistent with less magic and clearer understanding of the paradigm - the fly way of doing things. However, given the state of the world, I’m pretty sure I’m in the minority.

3 Likes

There are a lot of us who are skeptical of the magic! The good news is, I think there are enough for a good company. The better news is, we can do magic on top of it. :wink:

1 Like


In v2, after initially launching an app, I would like to change the concurrency settings to “requests”. If I add the following sections to my toml and redeploy, fly config show does not include these settings and they are not effective.

[services.concurrency]
  type = "requests"
  hard_limit = 25
  soft_limit = 20

[http.services.concurrency]
  type = "requests"
  hard_limit = 25
  soft_limit = 20

Hey @jflamy-dev, it looks like you have a typo in your toml additions. The section you need is http.service.concurrency, not services :slight_smile:

Edit: Whoops, I had a typo myself! [http_service.concurrency] is the correct value

[http_service.concurrency]
  type = "requests"
  hard_limit = 25
  soft_limit = 20

Should get you sorted

1 Like