Get In Losers*, We're Getting Off Nomad

(*Nobody here is a loser, we just wanted a Mean Girls reference)

We’ve been making a lot of noise about Apps V2 lately - and not without cause. V2 apps are more reliable, easier to control, and sometimes even cheaper! We’re finally ready to make a push to get everyone off of Nomad (V1) and onto machines (V2).

We’re going to talk a little bit about the “why” for moving over to Apps V2, but we’re also including a rough map of how we expect this process to go and what you need to be aware of during the transition.

Why should I migrate my apps?

We’ve been holding Nomad wrong. It’s great software - it’s also not intended to be used for what we’re using it for. Having custom, in-house orchestration gives us the chance to make everything way more configurable anyway; instead of a black box that takes app configs and spits out VMs, you get sensible defaults and the ability to customize the hell out of your app as you see fit.

In more concrete terms, this means faster deploys, much better reliability, granular control over particular VMs, and much more predictable lifecycles for your apps.
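To make that concrete, here’s a rough sketch of the kind of per-machine control V2 gives you. The app name and machine ID below are placeholders, and flags can vary between flyctl versions, so check fly machine --help:

    # List the individual machines backing your app
    fly machine list -a my-app

    # Stop just one machine, leaving the rest of the app running
    fly machine stop 3d8d9014b32d89 -a my-app

    # Resize a single machine without touching its siblings
    fly machine update 3d8d9014b32d89 --memory 1024 -a my-app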

So what’s the plan?

  1. We’ve flipped the giant “Apps V2 Default” switch for all new orgs. Hooray :tada:
  2. This post informs everyone that we’re serious about moving off of Nomad. We’re serious about moving off of Nomad. You have been officially informed.
  3. On Tuesday (May 16th), we’ll flip another comically oversized lever - this time, it’s the “Apps v2 for all new apps” switch. Once this happens, all new apps end up on the new platform regardless of account age. During this phase, existing Nomad apps behave just the same as before.
  4. We start migrating the remaining V1 apps from the backend in phases, starting with tiny single-instance apps and moving upward in complexity. We will do these in batches so we can properly support you after your migration.
  5. :crab: Nomad is gone :crab:
  6. Profit…?!?

What can I do if something goes wrong?

We’ve been making a lot of internal changes lately to improve our support structure. If something goes wrong, make noise wherever you can (be it a paid support mailbox or the community forum) and we will work to make sure this process goes smoothly for everyone. We’re committed to making sure that you come out on the other end with a more reliable app that’s easier to work with.

If you want to be proactive, go ahead and migrate your apps! Doing this yourself gives you peace of mind, knowing there aren’t any surprises down the road. If you’re worried about this or need extra time, reach out to us and we will help you through it.
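At the time of writing, the self-serve migration is a single flyctl command - roughly the following, run from the directory containing your fly.toml (check fly migrate-to-v2 --help for the current flags):

    # From your app’s directory (where fly.toml lives)
    fly migrate-to-v2

    # Afterwards, confirm the app is running on the machines platform
    fly status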

If you don’t want to migrate your apps yourself, they will be migrated to v2 automatically over the next couple weeks. We’ll keep you updated on our progress so you know what apps are up for migration next.

Once your apps are moved to the new platform, you should be able to use the same commands and Dockerfiles you know and love - just, this time, everything will be faster and more reliable.

TLDR: We are phasing out V1/Nomad apps over the coming weeks, starting with making all new apps V2 apps, then migrating existing apps over to the new platform.


How are you ensuring uptime for apps with volumes? From previous information, apps with volumes go down during the migration.

Thanks for the heads up.

Quick question regarding deployments: with V1, it was possible to use what you call the bluegreen strategy, with health checks.

From what I gathered, V2 deployments don’t have health checks to deploy gradually.

If I remember correctly, with bluegreen, V1 apps used to have automatic rollbacks in case of health check failures.

Are there any plans to bring these deployment strategies to V2?


We don’t currently support bluegreen deployments for Apps v2. There is, however, work ongoing to support canary deployments, which should satisfy the same requirements in many cases.

Apps V2 supports health checks. To give an example of a failure case in an app with rolling deploys: the bogus build is sent to the first machine, flyctl waits for that machine to report its health check status, the checks fail, and the remaining machines are not deployed to. I do believe, however, that this would leave the first machine in a failed state until a new deploy is attempted.
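For reference, here’s a minimal sketch of what an HTTP health check looks like in fly.toml - the port, path, and timings are made-up examples, and the full set of options lives in the fly.toml docs:

    [[services]]
      internal_port = 8080
      protocol = "tcp"

      [[services.http_checks]]
        interval = "10s"
        timeout = "2s"
        grace_period = "5s"
        method = "get"
        path = "/healthz"

During a rolling deploy, flyctl waits for checks like these to pass on each machine before moving on to the next.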


Allison’s response covers most of it, but bluegreen deploys are on our radar.


When do you anticipate this happening?

I’m just curious. I’ve already migrated all my apps to V2.

I don’t have a great answer to that.

Almost the whole apps team is working together to make this happen, but if something goes wrong (or even just makes us nervous about bulk-migrating), we’re going to pause migrating until that’s solved. There’s a lot of trust being put in us, and we do not take that lightly, so we have to be very careful about changes like this to make sure it goes smoothly.

That said, we’re going to be very communicative about this. When we start bulk-migrating, you’ll hear about it on the forum. When we’re done bulk-migrating, you’ll hear about it. If we have to pause, for some reason, we’ll probably talk about it on the forum.
Hopefully that makes up for the lack of a timeline!


We have not been able to migrate any apps with volumes due to capacity issues in ORD.

Any updates on this? These are pretty small volumes (10-50GB); I kinda find it hard to believe there is that little capacity available. I just want to make sure this is not a bug.

Thanks!

We’re working on it! Regions have multiple workers (individual servers), and if a worker is filled up, you can’t fork the volume. There are quite a few workers that are Just Full, but previously, nobody would’ve ever run into issues from it because our backend would just provision the new volume on a worker that has space.
Hang tight, we’ll have something to announce soon!
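For the curious: the migration forks each volume, so trying it by hand runs into the same wall. A rough sketch (the app name and volume ID are placeholders; check fly volumes --help for the exact subcommands in your flyctl version):

    # List your app’s volumes and note the ID
    fly volumes list -a my-app

    # Forking copies the volume; on a full worker, this is the step that fails
    fly volumes fork vol_0example123 -a my-app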


Hi @allison - thanks for the update. So in the meantime we should just hold tight? Or is this something that we should continue to retry through the day in the hopes that there is capacity that frees up? Is this something that fluctuates depending on how many apps are concurrently migrating or is this something that is full and will remain full until an update / announcement is made?

Just want to make sure we manually migrate our apps to V2 before the automatic migration.

When we have a solution, it’ll be announced on this forum first. I think we’re all a little antsy to get the show on the road, so I can’t promise it’ll be announced long before the automated migrations start*, but it will be announced here in advance.

I wouldn’t hold my breath for more space to show up. Theoretically, it could, but it’s unlikely.

*Automated migrations could possibly start before this feature is announced, but we won’t be migrating apps with volumes until we’ve worked all of this out. Everything we’re using to migrate in the backend is the same stuff we’re exposing to all of you, so if you’re blocked by the feature not supporting your use case, we’d be blocked by the same thing.


Update! We’re ready to start migrating the simplest apps to Apps V2, in small batches. You’ll get an email if we migrate your app.


Hi!

Is there an email list or something where changes like this get announced? I haven’t had time to keep up and just got a bunch of emails saying things had migrated.

(They all migrated successfully - it was just an unexpected “hey something changed”).

v2 looks cool though!

Right now it’s just the forum. For what it’s worth, most product updates are announced in the “Fresh Produce” category of the forum, so if you subscribe to email notifications for that category it might be helpful (albeit a little noisy, we post a lot!)

If you want to do that, you can

  1. Click this link to view the “Fresh Produce” category on this forum.
  2. Click the subscription bell in the top right corner, and select “Watching First Post”.
     (screenshot: the category notification level settings menu)

A few of our apps have now been successfully converted to V2 - looks good!

Are we the only ones having problems with logging? For a few apps, the app itself seems to be doing all right, but the dashboard and/or flyctl logs only show old V1 logs.


@allison - it appears I have apps that have been automatically migrated to V2 overnight. However, I suspect one of them has failed the migration and is now in some sort of strange state (re: Fly’s back-end).

  1. As far as I can tell (is this a pre-req to/part of the migration process?), the scale count has been reduced to 1; it was previously 2.
  2. I noticed it was reported as suspended, so I restarted it. No dice - still suspended, although it appears it is actually up and working.
  3. I’m not sure if suspended is a valid state for Fly-Nomad app status (is the Nomad equivalent dead)? The platform for the app is still reported as being nomad.
  4. Logs for the app appear to have stopped since the (I assume) failed automated migration.

Though I appreciate the business and reputational imperative in force-migrating people to Apps V2, in the interests of transparency I think it would be useful if Fly could create a new post documenting known feature shortfalls between the two (example from this thread: “bluegreen strategy, with health checks”).

Another example (TBC): with Nomad I believe it was possible, albeit subject to Fly-Nomad deficiencies (potential 15 minute delay/etc), to have a single always (read: mostly) available instance. In the event of hardware failure Nomad would automatically move the VM to another host(?). I understand that machine instances are tied to hardware - I’m not sure if Apps V2 supports this use case? It would need two machines - but with Fly’s in-house orchestration ensuring only one is running at any point in time, and without reliance on external connections to trigger the proxy to bring up the 2nd/backup machine.

I’ll refrain from speculating too much, but it does sound like we tried to auto migrate your app and it failed.

For a little context, “suspended” is real old terminology from when you could suspend/resume Nomad apps, and when machines was first built, the “suspended” flag was overloaded to mean a machines app with no machines. During migration, at some point the app had no machines, so that flag got set. It just never got unset when things failed and the migration tried to restore your app to the previous state. You should be able to run fly resume <appname> to get it back in working order.

We’ve seen a few people mention logs being strange after migration attempts. We’re looking into it.

Bluegreen is sadly not supported right now. It’s being looked into, but honestly, most of the people working on the apps platform are working on making sure this migration goes well right now. In the meantime, we do support canary deployments, which are pretty close!

As for having a single instance of an app that is relatively resilient, you might be looking for standby machines? Essentially, these are machines that are pointed at another machine, and turn on when their target machine is unreachable. Two caveats here, though: I don’t know what happens when/if the original machine comes back up, and they only get added for processes that do not expose a service.
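If you want to experiment with standbys, the rough shape is a machine clone - the machine ID and app name here are placeholders, and it’s worth double-checking fly machine clone --help for the exact flag on your flyctl version:

    # Create a copy of a machine that turns on if the original becomes unreachable
    fly machine clone 3d8d9014b32d89 --standby-for 3d8d9014b32d89 -a my-app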

I already asked about this before, but never got a reply. How do I get rid of this notification?

I neither want nor need more than 1 instance.

And how do I scale down to 0 and back to 1 machine?

I simply run flyctl scale count 0 and then flyctl scale count 1, which immediately fails with this message:

Error: there are no active machines for this app. Run fly deploy to create one and rerun this command

I tried running flyctl deploy a couple of times, but every time I lose my machine settings. E.g., I have a machine with 512MB of RAM, scale down to 0, deploy, and it’s restarted on a machine with 256MB of RAM, which is not enough for startup and gets stuck until I scale the RAM back up manually.

There is a fix for this in the pipeline which should be available in the next flyctl release.
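In the meantime, a workaround sketch using standard flyctl commands (adjust the size to whatever your app needs):

    # For now, the deploy recreates the machine at the default memory size
    flyctl deploy

    # Then bump memory back up so the app can start
    flyctl scale memory 512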


Thanks, one more thing: when deploying from scratch, the tool automatically creates two instances.

This used to start just one instance - why did it change, and why is it the default now? How do I change it back?