VM size planning

We’re exploring GPUs! They’re a hard problem because GPUs aren’t meant for multitenancy, so you have to figure out how to allocate an entire GPU to a VM. That also means they’re expensive.

Right now, we have two classes of VMs: shared-cpu and dedicated-cpu. We’re planning to release:

  • Larger shared CPU VMs, probably shared-cpu-2x, 4x, and 8x. These will give you shared access to up to 8 CPUs. They might be pretty good for workloads with bursty CPU.
  • More dedicated CPUs. Right now we only go to 8. We could theoretically go to 24.

We’ve also been considering more permanent “reservations” for dedicated CPU VMs.

What kinds of things could you all use?


It could be really nice to have a CPU option optimized for background/offline jobs, de-prioritized so that it only soaks up otherwise-unused CPU capacity. There must be some kind of cgroup magic that can be done to achieve that. Just to be clear, this would only make sense if that CPU option were also cheaper.
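A minimal sketch of the sort of cgroup magic I mean, assuming cgroup v2 is mounted at /sys/fs/cgroup (the group name and paths here are made up):

```python
# Sketch only: put background jobs in a cgroup that yields to everything else.
# Assumes cgroup v2 at /sys/fs/cgroup with the cpu controller enabled.
from pathlib import Path

CG = Path("/sys/fs/cgroup/background-jobs")  # hypothetical group name

def make_background_cgroup() -> None:
    CG.mkdir(exist_ok=True)
    # Minimum scheduling weight (range 1-10000, default 100): the group
    # only gets CPU time that higher-weight groups aren't using.
    (CG / "cpu.weight").write_text("1")
    # Kernels >= 5.15 go further: cpu.idle=1 puts the whole group in the
    # SCHED_IDLE class, i.e. strictly leftover capacity.
    idle = CG / "cpu.idle"
    if idle.exists():
        idle.write_text("1")

def move_pid(pid: int) -> None:
    # Move an existing process into the group; its new children follow it.
    (CG / "cgroup.procs").write_text(str(pid))
```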

If you could “innovate” even further and make that CPU class only charge for core-seconds actually consumed, that would be amazing.
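The metering half already exists in the kernel, for what it’s worth: cgroup v2’s cpu.stat keeps a cumulative usage counter per group, so charging for consumed core-seconds is “just” billing plumbing. A sketch:

```python
# Sketch: read cumulative CPU consumption for a cgroup from cpu.stat.
# usage_usec is total core-microseconds used by everything in the group.
from pathlib import Path

def core_seconds(cgroup: Path) -> float:
    for line in (cgroup / "cpu.stat").read_text().splitlines():
        key, value = line.split()
        if key == "usage_usec":
            return int(value) / 1_000_000  # core-microseconds -> core-seconds
    raise RuntimeError("usage_usec not found in cpu.stat")

# A billing loop would sample this periodically and charge on the delta.
print(core_seconds(Path("/sys/fs/cgroup/background-jobs")))
```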

  • For CI, having a dedicated-cpu VM with a high CPU count would improve build speed.
  • A 4GB-memory shared-cpu option would be great for a few of our services that aren’t used often but load large models.
  • For what it’s worth, the GPU is really not a big priority for us, so don’t count it as a strong vote from me.

Is there an update on when larger shared CPU VMs might become available, e.g. shared-cpu-2x?

I would use GPUs to train my machine learning models, for example on a recurring basis.

We have larger shared-cpu-8x VMs available with Fly Machines. The apps you’re used to don’t have access to them yet, but we’re working on that!
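If you want to poke at one now, you can create it through the Machines API directly. Roughly like this; the app name and image are placeholders, and check the Machines docs for the exact endpoint and guest fields:

```python
# Sketch: create a shared-cpu-8x Machine via the Machines API.
# App name, image, and exact field names here are illustrative.
import os
import requests

resp = requests.post(
    "https://api.machines.dev/v1/apps/my-app/machines",  # hypothetical app
    headers={"Authorization": f"Bearer {os.environ['FLY_API_TOKEN']}"},
    json={
        "config": {
            "image": "nginx:latest",
            "guest": {"cpu_kind": "shared", "cpus": 8, "memory_mb": 2048},
        }
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["id"])
```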

GPUs are on my long term wishlist. They’re a hard problem, though, and it’ll take at least a year before we do any GPU work.

Even then, I don’t think we’re going to be an ideal place to train models. We’ll be a great place to do inference on already-trained models, though. 🙂


I’d like to add a few things on this note.

We (a fintech) have successfully trialed running trained TensorFlow models on Fly CPU instances, which, as you’d expect, works without issues.
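For context, the serving side is nothing exotic; roughly this, where the model path and the signature’s input name are placeholders rather than our actual export:

```python
# Sketch: CPU-only inference against an exported TensorFlow SavedModel.
# The path, input name, and shape are placeholders, not our real model.
import numpy as np
import tensorflow as tf

model = tf.saved_model.load("/models/scorer")  # hypothetical export dir
infer = model.signatures["serving_default"]

batch = tf.constant(np.random.rand(32, 128).astype("float32"))
outputs = infer(inputs=batch)  # keyword must match the signature's input name
print({k: v.shape for k, v in outputs.items()})
```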

However, due to the nature of some of our models, CPU alone is simply not enough to reach acceptable performance.

It should be quite obvious that actually training models is not what Fly is meant for; trying to take on Gradient or Lambda Labs would be nonsense when a $10 Colab subscription gives you a P100 to train on. But deploying these models into redundant production is something that I believe could propel your infrastructure to new spheres, as long as it isn’t completely overpriced.

I’m the CTO (and use Fly personally) and have had the “pleasure” of exploring other options for our trained models. On that note, let me add that I would absolutely love to use Fly for our business instead of something else, and I’d like to specifically comment on this reply:

They’re a hard problem because GPUs aren’t meant for multitenancy

The NVIDIA A100 is actually designed to be shared in multi-instance/virtualisation environments (NVIDIA calls this MIG, Multi-Instance GPU). I’m specifically mentioning this because another option I have looked at, Vultr, offers VM instances with GPU shares that are allocated using this feature. The cheapest option, $90 if I remember correctly, allocates 1/20th of an A100 80GB to the VM, with the bigger plans adding more fractional shares.

I can also assure you that models in general won’t stay runnable on CPUs for much longer; GPUs are quickly becoming necessary by default. If you can match this pricing to some degree, while still offering the somewhat unique benefits of your infrastructure, I can 100% guarantee that you will have unimaginable demand lining up at your door very soon. I know of one business personally that would migrate immediately.

There’s a very obvious and distinct gap in the market here; you could definitely get your foot in the door and capture some of the first-mover advantage.
