I'd like to add a few things to this thread.
We (a fintech) have successfully trialed running trained TensorFlow models on Fly CPU instances, which, as you would expect, works without issues.
However, for some of our models, a CPU alone is simply not enough to reach acceptable performance.
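For concreteness, this is the shape of workload I mean: pure inference on an already-trained model, pinned to CPU. A minimal sketch, using a tiny stand-in Keras model rather than one of our actual exported models (which you'd load with `tf.keras.models.load_model`):

```python
import tensorflow as tf

# Stand-in model; in production this would be
# tf.keras.models.load_model("<exported-model-path>").
model = tf.keras.Sequential([
    tf.keras.Input(shape=(16,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Pin the forward pass to CPU, which is all a Fly instance offers today.
with tf.device("/CPU:0"):
    batch = tf.random.uniform((32, 16))  # 32 samples, 16 features
    predictions = model(batch)

print(predictions.shape)  # (32, 1)
```

For small models this is perfectly serviceable; the problem is purely throughput and latency once the models grow.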
It should be quite obvious that actually training models is not what Fly is meant for; trying to take on Gradient or Lambda Labs would be nonsense when a $10 Colab subscription gives you a P100 to train on. But deploying these trained models into redundant production is something that I believe could propel your infrastructure to new spheres, provided it isn't completely overpriced.
I’m the CTO (and use Fly personally) and have had the “pleasure” of exploring other options for hosting our trained models. On that note, let me add that I would absolutely love to use Fly for our business instead, and specifically comment on this reply:
> They’re a hard problem because GPUs aren’t meant for multitenancy
The NVIDIA A100 is in fact designed for multi-instance/virtualisation environments: its Multi-Instance GPU (MIG) feature can partition a single card into isolated slices. I mention this specifically because another option I have looked at, Vultr, offers VM instances with GPU shares allocated using this capability. The cheapest option ($90, if I remember correctly) allocates 1/20th of an A100 80GB to the VM, and the bigger plans add more fractional shares.
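I can’t speak to exactly how Vultr provisions these, but as a rough sketch, here is how a guest can verify it has been handed a MIG slice rather than a whole card, using the nvidia-ml-py bindings (device index 0 and the partition layout are assumptions about the host, not claims about Vultr’s setup):

```python
# pip install nvidia-ml-py
import pynvml

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)

# Is MIG mode enabled on this device?
current, pending = pynvml.nvmlDeviceGetMigMode(gpu)
print("MIG enabled:", current == pynvml.NVML_DEVICE_MIG_ENABLE)

# Enumerate the MIG slices carved out of the physical card
# (e.g. 1g.10gb instances on an A100 80GB).
for i in range(pynvml.nvmlDeviceGetMaxMigDeviceCount(gpu)):
    try:
        mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(gpu, i)
    except pynvml.NVMLError:
        continue  # slot not populated
    mem = pynvml.nvmlDeviceGetMemoryInfo(mig)
    print(f"MIG slice {i}: {mem.total // 2**20} MiB")

pynvml.nvmlShutdown()
```

The point being: the isolation and per-slice memory accounting needed for multitenancy is already built into the hardware and driver stack.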
I can also assure you that, in general, models will not be able to run acceptably on CPUs for much longer; GPUs are fast becoming necessary by default. If you can match this pricing to some degree while still offering the somewhat unique benefits of your infrastructure, I can 100% guarantee you will have unimaginable demand lining up at your door very soon. I know of one business personally that would migrate immediately.
There’s a very obvious and distinct gap in the market here; you could get your foot in the door and capture some first-mover advantage.