Lack of capacity? No problem, we have you covered

Apologies for the catchy title. The true story behind it is much more boring, but it makes a difference in difficult times.

Since last week we have been testing a new capability. You may have already heard about machine migrations and capacity rebalances, where we move machines to a new host to ensure they can start successfully. These are triggered under extreme conditions, such as a host degrading due to hardware failure, or load reaching the point where it starts affecting other tenants.

But sometimes those conditions aren’t met, and yet dormant machines waiting for incoming requests fail to start due to lack of capacity. For example, stopped GPU machines may not start because their host has no GPU cards available at that point in time.

To overcome this limitation, we’re enabling auto-migration: on start, a machine is moved from its current host to another host with idle resources. This applies to GPU machines and to all machines without attached volumes. Non-GPU machines with volumes attached won’t be automatically migrated by this capability (to avoid potential issues with Postgres).

That’s it. You don’t have to do anything; it works behind the scenes to ensure your app is up when needed.
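
For anyone who prefers to read the eligibility rule as code, here is a minimal sketch of it in Python. It only restates the rule described above; the `Machine` class, its fields, and the function name are hypothetical and do not reflect any real API or the actual scheduler implementation.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    # Hypothetical fields, used only to illustrate the rule.
    has_gpu: bool
    has_volume: bool


def eligible_for_auto_migration(machine: Machine) -> bool:
    """Return True if the machine may be auto-migrated on start.

    GPU machines are eligible; non-GPU machines are eligible only
    when they have no volume attached.
    """
    return machine.has_gpu or not machine.has_volume
```

Under these assumptions, the only combination that is excluded is a non-GPU machine with a volume attached.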


Amazing! Thank you for the update.

Two questions come to my mind:

  1. Do auto-migrated machines stay in the same region, or may they be auto-migrated to another region?
  2. Is there or will there be a way to auto-migrate non-GPU machines with volumes attached that are not Postgres machines?

@amo glad you liked it.

  1. Yes, they stay in the same region. Auto-migrations always happen within the same region.
  2. Not yet, but we plan to revisit this decision at some point.