Apologies for the catchy title. The true story behind it is much more boring, but it makes a real difference in difficult times.
Since last week we have been testing a new capability. You may have heard about machine migrations and capacity rebalances already, where we move machines to a new host to ensure they can start successfully. These are triggered under extreme conditions, such as a host degrading due to hardware failure, or load reaching a point where it starts affecting other tenants.
But sometimes those conditions aren’t met, and yet dormant machines waiting for incoming requests fail to start due to lack of capacity. For example, stopped GPU machines may not start because there are no GPU cards available on their host at that point in time.
To overcome this limitation, we’re enabling auto-migration: when a machine starts, we move it from its current host to another host with idle resources. This applies to GPU machines and to all machines without attached volumes. Non-GPU machines with volumes attached won’t be automatically migrated by this capability (to avoid potential issues with Postgres).
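To make the eligibility rule above concrete, here is a minimal sketch in Python. It is only an illustration of the rule as described in this post, not our actual implementation: the `Machine` type, its field names, and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    """Hypothetical model of a machine; field names are assumptions."""
    has_gpu: bool
    has_volume: bool


def eligible_for_auto_migration(machine: Machine) -> bool:
    """GPU machines are eligible; other machines only when no volume is attached."""
    return machine.has_gpu or not machine.has_volume


def should_migrate_on_start(machine: Machine, host_has_capacity: bool) -> bool:
    """Migrate only when the current host cannot start the machine.

    `host_has_capacity` is a hypothetical flag standing in for whatever
    capacity check the platform performs at start time.
    """
    return (not host_has_capacity) and eligible_for_auto_migration(machine)
```

Under these assumptions, a non-GPU machine with a volume (a typical Postgres machine) is never moved, while a stopped GPU machine whose host has no free GPU cards would be.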
That’s it. You don’t have to do anything; it works behind the scenes to ensure your app is up when needed.