Occasionally our shared-CPU instances experience extremely high CPU steal (70%+) for extended periods (hours to days). This is relatively rare and is easily fixed by destroying the problem machine and cloning a new one. Our apps are single-process, so I’d guess this is not an “us” issue. While this hasn’t caused us much trouble, we’d love to know a bit more about what’s going on:
- Is steal introduced by other tenants on the host, or are there other causes (e.g. Fly processes, bad hardware, etc.)?
- Why does destroy/clone fix the issue (e.g. do we get a new host)?
- Is this something that is expected with shared CPUs under the current architecture?
- Does Fly have any plans to mitigate this or automate the fix (e.g. perform a machine migration for us if steal remains above a certain threshold)?
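
For anyone who wants to check the same number from inside a machine, here is a minimal sketch of how steal can be computed, assuming a Linux guest that exposes the standard aggregate `cpu` line in `/proc/stat`. The function names (`parse_cpu_line`, `steal_percent`, `sample_steal`) and the 5-second interval are made up for illustration; this is not a Fly API or necessarily how any particular monitoring tool does it.

```python
import time


def parse_cpu_line(line):
    """Return (total_jiffies, steal_jiffies) from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in fields[1:]]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice.
    # guest and guest_nice are already included in user and nice, so only the
    # first eight counters are summed to avoid double counting.
    counted = values[:8]
    steal = counted[7] if len(counted) > 7 else 0
    return sum(counted), steal


def steal_percent(before, after):
    """Percentage of CPU time reported as steal between two 'cpu' line samples."""
    total_before, steal_before = parse_cpu_line(before)
    total_after, steal_after = parse_cpu_line(after)
    delta_total = total_after - total_before
    if delta_total <= 0:
        return 0.0
    return 100.0 * (steal_after - steal_before) / delta_total


def sample_steal(interval_s=5.0, path="/proc/stat"):
    """Take two samples interval_s seconds apart (hypothetical helper, Linux only)."""
    with open(path) as f:
        first = f.readline()
    time.sleep(interval_s)
    with open(path) as f:
        second = f.readline()
    return steal_percent(first, second)
```

Calling `sample_steal()` periodically and comparing the result against a threshold (for example the 70% mentioned above) is roughly the kind of check an automated migration would need.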
Thanks!