High CPU steal

Occasionally our shared-CPU instances experience extremely high steal (70%+) for extended periods (hours to days). This is relatively rare and is easily fixed by destroying the problem machine and cloning a new one. Our apps are single-process, so I’d guess this is not an “us” issue. While this hasn’t caused us much trouble, we’d love to know a bit more about what’s going on:

  • Is steal introduced by other tenants on the host, or are there other causes (e.g. Fly processes, bad hardware, etc.)?
  • Why does destroy/clone fix the issue (e.g. do we get a new host)?
  • Is this something that is expected with shared CPUs under the current architecture?
  • Does Fly have any plans to mitigate this or automate the fix (e.g. perform a machine migration for us if steal remains above a certain threshold)?
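
For anyone who wants to check their own machines: on Linux, steal can be derived from the aggregate `cpu` line of `/proc/stat` by sampling it twice and comparing the deltas. The sketch below is only an illustration of that idea, not necessarily how our numbers were collected; the function names, the 5-second interval, and the 70% threshold are assumptions made for the example.

```python
import time


def parse_cpu_line(line):
    """Return (steal, total) in jiffies from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in fields[1:]]
    if len(values) < 8:
        raise ValueError("this kernel does not report steal time")
    # Field order: user nice system idle iowait irq softirq steal [guest guest_nice]
    # guest and guest_nice are already included in user and nice, so only the
    # first eight fields are summed to avoid double counting.
    return values[7], sum(values[:8])


def steal_percent(before, after):
    """Percentage of CPU time reported as steal between two 'cpu' line samples."""
    steal_before, total_before = parse_cpu_line(before)
    steal_after, total_after = parse_cpu_line(after)
    delta_total = total_after - total_before
    if delta_total <= 0:
        return 0.0
    return 100.0 * (steal_after - steal_before) / delta_total


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat, which aggregates all CPUs."""
    with open(path) as stat_file:
        return stat_file.readline()


if __name__ == "__main__":
    THRESHOLD = 70.0  # assumed threshold, matching the 70%+ figure in this post
    first = read_cpu_line()
    time.sleep(5.0)  # assumed sampling interval
    second = read_cpu_line()
    pct = steal_percent(first, second)
    status = "HIGH" if pct >= THRESHOLD else "ok"
    print(f"steal: {pct:.1f}% ({status})")
```

Running this periodically and alerting when the value stays above the threshold is roughly the kind of check we imagine could trigger an automated migration.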

Thanks!
