Slow/unresponsive Machines in FRA across multiple apps

Hi Fly team,

I’m seeing slow or stuck behavior across multiple Fly apps in fra.

This is not just one app and not just provisioning. All normal app actions are slow. I tested from two different client machines.

Symptoms:

  • Public route may respond, but actions inside the app hang or take a long time.

  • One request connected over TLS but received no bytes for 60s.

  • Logs show event-loop delay and delayed fetch timers.

  • One Machine showed very high CPU steal in top:

%Cpu(s): … 90.9 st
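
For anyone wanting to check the same figure without top, the steal share can be estimated from two samples of the aggregate "cpu" line in /proc/stat. The following is a minimal sketch, assuming the standard Linux field order (user, nice, system, idle, iowait, irq, softirq, steal); the function names are hypothetical and not part of any Fly tooling.

```python
import time


def parse_cpu_line(line: str) -> list[int]:
    """Return the numeric fields of the aggregate "cpu ..." line from /proc/stat."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    fields = [int(p) for p in parts[1:]]
    if len(fields) < 8:
        raise ValueError("kernel does not report a steal field")
    return fields


def steal_percent(before: str, after: str) -> float:
    """Percentage of CPU time reported as steal between two samples."""
    a, b = parse_cpu_line(before), parse_cpu_line(after)
    # Field order: user nice system idle iowait irq softirq steal [guest guest_nice].
    # Guest time is already included in user/nice, so only the first eight are summed.
    deltas = [y - x for x, y in zip(a[:8], b[:8])]
    total = sum(deltas)
    return 100.0 * deltas[7] / total if total > 0 else 0.0


def read_cpu_line(path: str = "/proc/stat") -> str:
    """Read the first line of /proc/stat (Linux only)."""
    with open(path) as f:
        return f.readline()


def sample_steal(interval: float = 1.0) -> float:
    """Take two samples `interval` seconds apart and return the steal percentage."""
    first = read_cpu_line()
    time.sleep(interval)
    return steal_percent(first, read_cpu_line())
```

Calling sample_steal() inside the Machine a few times in a row should show whether the steal value is sustained or just a momentary spike.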

Example log lines:

liveness warning: reasons=event_loop_delay,event_loop_utilization
fetch timeout after 10000ms ... timer delayed ... likely event-loop starvation
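
To illustrate what "timer delayed" means in these lines: a timer that fires later than scheduled is a sign the event loop did not get to run on time. The thread does not say which runtime the apps use, so this is only a sketch using Python's asyncio as a stand-in, with hypothetical function names. It measures how late each sleep wakes up while another task blocks the loop.

```python
import asyncio
import time


async def measure_timer_lag(interval: float, samples: int) -> list[float]:
    """Sleep `interval` seconds `samples` times; return how late each wake-up was."""
    lags = []
    for _ in range(samples):
        start = time.monotonic()
        await asyncio.sleep(interval)
        lags.append(max(0.0, time.monotonic() - start - interval))
    return lags


async def blocking_work(seconds: float) -> None:
    """Stand-in for starvation: time.sleep blocks the whole event loop."""
    time.sleep(seconds)


async def demo() -> float:
    """Run the lag probe alongside a blocking task and return the worst lag seen."""
    lags, _ = await asyncio.gather(
        measure_timer_lag(interval=0.01, samples=3),
        blocking_work(0.05),
    )
    return max(lags)
```

Here the delay is caused on purpose inside the process. With high CPU steal the cause would be outside the process, but the symptom (timers firing late) would look the same from the application's side.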

All affected Machines are in fra and use shared CPU.

Can someone from Fly check whether there is host pressure or a FRA issue affecting shared CPU Machines?

Thanks.

Thanks for reaching out!
You didn’t say which of your apps are affected, but I checked and found one that’s being heavily CPU-throttled. You’d need to either scale to a larger Machine or reduce the amount of work the app does (have it accept fewer requests, optimize the code; there are many ways to do this).

CPU throttling is explained here, and you can see metrics about this in your dashboard. Go to the app, open Metrics, click on the Grafana icon and go to “fly instance” to see how CPU is faring on each of your Machines.

Thanks for the reply.
Which app? (Please specify only the last few characters.)

Thanks

Hi, it’s the one ending in cb92fc.

Sorry, the link to the CPU throttling explanation is this: https://fly.io/docs/machines/cpu-performance/