Machine silently stuck in "failed" state

I have an app with one machine that responds to a few webhooks, with scale-to-zero enabled. Very light stuff; the app uses little memory. Suddenly it stopped working, and when I checked the logs I saw that the machine had failed to start because of “insufficient memory”. After a few attempts it got stuck in a “failed” state. I fixed the issue by destroying the machine and creating a new one (`fly scale count 0` and then `fly scale count 1`).

The thing is, I didn’t change anything in the app. It is the same deploy that had been working properly: it stopped working, then started working again after I recreated the machine, with no changes in between.

I know that I should have at least 2 machines per app but regardless, I have a few questions/concerns:

  • Why did this happen? It is the same deploy, with the same memory consumption and the same config, that had been working properly and then suddenly got stuck in a “failed” state.
  • Can I be notified somehow when a machine is stuck in a failed state, without constantly checking the monitoring page?
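To make the second question concrete, here is a minimal sketch of the kind of check I could run from cron as a stopgap. It assumes the machine list is available as a JSON array of objects with `id` and `state` fields (for example from `fly machine list --json`, if that flag is available in your flyctl version) and that a stuck machine reports the state string `"failed"`. The function name `find_failed_machines` is hypothetical, and none of this is an official Fly.io alerting mechanism.

```python
import json


def find_failed_machines(machines_json, bad_states=("failed",)):
    """Return the IDs of machines whose state is one of bad_states.

    machines_json: a JSON array of machine objects. The "id" and "state"
    field names are assumptions about the shape of the machine list.
    """
    machines = json.loads(machines_json)
    return [
        machine.get("id", "<unknown>")
        for machine in machines
        if machine.get("state") in bad_states
    ]


# Illustrative data only, not real output from any tool.
sample = '[{"id": "a1", "state": "started"}, {"id": "b2", "state": "failed"}]'
stuck = find_failed_machines(sample)
if stuck:
    print("Machines needing attention: " + ", ".join(stuck))
```

A wrapper could send an email or chat message instead of printing, but I would still prefer a built-in notification if one exists.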

Thanks! :slight_smile:

