[application name] ran out of memory and crashed

Hello.

On 12 Aug we started getting an email every day saying:

Hello! Your “shields-io-production” application hosted on Fly.io crashed because it ran out of memory. Adding more RAM to your application might help!

The email seems to arrive somewhere between midnight and 2am UTC.

I have a few questions about this:

  1. What has changed? Did our application suddenly start throwing memory errors on 12 Aug, or did Fly start sending out a notification email about it on 12 Aug (implying we might have actually been throwing them every day for some time before this)?
  2. I would expect that the thing that would run out of memory would be one microVM, rather than the application as a whole, but the email we get says the application crashed because it ran out of memory. What actually ran out of memory? Was it one (or more) microVMs or did the application as a whole exceed some composite limit?
  3. My first thought was that these emails were telling me that a particular event was happening every night and causing a single out-of-memory error, so I tried to isolate the event corresponding to the email timestamps. Running flyctl vm status <instance-id> for a bunch of our VMs shows that there is actually not just one OOM error happening. I’m seeing several with completely different timestamps from the email, e.g.:
    • 2022-08-16T13:02:58Z Terminated OOM Killed
    • 2022-08-16T07:52:41Z Terminated OOM Killed
    • 2022-08-16T09:56:57Z Terminated OOM Killed
      I now wonder if what this email is actually trying to tell me is something more like “one or more VMs caused an OOM error in the last 24 hours” rather than “an OOM error happened at this specific time”. Could someone clarify that?

I suspect some of this might just be feedback on how to make the email notifications a bit clearer.

Thanks

Happy to provide some additional context here. Thank you for asking!

What has changed? Did our application suddenly start throwing memory errors on 12 Aug, or did Fly start sending out a notification email about it on 12 Aug (implying we might have actually been throwing them every day for some time before this)?

We did start sending these emails out recently. It’s quite possible that your app was running into intermittent memory issues previously. When an app crashes (for instance, by OOMing), our orchestration layer will restart it.

I would expect that the thing that would run out of memory would be one microVM, rather than the application as a whole, but the email we get says the application crashed because it ran out of memory. What actually ran out of memory? Was it one (or more) microVMs or did the application as a whole exceed some composite limit?

These emails can be triggered when even a single app instance OOMs. You can further inspect your application with tools like fly status --all and fly logs -i. The former will display all of your app’s instances from the past 7 days, and the latter can likewise retrieve per-instance logs within that timeframe.
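If you want to see how the OOM kills are spread over time, a small script can tally them per day. This is only a sketch: it assumes you have copied event lines by hand in the "<timestamp> Terminated OOM Killed" shape quoted in the question, and parse_oom_events and ooms_per_day are hypothetical helper names, not part of flyctl.

```python
from collections import Counter
from datetime import datetime, timezone


def parse_oom_events(lines):
    """Return sorted UTC datetimes for lines shaped like
    '<timestamp> Terminated OOM Killed' (format assumed from the post above)."""
    events = []
    for line in lines:
        # Drop surrounding whitespace and any leading bullet character.
        parts = line.strip().lstrip("•").split(maxsplit=1)
        if len(parts) == 2 and "OOM" in parts[1]:
            ts = datetime.strptime(parts[0], "%Y-%m-%dT%H:%M:%SZ")
            events.append(ts.replace(tzinfo=timezone.utc))
    return sorted(events)


def ooms_per_day(events):
    """Count OOM events per UTC calendar day."""
    return Counter(e.date().isoformat() for e in events)


sample = [
    "2022-08-16T13:02:58Z Terminated OOM Killed",
    "2022-08-16T07:52:41Z Terminated OOM Killed",
    "2022-08-16T09:56:57Z Terminated OOM Killed",
]
events = parse_oom_events(sample)
print(ooms_per_day(events))  # Counter({'2022-08-16': 3})
```

Several kills on the same day, as in the sample, would still correspond to a single email.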

I now wonder if what this email is actually trying to tell me is something more like “one or more VMs caused an OOM error in the last 24 hours” rather than “an OOM error happened at this specific time”. Could someone clarify that?

Yup! Since certain types of memory errors can be triggered quite frequently, we’ve configured things so that you’ll only get one of these per app, per day. That way, you can know to check things out without being inundated with email alerts :sweat_smile:
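To make the "one per app, per day" behaviour concrete, here is a rough model of that kind of throttling. It is not Fly's actual implementation: the DailyAlertLimiter class is hypothetical, and it assumes "per day" means a UTC calendar day rather than a rolling 24-hour window, which this thread does not specify.

```python
from datetime import datetime, timezone


class DailyAlertLimiter:
    """Hypothetical throttle: allow at most one alert per app per UTC day."""

    def __init__(self):
        self._sent = set()  # (app name, UTC date) pairs already alerted

    def should_send(self, app, when):
        key = (app, when.astimezone(timezone.utc).date())
        if key in self._sent:
            return False  # already alerted for this app today
        self._sent.add(key)
        return True


limiter = DailyAlertLimiter()
oom_times = [
    datetime(2022, 8, 16, 7, 52, 41, tzinfo=timezone.utc),
    datetime(2022, 8, 16, 9, 56, 57, tzinfo=timezone.utc),
    datetime(2022, 8, 16, 13, 2, 58, tzinfo=timezone.utc),
    datetime(2022, 8, 17, 0, 30, 0, tzinfo=timezone.utc),
]
decisions = [limiter.should_send("shields-io-production", t) for t in oom_times]
print(decisions)  # [True, False, False, True]
```

Under this model the email timestamp only marks the first OOM of the day, so later kills on the same day produce no further email.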

Thanks for the clarifications - that all makes sense :+1:

Hi,

We’re seeing a few of these too. We’ve provisioned more RAM but are still getting the emails, and the charts look fine.

This is on the app concordia-production-web.