On 12 Aug we started getting a daily email saying:
Hello! Your “shields-io-production” application hosted on Fly.io crashed because it ran out of memory. Adding more RAM to your application might help!
The email seems to arrive somewhere between midnight and 2am UTC.
I have a few questions about this:
- What has changed? Did our application suddenly start throwing memory errors on 12 Aug, or did Fly start sending out a notification email about them on 12 Aug (implying we might actually have been throwing them every day for some time before this)?
- I would expect that the thing that would run out of memory would be one microVM, rather than the application as a whole, but the email we get says the application crashed because it ran out of memory. What actually ran out of memory? Was it one (or more) microVMs or did the application as a whole exceed some composite limit?
- My first thought was that these emails were telling me that a particular event was happening every night and causing a single out-of-memory error, so I tried to isolate the event corresponding to the email timestamps. Running
flyctl vm status <instance-id> for a bunch of our VMs shows that there is actually not just one OOM error happening. I’m seeing several, with completely different timestamps from the email, e.g.:
2022-08-16T13:02:58Z Terminated OOM Killed
2022-08-16T07:52:41Z Terminated OOM Killed
2022-08-16T09:56:57Z Terminated OOM Killed
I now wonder if what this email is actually trying to tell me is something more like “one or more VMs caused an OOM error in the last 24 hours” rather than “an OOM error happened at this specific time”. Could someone clarify that?
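For anyone wanting to repeat the comparison, here is a minimal sketch of the check I did by eye. It takes the three timestamps listed above and tests whether any of them fall inside the midnight to 2am UTC window in which the email arrives. The helper names (`parse_utc`, `in_email_window`) are my own for illustration, and the window bounds are only my rough observation, not anything documented by Fly.

```python
from datetime import datetime, timezone

# OOM timestamps copied from the VM status output quoted above.
oom_events = [
    "2022-08-16T13:02:58Z",
    "2022-08-16T07:52:41Z",
    "2022-08-16T09:56:57Z",
]


def parse_utc(stamp: str) -> datetime:
    """Parse a 'YYYY-MM-DDTHH:MM:SSZ' string into an aware UTC datetime."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def in_email_window(ts: datetime, start_hour: int = 0, end_hour: int = 2) -> bool:
    """True if ts falls in the assumed email window (midnight to 2am UTC)."""
    return start_hour <= ts.hour < end_hour


parsed = sorted(parse_utc(s) for s in oom_events)
outside = [ts for ts in parsed if not in_email_window(ts)]

print(f"{len(outside)} of {len(parsed)} OOM events fall outside the email window")
for ts in outside:
    print(ts.isoformat())
```

None of the three events land in the window, which is what makes me think the email is a daily summary rather than a notification tied to one specific crash.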
I suspect some of this might just be feedback on how to make the email notifications a bit clearer.