App continuously going offline with no alerts, warnings or errors

Skowt · April 28, 2023, 1:30pm

I’ve got an app that connects to an API and registers event handlers that run whenever events arrive. Because of that, I cannot run the app with more than one instance: every event would then be handled twice. It’s basically a similar setup to the one described in the cron article.

I noticed that as of the 26th, my app has been restarting continuously, typically with a delay between restarts (region lhr, backup ams, 256MB shared-CPU app). For instance, the most recent restart left a gap of 20 minutes where no instance was running. I had no alerts, errors or warnings; I just happened to notice it.

This is a horrible experience and I would never have known that this was happening without manually monitoring the apps page. The overview for the last 7 days shows consistent memory and CPU usage, with a vast number of instance changes happening in between. The logs also don’t show any errors apart from a signal to shut down from the runner, which I didn’t trigger.

I was thinking of perhaps upgrading to v2, since the article mentions that it helps with reliability, but the article also pushes for multiple instances because scaling works differently on the v2 architecture than it did on v1. This seems like it would break my setup and potentially result in more downtime, because Fly will not try to ensure that I always have at least one instance running.

Questions:

Has anyone experienced the same constant restarting even though limits aren’t being reached?
Are there any recommendations on how to improve reliability while still keeping the requirement of having a single instance running at any point in time?

dangra · April 28, 2023, 3:45pm

Are you losing events when the app restarts? If that is the case, you probably want 2+ consumers for that API.

If you can deal with a short period of duplicate event processing, usually by having a unique event ID to dedupe, then porting the app to v2 and using a standby machine can be the solution.

Check this post for an explanation of how standbys work: “Increasing Apps V2 availability”.

Skowt · April 28, 2023, 3:52pm

The issue is that the app stops and then restarts 20 minutes later, so yes, any events that arrive between when it goes offline and when it comes back are lost. There’s no playback mechanism for the API.

I haven’t checked whether there’s an event ID associated with the event actions. I will look into that and see if I can store the IDs in Redis. I’m using sessions, so I hope those won’t break, but deduping would definitely allow me to have a standby machine.

Cheers @dangra for the suggestion
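For anyone landing here later, the dedupe-by-event-ID idea from this thread can be sketched roughly as below. This is only an illustration under assumptions: the `EventDeduper` class, the `handle_event` helper and the `"id"` field on the event are hypothetical names, not part of any API mentioned above. The store here is in-memory so the example runs on its own; for two machines to dedupe against each other the store would have to be shared, and with Redis the same "first claim wins" step is typically done with `SET key value NX EX ttl`.

```python
import time
from typing import Any, Callable, Dict


class EventDeduper:
    """Remembers recently seen event IDs so an event is processed at most once.

    In-memory stand-in for a shared store. Hypothetical helper for illustration;
    a real two-machine setup needs a store both machines can reach.
    """

    def __init__(
        self,
        ttl_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._seen: Dict[str, float] = {}  # event_id -> expiry timestamp

    def claim(self, event_id: str) -> bool:
        """Return True only for the first claim of an event ID within the TTL."""
        now = self._clock()
        # Forget IDs whose TTL has passed so the store does not grow forever.
        for key in [k for k, expiry in self._seen.items() if expiry <= now]:
            del self._seen[key]
        if event_id in self._seen:
            return False
        self._seen[event_id] = now + self._ttl
        return True


def handle_event(
    deduper: EventDeduper,
    event: Dict[str, Any],
    process: Callable[[Dict[str, Any]], None],
) -> bool:
    """Run `process` unless this event ID was already claimed.

    Assumes each event carries a unique "id" field (an assumption; check
    what the upstream API actually provides).
    """
    if not deduper.claim(event["id"]):
        return False  # duplicate delivery, skip it
    process(event)
    return True
```

With something like this in front of each handler, a primary and a standby machine could briefly overlap without acting on the same event twice, provided both consult the same store. The TTL only needs to cover the longest window in which a duplicate could plausibly arrive.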

system · May 5, 2023, 3:52pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.
