App continuously going offline with no alerts, warnings or errors

I’ve got an app that connects to an API and registers event handlers that run as events arrive. Because of that, I can’t run more than one instance, otherwise every event would be handled twice. It’s basically the same setup as the one described in the cron article.

Since the 26th, my app (region lhr, backup ams, 256 MB shared-CPU) has been restarting continuously, typically with a delay between restarts. For instance, the most recent restart left a gap of 20 minutes where no instance was running. I had no alerts, errors or warnings; I just happened to notice it.

This is a horrible experience, and I would never have known it was happening without manually monitoring the apps page. The overview for the last 7 days shows consistent memory and CPU usage, with a vast number of instance changes in between. The logs don’t show any errors either, apart from a shutdown signal from the runner, which I didn’t trigger.

I was thinking of perhaps upgrading to v2, since the article mentions that it helps with reliability, but the article also pushes for multiple instances because scaling on v2 works differently from v1. That seems like it would break my setup and potentially result in more downtime, because Fly will not try to ensure that I always have at least one instance running.

Questions:

  1. Has anyone experienced the same constant restarting even though limits aren’t being reached?
  2. Are there any recommendations on how to improve reliability while still keeping the requirement of having a single instance running at any point in time?

Are you losing events when the app restarts? If that is the case, you probably want 2+ consumers for that API.

If you can deal with a short period of duplicate event processing, usually by having a unique event ID to dedupe, then porting the app to v2 and using a standby machine can be the solution.
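To make the dedupe idea concrete, here is a rough sketch (Python purely for illustration; the `id` field, `make_idempotent` and `in_memory_claim` are hypothetical names, not part of any API mentioned in this thread). The handler is wrapped so it only runs for an event ID that has not been claimed yet. Note the in-memory set only dedupes within one process; with a primary plus a standby, the claim would need to go to a store both machines can reach.

```python
from typing import Callable, Dict, Set

Event = Dict[str, str]  # assumed shape: every event carries a unique "id"


def make_idempotent(
    handler: Callable[[Event], None],
    claim: Callable[[str], bool],
) -> Callable[[Event], bool]:
    """Wrap a handler so it only runs for event IDs that claim() accepts."""

    def wrapped(event: Event) -> bool:
        if not claim(event["id"]):
            return False  # already claimed elsewhere, skip the duplicate
        handler(event)
        return True

    return wrapped


def in_memory_claim() -> Callable[[str], bool]:
    """Claim function backed by a process-local set (illustration only)."""
    seen: Set[str] = set()

    def claim(event_id: str) -> bool:
        if event_id in seen:
            return False
        seen.add(event_id)
        return True

    return claim
```

Claiming before running the handler means a crash mid-handler drops that event rather than repeating it; whether that trade-off is acceptable depends on the app.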

Check this post for an explanation of how standbys work: Increasing Apps V2 availability

The issue is that the app stops and then restarts 20 minutes later, so yes, any events that arrive between it going offline and coming back are lost. There’s no playback mechanism for the API.

I haven’t checked whether there’s an event ID associated with the event actions. I will look into that and see if I can store the IDs in Redis. I’m using sessions, so I hope the sessions won’t break, but it would certainly allow me to have a standby machine.
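A minimal sketch of what storing the IDs in Redis could look like, assuming the events do turn out to carry a unique ID. It relies on the `SET` command with `NX` (only set if the key is absent) and `EX` (expiry in seconds), which redis-py exposes as `set(name, value, nx=True, ex=...)`. The `claim_event` helper, the `events:seen:` key prefix and the one-hour TTL are my own assumptions, and `FakeRedis` is only an in-memory stand-in so the sketch runs without a server.

```python
import time
from typing import Dict, Optional, Tuple


def claim_event(client, event_id: str, ttl_seconds: int = 3600) -> bool:
    """Return True only for the first caller to claim this event ID.

    SET with NX and EX checks and marks the ID in one atomic command,
    so two machines racing on the same event cannot both win.
    """
    key = f"events:seen:{event_id}"  # hypothetical prefix, kept apart from session keys
    return bool(client.set(key, "1", nx=True, ex=ttl_seconds))


class FakeRedis:
    """In-memory stand-in mimicking redis-py's set(name, value, nx=..., ex=...)."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[str, float]] = {}  # key -> (value, expiry)

    def set(self, name: str, value: str, nx: bool = False, ex: Optional[int] = None):
        now = time.monotonic()
        current = self._data.get(name)
        if current is not None and current[1] <= now:
            current = None  # key has expired
        if nx and current is not None:
            return None  # redis-py returns None when NX blocks the write
        expiry = now + ex if ex is not None else float("inf")
        self._data[name] = (value, expiry)
        return True
```

Against a real server, a redis-py client (for example one built with `redis.Redis.from_url`) would take the place of `FakeRedis()`. Using a separate key prefix should keep the dedupe keys from colliding with whatever keys the sessions use.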

Cheers @dangra for the suggestion
