Both machines of one of my apps just stopped

Hey,

For one of my apps I run two machines (in two different datacenters), and both of them just stopped for a 5-hour period (until I manually restarted them).

Based on the memory and CPU graphs nothing out of the ordinary happened before the shutdown.

The app is also configured with active health checks, which I assumed would have restarted the instance in case of a crash. Can someone from the Fly team provide some more insight? The app in question is cfps-cors-proxy.

Just following up so this does not get closed

That’s scary… hopefully someone can provide some insight.

Hi @fkrauthan,

Your application seems to be crashing quite often. By default, Machines restart automatically when they crash, but only up to the configured limit (the default restart policy is to retry 10 times within a 5-minute interval). Your machines crashed repeatedly and too quickly, so they hit that limit and were left stopped without restarting. (You will find a “machine has reached its max restart count of 10” line in your application logs.) I’d suggest debugging your application, but if frequent crashing is okay for your use case, you can either adjust the restart policy or configure autostart on your services, so that a start is always attempted for your machines to serve any incoming request.

Hi @wjordan,

Thanks for looking into it, but where exactly do you see these crashes? For some reason the UI for my app seems to be bugged: the Health check changes indicate 100 change(s) during the past 48 hours, but when I look at the actual list, first of all it always seems to be only one of the two nodes (which is already strange in and of itself), and second, there are at most 13 events for the last 2 days.

Also, searching for “Running CORS proxy on” (the first log line after service start) doesn’t show any excessive restarts (though still more than there should be). The service is a very simple Node.js HTTP server, and I don’t see any logs indicating why it is supposed to be crashing.

I will look into that a bit more. It would be great if the UI could be updated, though, because as it stands I never noticed that there might be an actual issue.

In your application logs over the time period mentioned, you will find hundreds of crashes and restarts. You can use the Search Logs feature (see docs) to look through them.
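If you export the logs, a quick way to see how often the service restarted is to count the startup message per hour. The following is only a rough sketch: it assumes each log line begins with an ISO 8601 timestamp, which may not match the real log format, and `countStartsPerHour` is a made-up helper name, not part of any Fly.io tooling. The marker text “Running CORS proxy on” is the startup line mentioned earlier in the thread.

```typescript
// Count how often a startup marker appears in each hour of a log export.
// Assumption (hypothetical format): every line starts with an ISO 8601
// timestamp such as "2024-01-01T10:02:03Z", so the first 13 characters
// ("YYYY-MM-DDTHH") identify the hour.
function countStartsPerHour(lines: string[], marker: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const line of lines) {
    if (!line.includes(marker)) continue; // skip lines that are not a service start
    const hour = line.slice(0, 13);
    counts.set(hour, (counts.get(hour) ?? 0) + 1);
  }
  return counts;
}

// Usage with made-up sample lines:
const exampleLines = [
  "2024-01-01T10:02:03Z app[1] Running CORS proxy on port 8080",
  "2024-01-01T10:02:09Z app[1] some other message",
  "2024-01-01T10:40:00Z app[1] Running CORS proxy on port 8080",
  "2024-01-01T11:05:00Z app[2] Running CORS proxy on port 8080",
];
for (const [hour, count] of countStartsPerHour(exampleLines, "Running CORS proxy on")) {
  console.log(`${hour}: ${count} start(s)`);
}
```

An hour with many starts points at a crash loop, even if the health check list in the UI only shows a handful of events.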

Ah interesting, I just saw that. I found the issue (pesky security scanning robots)… Thanks for looking into it. It just felt strange since the UI was a bit inconsistent.
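For anyone landing here with a similar problem: the thread does not say what exactly the scanners sent or how the proxy crashed, so the following is only a guess at one common failure mode. Scanner traffic often contains malformed percent-escapes, and `decodeURIComponent` throws a `URIError` on those, which can take down a simple server if it is not caught. The names `decodeTarget`, `handleRequest` and `ProxyResult` are hypothetical and not taken from the actual cfps-cors-proxy code.

```typescript
// Hypothetical result type for this sketch; not part of any Fly.io or Node.js API.
type ProxyResult = { status: number; body: string };

// Decode the proxy target from a request path such as "/https%3A%2F%2Fexample.com".
// decodeURIComponent throws a URIError on malformed escapes (e.g. "%E0%A4%A"),
// so the call is wrapped and bad input is turned into null instead of a crash.
function decodeTarget(path: string): string | null {
  try {
    const decoded = decodeURIComponent(path.replace(/^\/+/, ""));
    // Only accept absolute http(s) URLs; everything else is rejected.
    return /^https?:\/\//i.test(decoded) ? decoded : null;
  } catch {
    return null;
  }
}

// Answer with a 400 for anything that does not decode to a usable target,
// rather than letting the exception escape the request handler.
function handleRequest(path: string): ProxyResult {
  const target = decodeTarget(path);
  if (target === null) {
    return { status: 400, body: "Bad target URL" };
  }
  return { status: 200, body: `Would proxy to ${target}` };
}
```

The point of the design is that a bad request costs one 400 response instead of a process restart, so junk traffic can no longer burn through the restart limit.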

I think the issue is resolved.