https://dicebot.idk.club (https://dicebot.fly.dev) stopped service.

We received a user report (idkclub/dicebot Issue #66, “Stopped working today.”) that the hosted service had stopped responding to requests. This app normally sees continuous usage from ~10,000 Slack teams, but when I checked, the metrics showed no usage for the last two days (the maximum range of the graph).

As a general debugging step, I redeployed the same code that was deployed three months ago, which had been working since then (or at least hadn’t had any reported issues :grimacing:), and the service is now back up.

Is it expected that deployed apps fail after a set amount of time, or is there another issue here? All code is available on GitHub at idkclub/dicebot (🎲 /roll support for Slack), and this issue is filed in response to idkclub/dicebot Issue #66 (“Stopped working today.”).

Please let me know if any other details would be helpful!

I don’t see any obvious problems; it looks like it might’ve just hung and stopped responding at some point. Our health checker didn’t have any problems connecting, but the logs from the process just stopped.

If you run fly scale count 2, it’ll give you some redundancy when one process goes awry.

I ran it, and it seems to have generated a new version (although I’m not sure where the current concurrency is visible?). The code in question ran for year(s) on Vercel, so I do wonder if this is exposing a latent bug in the serving infrastructure. Hopefully the manual scaling gives some redundancy, to at least lengthen the MTBF.

I guess as a feature request, it’d be nice to know when something has gone wrong (although I get the difficulties if the health checker doesn’t see it). As a note, the HTTP service hung completely when I tried to connect. I’m not sure what the health check does, but could it be improved by hitting the service through the defined gateway?
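
To illustrate what I mean by checking through the gateway, here is a minimal sketch of an external probe, assuming a plain GET against the public URL is a fair proxy for "the service is answering". The function name probe and the 5 second default timeout are my own choices for illustration, not anything Fly or dicebot defines. The point is that a server which accepts the connection but never replies is reported as unhealthy once the timeout expires, instead of counting as up because the connect succeeded.

```python
import socket
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 5.0) -> tuple[bool, str]:
    """Issue one GET against url and return (healthy, detail).

    Hypothetical helper: a response with status 200-399 counts as healthy.
    HTTP errors, refused connections, and timeouts all count as unhealthy.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400, f"HTTP {resp.status}"
    except urllib.error.HTTPError as err:
        # The server answered, but with a 4xx/5xx status.
        return False, f"HTTP {err.code}"
    except (urllib.error.URLError, socket.timeout, TimeoutError, ConnectionError) as err:
        # Connection refused, DNS failure, or a hung server that never replied.
        return False, f"unreachable: {err}"
```

Running something like probe("https://dicebot.fly.dev/") on a schedule from outside the platform, and alerting when the first element is False, would have flagged this outage without waiting for a user report.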

Thanks again!