Feature Request: "busy" endpoint/busy check endpoint for fly.toml

Quite often an HTTP request will kick off a long-running job. For me, it happens to be GitHub webhooks, which I must answer with a 2xx within 10 seconds, or GitHub will time out the request.

So in my case, I simply create a work item on a queue and then immediately return a 2xx.
However, it would be great if I could enable “scale-to-zero” on fly.io for my service. Right now that isn’t possible, because it will scale to zero well before my job queue is empty.
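The enqueue-and-acknowledge pattern described above might look roughly like this minimal sketch using only Python's standard library. The handler class, the module-level jobs queue, and the choice of 202 Accepted are illustrative assumptions (the post only requires some 2xx), and it omits webhook signature verification.

```python
import http.server
import queue

# Hypothetical in-process queue of raw webhook payloads for a background worker.
jobs: "queue.Queue[bytes]" = queue.Queue()


class WebhookHandler(http.server.BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        # Read the raw webhook body.
        length = int(self.headers.get("Content-Length", 0))
        payload = self.rfile.read(length)
        # Defer the slow work: just record it on the queue.
        jobs.put(payload)
        # Acknowledge right away so the sender's timeout is never hit.
        self.send_response(202)
        self.send_header("Content-Length", "0")
        self.end_headers()
```

To serve it, something like http.server.ThreadingHTTPServer(("0.0.0.0", 8080), WebhookHandler).serve_forever() would work; the port is an assumption and must match whatever the app's fly.toml routes to.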

There are many other situations where this could be useful too. Therefore, I propose the following feature:

  • A busy endpoint (busy check), configurable in fly.toml just like health check endpoints, that Fly can query to see whether an instance can be shut down. In the future this could also enable functionality such as persistent leaders (keeping the leader alive to avoid a leader re-election) and a dynamic number of worker/helper instances.

We don’t have plans to add anything like that for the moment, however here’s something that might work:

By default, we send a SIGINT to kill your app, as documented here (search for kill_signal). You can also bump the timeout (the kill_timeout setting) to a maximum of either 5 minutes or 1 day, depending on what VM type you’re using.

What I’d do is bump the timeout and add a signal handler that sets a global flag (or some other globally accessible state) when SIGINT is received. Then, when your job queue is empty or at other appropriate points, check that flag and exit if it’s set.
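Here is a minimal Python sketch of that suggestion, assuming a single worker loop pulling from an in-process queue.Queue. The names shutdown_requested, handle_sigint, install_handler, run_worker and process are hypothetical. The handler only sets a threading.Event, and the loop checks it each time the queue turns out to be empty.

```python
import queue
import signal
import threading
from typing import Callable

# Hypothetical global flag: set by the signal handler, read by the worker loop.
shutdown_requested = threading.Event()


def handle_sigint(signum: int, frame: object) -> None:
    # Don't exit here; just record that a shutdown was requested.
    shutdown_requested.set()


def install_handler() -> None:
    # signal.signal() must be called from the main thread.
    signal.signal(signal.SIGINT, handle_sigint)


def run_worker(
    jobs: "queue.Queue[object]",
    process: Callable[[object], None],
    poll_interval: float = 1.0,
) -> None:
    """Process jobs; return only once the queue is empty and SIGINT was seen."""
    while True:
        try:
            job = jobs.get(timeout=poll_interval)
        except queue.Empty:
            # The queue is drained: a safe point to honor the shutdown request.
            if shutdown_requested.is_set():
                return
            continue
        process(job)
        jobs.task_done()
```

The app would call install_handler() at startup, then run_worker(...), and call sys.exit(0) when it returns. If jobs keep arriving so the queue never empties, the platform's kill timeout still applies.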

Does that address your use case?

Cool. If the app finishes processing before the timeout expires, will it be shut down as soon as that processing completes?

Just guessing, but I’d assume not. If you’re setting a timeout and adding a handler, the onus is on you to do something like:

  1. If there’s no work in the job queue and the signal is received, exit.
  2. If there’s work in the job queue, set some global state and return from the handler.
  3. When processing work, check the value of this global state when the queue is empty. If it indicates signal receipt, exit.
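The three steps above could be sketched in Python as follows. This is one possible variant, not a Fly-provided mechanism: it assumes a background worker thread and an in-process queue.Queue, and the function names are hypothetical. Instead of tracking "is there work?" by hand, it relies on the documented Queue.task_done()/Queue.join() pair. join() returns immediately when nothing is queued or running (step 1) and otherwise blocks until the unfinished-job count reaches zero (step 3). The signal handler itself only sets state and returns (step 2).

```python
import queue
import signal
import sys
import threading
from typing import Callable


def worker(jobs: "queue.Queue[object]", process: Callable[[object], None]) -> None:
    # Background thread: always mark a job done so join() can't hang on it.
    while True:
        job = jobs.get()
        try:
            process(job)
        except Exception as exc:  # keep the worker alive after a bad job
            print(f"job failed: {exc!r}", file=sys.stderr)
        finally:
            jobs.task_done()


def wait_for_drain(
    jobs: "queue.Queue[object]", stop: threading.Event, poll_interval: float = 1.0
) -> None:
    # Wait until the signal handler has recorded SIGINT (step 2).
    while not stop.wait(timeout=poll_interval):
        pass
    # Steps 1 and 3: returns at once if no job is queued or running,
    # otherwise blocks until the count of unfinished jobs drops to zero.
    jobs.join()


def run(jobs: "queue.Queue[object]", process: Callable[[object], None]) -> None:
    # Hypothetical entry point; must run in the main thread.
    stop = threading.Event()
    signal.signal(signal.SIGINT, lambda signum, frame: stop.set())
    threading.Thread(target=worker, args=(jobs, process), daemon=True).start()
    wait_for_drain(jobs, stop)
    sys.exit(0)
```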

Once you’re handling signals, it’s your responsibility to do what the sender of those signals expects.

The correct way to do this is to disable auto_stop_machines and have your worker process exit when it’s done running jobs (or hasn’t been running jobs for x time, or whatever other criteria you need).
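A rough sketch of the "exit after x time without jobs" criterion, assuming auto_stop_machines is disabled and jobs come from an in-process queue.Queue. The function name and the idle_timeout parameter are hypothetical. Idle time here is measured from the end of the last job, because each get() waits at most idle_timeout seconds for the next one.

```python
import queue
from typing import Callable


def run_until_idle(
    jobs: "queue.Queue[object]",
    process: Callable[[object], None],
    idle_timeout: float,
) -> int:
    """Process jobs until none arrives for idle_timeout seconds; return the count."""
    processed = 0
    while True:
        try:
            # Wait at most idle_timeout seconds for the next job.
            job = jobs.get(timeout=idle_timeout)
        except queue.Empty:
            # Idle long enough: the caller can now exit the process.
            return processed
        process(job)
        jobs.task_done()
        processed += 1
```

The caller would then end the process, for example with sys.exit(0), which is the exit the reply above relies on to let the instance stop.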


Thanks for both of your answers 🙂

They each solve a different situation I have right now.
