I have something similar. Here's my architecture, which is also how I keep background jobs on multiple machines from contending with each other:
I am building an API in my middle-tier app, Distributor. It receives requests from Web that write to the database. I also have several Supervisord-managed processes in the background that watch for database changes using simple loop polling:
- If a status is `requested` and the record has not been processed, clone a template (stopped) machine and start it, and change the status to `starting`
- If a status is `starting` and the target machine can receive requests, send the request and change the status to `started`
- If there are too many machines running, add a minute to the request start time and change the status to `delayed`
- There's a process to put `delayed` requests back to `requested` if the running machine count drops
- Every time a request is rejected it bumps up a `retry` count, and there's a process to permanently reject requests that are rejected too many times (e.g. invalid request)
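To make the lifecycle concrete, here's a minimal sketch of the statuses (Python purely for illustration; the transition map is my reading of the list above, and where exactly the rejection edge hangs off is up to you):

```python
from enum import Enum

class Status(str, Enum):
    REQUESTED = "requested"  # written by the API when Web submits a request
    STARTING = "starting"    # a template machine has been cloned and is booting
    STARTED = "started"      # the request has been handed to the running machine
    DELAYED = "delayed"      # too many machines running; try again in a minute
    REJECTED = "rejected"    # rejected too many times (e.g. invalid request)

# Each background process owns exactly one of these edges.
TRANSITIONS = {
    Status.REQUESTED: {Status.STARTING, Status.DELAYED},
    Status.STARTING: {Status.STARTED},
    Status.DELAYED: {Status.REQUESTED},
}
```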
This may seem like a lot of work, but each process is only a bit of SQL run against a managed database, plus a bit of language logic to move records between statuses. Some queries have to be protected against race conditions, bearing in mind that each of these processes will have redundant copies running.
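As an example of the "bit of SQL plus a bit of language logic", here is a rough sketch of the `requested` → `starting` worker. It assumes PostgreSQL and psycopg2, and the table/column names (`requests`, `requested_at`, `template_machine`) and the `clone_and_start_machine` helper are invented for the example. The `FOR UPDATE SKIP LOCKED` claim (or an equivalent atomic `UPDATE` guarded by the current status) is what stops redundant copies of the same worker from grabbing the same record:

```python
import time

import psycopg2

# Claim one unprocessed request and flip it to 'starting' in a single statement.
# SKIP LOCKED means a second copy of this worker simply takes the next row.
CLAIM_SQL = """
    UPDATE requests
       SET status = 'starting'
     WHERE id = (
            SELECT id
              FROM requests
             WHERE status = 'requested'
             ORDER BY requested_at
             LIMIT 1
               FOR UPDATE SKIP LOCKED
           )
 RETURNING id, template_machine;
"""

def clone_and_start_machine(template):
    """Placeholder for the real 'clone the stopped template and boot it' call."""

def run_forever(dsn, poll_interval=5):
    conn = psycopg2.connect(dsn)
    while True:
        with conn, conn.cursor() as cur:  # 'with conn' commits (or rolls back) the claim
            cur.execute(CLAIM_SQL)
            row = cur.fetchone()
        if row is None:
            time.sleep(poll_interval)     # nothing to do: simple loop polling
            continue
        request_id, template = row
        clone_and_start_machine(template)
        print(f"request {request_id}: machine starting from template {template}")
```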
Your situation will be a little different in that, rather than starting machines on demand, you have a pool of running ones. That's just another process in a process manager that starts extra machines beyond the number occupied by real users. You might also have something to destroy machines that no longer have a user, so that the free pool is kept at a consistent size.
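For what it's worth, a pool-maintenance loop in that style might look like this sketch; the target size and the `count_free_machines` / `start_machine` / `destroy_machine` helpers are all hypothetical stand-ins for your own infrastructure calls:

```python
import time

TARGET_FREE = 5      # how many unoccupied machines to keep warm (pick your own number)
POLL_INTERVAL = 30   # seconds between checks

def count_free_machines() -> int:
    """Placeholder: count machines that currently have no user attached."""
    return 0

def start_machine() -> None:
    """Placeholder: clone the stopped template machine and boot it."""

def destroy_machine() -> None:
    """Placeholder: tear down one idle machine."""

def maintain_pool():
    while True:
        free = count_free_machines()
        if free < TARGET_FREE:
            for _ in range(TARGET_FREE - free):
                start_machine()      # top the pool back up
        elif free > TARGET_FREE:
            for _ in range(free - TARGET_FREE):
                destroy_machine()    # shrink the pool back down
        time.sleep(POLL_INTERVAL)
```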