Severe API request delays during batch processing

I’m experiencing inconsistent performance with my Fly.io-deployed application. The app handles high-volume batch processing jobs efficiently, demonstrating good concurrency. However, when I send individual API requests via Postman while these batch jobs are running, I see extremely slow response times (30-60 seconds) or outright timeouts.

Key points:

  • App performs well under high load from batch processing jobs

  • Individual API requests are fast when batch jobs are not running

  • During batch processing, API requests from different IPs experience severe delays

  • This suggests a potential issue with how the load balancer or resource allocation handles mixed traffic types

My goal is to maintain responsiveness for individual API requests while batch jobs are running in the background. Could fly.io investigate if there’s an issue with how concurrent connections from different sources are being managed, particularly in relation to the load balancer or resource allocation?

I appreciate there are architectural solutions, like splitting resources so batch processing runs on a separate worker. Perhaps naively, I still expect the service to handle one or two extra ad hoc requests while batch processing jobs are running.

Any ideas / possible reasons why this might be occurring?

[http_service]
  internal_port = 8080
  force_https = true
  auto_stop_machines = true
  auto_start_machines = true
  processes = ["app"]
  [http_service.concurrency]
    type = "connections"
    hard_limit = 500
    soft_limit = 100

[[vm]]
  cpu_kind = "shared"
  cpus = 1
  memory_mb = 2048

[autoscaling]
  min_machines = 0  # Allow scaling down to 0 machines
  max_machines = 4
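
I’ve also wondered whether the concurrency settings play a part, since they count open connections rather than in-flight requests. Fly’s proxy also supports request-based concurrency; a sketch of that alternative (the limits below are purely illustrative, not a recommendation):

[http_service.concurrency]
  type = "requests"   # count in-flight HTTP requests instead of open connections
  hard_limit = 250    # illustrative value, tune to the workload
  soft_limit = 200    # above this, the proxy prefers other machines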

Have you looked at your memory and CPU load during the period of high latency? If I had to guess, your memory was near max (~80%+), which caused the app to come to a halt.

Hey @khuezy, thanks for the pointer. I just checked memory and it’s very close to the limit. I have some big Python packages, so I’ll bump memory to 4 GB and see what happens.
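
For anyone following along, that’s just the memory line in the vm block (everything else unchanged):

[[vm]]
  cpu_kind = "shared"
  cpus = 1
  memory_mb = 4096  # bumped from 2048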

You should separate your batch jobs into their own app/process.
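
For example, with process groups in fly.toml each group runs on its own machines, so batch work can’t starve the web process. A rough sketch (the start commands below are placeholders for whatever your actual entrypoints are):

[processes]
  app = "gunicorn main:app"          # placeholder web entrypoint
  worker = "python batch_worker.py"  # placeholder batch entrypoint

[http_service]
  internal_port = 8080
  processes = ["app"]  # only the web group receives proxy traffic

Each group can then be scaled and sized independently.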

Yeah, basically I have 3 batch jobs as their own Docker apps that communicate heavily with 2 micro-services, each running as its own app/process. The micro-services are exposed over HTTP, both internally and externally behind an API gateway. If an external request hits a micro-service while it’s under heavy load from the batch jobs’ reads/writes, I get the errors mentioned above. Splitting the micro-services into smaller services with their own compute would probably do the trick, but that’s quite a bit of work, so I’ll go the memory route as a short-term fix.

Strange: even with 4 GB of RAM, average memory usage at 1.8 GB, and roughly 150 concurrent connections against the limit of 500, I still get nearly 2-minute response times if I hit a route from Postman while a job is running.

If memory is no longer the issue (keep in mind it’s not about the average, but the max usage), then you’re bottlenecked by your single shared CPU. It all depends on what your request is doing, though. You’ll need to change it to performance and scale the number of CPUs up.
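
In fly.toml terms that’d look something like this (the CPU count is illustrative):

[[vm]]
  cpu_kind = "performance"  # dedicated cores instead of shared
  cpus = 2                  # illustrative, scale to your workload
  memory_mb = 4096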
