Autostop is causing problems for my streaming web API

I have a web API which returns streamed data over an extended period (potentially as long as 5 minutes). The usage of this app is minimal - generally one user issuing no more than a request every few minutes.

As a result, a long running request is typically the only thing running on my app. That’s a problem, as the autostop process sees no incoming requests, and assumes the app can be suspended. When it does, that kills the streaming process that’s sending data to the client :frowning: What’s the best way to fix this? Can the autostop mechanism be configured to detect long-running streaming requests?

I’m happy to consider app-level fixes. I’m very new to web application development, and the streaming API is the most complex thing I’ve done. If someone told me “that’s the wrong way to tackle this problem”, I frankly wouldn’t be surprised!

One possibility I thought of is that I could require the client to call a “heartbeat” API every second while the streaming is happening, as that would look like activity to the autostop process. But that requires the client to know implementation details about the service that I’d like to avoid exposing.

Alternatively, I’ve heard that server-sent events are another way to continuously supply data to a client - are they a possible approach here? I don’t want to spend a lot of time redesigning my code with an approach I’m not familiar with, only to find that it has the same problem, so I’d appreciate it if someone could confirm that it would work better with autostop, before I go down that route…

I guess I’m surprised that streaming responses cause this issue - they seem like a pretty normal thing to do in webapps, from what I’ve seen. Is it just that fly’s default machine management isn’t really designed to support them? After all, my app is very unusual in terms of the incredibly low level of activity it’s designed for. On the other hand, I don’t want to switch off autostop, as that’s a huge waste of resources for an app used so infrequently.

Thanks for any suggestions.


Is it the server’s response that’s being streamed, or the original request body?

The Fly.io infrastructure should really be able to handle the former, but there have been glitches in the past; perhaps those have resurfaced, :thinking:

The server’s response. I agree, it seems like something I’d expect to be handled by the infrastructure.

I don’t really understand the discussion in the linked thread on glitches. But the discussion seems to be around http_service.concurrency.type - mine is unset so I guess it’s the default of “connections” (which seems more likely to be correct based on what I understand of the docs).

That’s how it started, but it turned out in the end there was a bug in the infrastructure, :sweat_smile:

I just tried a small test app with a super-simple streaming text/plain response—no fancy client-side keepalives, WebSockets, or anything—and it was able to stream for the full 10 minutes. (Auto-stop didn’t kick in until 3 minutes after the server had completely finished.)

primary_region = "ewr"

[env]
  RESPONSE_SECONDS = "600"

[http_service]
  internal_port = 8080
  auto_stop_machines = "suspend"
  auto_start_machines = true
  [http_service.concurrency]
    type = "connections"
    soft_limit = 5
    hard_limit = 10

[[restart]]
  policy = "never"

This writes one line every second. There’s a flush after each one, but that’s the only nuance.
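The actual code of my test app isn’t important, but for reference, the shape of it was roughly this (a minimal sketch; the names are illustrative, not the real code). It’s a plain WSGI app that streams one line of `text/plain` per second, handing each chunk to the server as soon as it’s produced:

```python
# Minimal sketch of the test app described above (names are illustrative,
# not the actual code). A plain WSGI app that streams one text/plain line
# per second for RESPONSE_SECONDS seconds; each yielded chunk is passed to
# the server as soon as it's produced, which is what keeps the connection
# looking active to the proxy.
import os
import time

def app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])

    def generate():
        for i in range(int(os.environ.get("RESPONSE_SECONDS", "600"))):
            yield f"line {i}\n".encode()  # one chunk per second
            time.sleep(1)

    return generate()

# Served under any WSGI server, e.g.:
#   gunicorn 'streamtest:app' --bind=0.0.0.0:8080
```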

And the App Concurrency panel in Grafana correctly showed 1 for the entire 10 minutes.

I was testing over Flycast, but generally there’s the same auto-stop behavior there (since it’s the same Fly Proxy).

If both sides (client and server) are idle for a long time, though, then something else in the public Internet might close the connection.

The app concurrency does seem to be showing 1. One oddity: when the machine started (it was idle before I did my test), I got an error from Fly Doctor in the Logs & Errors page, saying

Symptom: App is not listening to the expected port

Something in your code or configuration is preventing your app from listening on 0.0.0.0 at the port specified by your fly.toml internal_port 8080. Your users will see error 502 accessing your app. This is an issue with your application code.

That’s weird, because I’ve checked, and my app is listening on 0.0.0.0 - the command in the Dockerfile is `CMD gunicorn 'webmonaco:app' -k gthread --workers 4 --threads 4 --timeout 360 --bind=0.0.0.0:8080`.

And the message disappeared later, as if the proxy connection problem was just because the machine was still starting up - something I’d expect the infrastructure to deal with.

It actually feels like there’s some sort of timing issue going on - I can run the same query multiple times and get runtimes between 5 and 30 seconds, all of which seems to be due to the streaming (the process generating the streamed data reports a consistent 5 seconds of processing time, but varying elapsed times, as if the process reading the generator’s output wasn’t streaming the data fast enough).

I’ve got no idea how to debug or diagnose this. The application runs perfectly fine locally, using the same Dockerfile. There are no timing inconsistencies, no lost connections, nothing like that. It’s weird.

OK, I’ve done some more investigation. There’s definitely something odd going on here. I have two instances of the application - one running on fly, the other using the exact same Dockerfile and code, running on my local PC.

For the purposes of this explanation, there are two types of run that I do. “Calculation” is running my application, which sends the output of a calculation back to the client at the end, as one batch of data. “Progress” is the same as “Calculation”, but it streams back a progress indicator, basically a series of digits, “01234567890123…”, each one representing 1% of the calculation complete. The “Time” is the time the application reported that it took to complete.

Locally:

  1. Calculation, run multiple times: 3, 5, 3, 3, 4 seconds.
  2. Progress, run multiple times: 4, 4, 4, 4, 5, 4 seconds.
  3. Another “Calculation” run: 4, 4, 4, 4, 4 seconds.

Running the same test on the fly.io hosted version:

  1. Calculation, run multiple times: 5, 5, 5, 5, 5 seconds.
  2. Progress, run multiple times: 5, 5, 9, 14, 202, 31, 135, 136 seconds
  3. Another “Calculation” run: 63, 5, 135, 155, 5, 15, 152, 23 seconds

Apart from the wildly varying times in the second line, I’m extremely confused as to why the third line shows extended times. It’s the same as the first line, and I can only assume that running the “Progress” experiment has somehow persistently affected the machine, so that it no longer runs the “Calculation” case reliably.

If the local run had exhibited the same behaviour, I’d automatically assume this was a problem in my application logic. But given that the inconsistencies only happen on the fly-hosted version, I’m unable to explain what’s happening in terms of the application code.

For what it’s worth, here’s my `fly.toml`:

# fly.toml app configuration file generated for applaunch on 2026-01-20T13:16:50Z
#
# See https://fly.io/docs/reference/configuration/ for information about how to use this file.
#

app = 'webmonaco2'
primary_region = 'lhr'

[http_service]
  internal_port = 8080
  force_https = true
  auto_stop_machines = 'stop'
  auto_start_machines = true
  min_machines_running = 0
  processes = ['app']

[[vm]]
  memory = '1gb'
  cpu_kind = 'shared'
  cpus = 1
  memory_mb = 1024

Any suggestions as to what’s going on, or how to fix this, would be gratefully accepted!

Your local machine is different from the Fly Machines. What exactly does your calculation do, etc.?

Sorry, as I say I’m very much a beginner here. How is my local machine different from the fly machines? I thought the advantage of docker was that it (more or less) provided the same environment wherever the container was run? I realise that “more or less” is the problem here, but I don’t know what I should be looking for.

Are you suggesting that the fly machines are running out of RAM, or don’t have enough CPU (things that I know will differ)? That’s a possibility, but I don’t see how that would affect subsequent requests, once the first one has completed and cleaned everything up - I did a lot of testing to try to catch any potential leaks.

I can explain how the webapp works in more detail, but I don’t want to impose too much on people here if I can avoid it. I’m happy to do my own debugging of my code, once I know what differences I should be worrying about in the runtime environments.

As a very broad overview, my webapp works as follows. It’s a Python application, written using flask, with a single endpoint /run. The client POSTs a JSON payload containing a list of arguments, and the endpoint spawns a subprocess, running an executable deployed with the application, with the given arguments. The endpoint then fires off three threads, one to read the subprocess stdout, one to read stderr, and one to monitor the process every second. All three send data as collected, via a queue, to the main thread, which then yields that data back to the endpoint response as streamed data in line-delimited JSON format. The time consuming work is done by the called application, but I don’t have control over that, so I have to assume it’s not the problem…
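In case it helps, here’s the rough shape of that endpoint, heavily simplified (the names and the `./calculation` executable are illustrative, not my real code, and I’ve left out the monitor thread). In the real app the generator is handed to Flask’s `Response(..., mimetype="application/x-ndjson")` so each line is streamed to the client as it’s produced:

```python
# Heavily simplified sketch of the endpoint described above (names and the
# "./calculation" executable are illustrative; the monitor thread is
# omitted). One thread per subprocess stream feeds a shared queue; a
# generator drains the queue and yields line-delimited JSON.
import json
import queue
import subprocess
import threading

def pump(stream, kind, q):
    # Forward each line from one subprocess stream into the shared queue,
    # then signal end-of-stream.
    for line in iter(stream.readline, ""):
        q.put({"type": kind, "data": line.rstrip("\n")})
    q.put({"type": "eof", "data": kind})

def stream_run(args):
    # Spawn the calculation executable and stream its output as NDJSON.
    proc = subprocess.Popen(
        ["./calculation", *args],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True,
    )
    q = queue.Queue()
    for stream, kind in ((proc.stdout, "stdout"), (proc.stderr, "stderr")):
        threading.Thread(target=pump, args=(stream, kind, q), daemon=True).start()

    done = set()
    while done < {"stdout", "stderr"}:  # until both streams reach EOF
        item = q.get()
        if item["type"] == "eof":
            done.add(item["data"])
            continue
        yield json.dumps(item) + "\n"  # line-delimited JSON
    yield json.dumps({"type": "exit", "data": proc.wait()}) + "\n"
```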

Sounds like your app is getting cpu throttled.

Try switching to performance and see if that helps.
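If it is throttling, that’s a small change to the `[[vm]]` section of your fly.toml (a sketch based on the config you posted), or `fly scale vm performance-1x` from the CLI:

```toml
[[vm]]
  memory = '1gb'
  cpu_kind = 'performance'  # was 'shared'
  cpus = 1
```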


Oh, thanks. That’s an interesting possibility. I’ll look into it.

The older version of the app (which buffered all the output and did a synchronous response with everything at once) didn’t have this issue, as far as I’m aware. I would have expected it to if it’s CPU. I’ll try running some tests on the old version to see if it was having problems, and they just never got reported to me…

Thanks for that, it’s a good lead to investigate :slightly_smiling_face:

I’m becoming more and more convinced this is CPU throttling. But it didn’t happen with the older version of the app, and while the application code itself has changed, the new code really shouldn’t be using more CPU than the old one (the CPU-bound part of the code is unchanged).

Is there a way to create an exact copy of the old app under a new name? I’d like to clone it, deploy the new code to the clone, and see if that still has the same issue. I can deploy using the existing fly.toml, but I need to create the app first (fly app create). And the defaults seem to have changed since I created the original - new apps seem to get 2 machines, where my old app only has one, for example. Of course, I can change things after creation via commands like fly scale, but I don’t know which particulars of the configuration might be the issue, and I don’t know how to get details of the full config to do a diff (fly.toml doesn’t include details of how many machines are assigned, for example).

Newly launched apps default to 2 instances. You can check the metrics on grafana to see the historical cpu usage.

Yes, that’s what I see. But my current live app has one instance, and I’m 99.99% certain I didn’t deliberately reduce the number of instances, so I guess that when that app was created (more than 2 years ago - I can’t find the exact date), the default was 1. That’s the point I’m getting at: I don’t know how to create a new app the same as the old one, because “use the defaults” appears to have changed, and I don’t know how to get the information needed to override the current defaults to match how the old app was created.

fly m list will show the CPU and RAM settings, and the [[vm]] stanza in fly.toml will override those aspects of current Machines (except in a few unusual corner cases).

I don’t think you have volumes, but, if you did, you would also want fly vol list.

There unfortunately isn’t a wholesale app-cloning operation. (That comes up a fair amount, when people learn that apps can’t be renamed.)

Yeah, one Machine is a bad idea, incidentally. That’s the entire story of why the default was changed, :dragon:

Thanks, that’s helpful.

I’m not surprised. For my app, with basically two users and maybe a few requests a month, I doubt it matters. But for anything with real traffic, I can see that a single machine is a bad choice (especially as a default, which is what people like me who don’t know what they are doing will get).

At the moment, I’m trying to change as little as I can at once (given that I made the mistake of starting by doing a huge change that messed everything up!) to make sure I minimise the variables I’m dealing with.

OK, I had an inspiration, and realised I can run my old code on the new app, and test that (doh, I really should have realised and tried that much earlier :slightly_frowning_face:). And that works without the slowdowns. Reintroducing my changes one at a time has established that it’s when I add the streaming code that the problem starts. So it is in my code, even though I can’t see what might cause this on the fly machines but not in a local docker container (CPU throttling remains my best guess for that, but I don’t see why my new code would trigger throttling when the old version didn’t). I don’t know how to fix the problem now I’ve found it, but that’s for me to deal with.

Anyway, problem “solved” - it’s basically user error. Sorry for taking up people’s time with this. I got a lot of useful information from this thread in any case, so I appreciate the help people gave.


Streaming can be CPU intensive, so that might have pushed your app beyond the limit. You should switch to a performance machine if you don’t want to get throttled.

Unfortunately, a performance machine still has the same issue. Maybe slightly less drastic, but definitely still there. So I think I’m doing something wrong, beyond just overheads from streaming vs sending all the data in one go. I just wish I knew what :slightly_smiling_face: