Hi! I’m building out a POC for our application, and I’m running into an issue where requests to our app running in fly.io time out intermittently. I’ll get a bunch of good requests, then one or a few will time out. Then a bunch more good ones, and then another set of timeouts. I originally thought we might be running out of RAM, since we were running pretty close to the 256MB limit. I upped the containers to 1GB of RAM, but I’m still seeing the issue.
The containers are physically close to my location (within about 50 miles), and the exact same container running in our other environment does not exhibit this behavior. Thanks for your help!
These are hard to troubleshoot, but here’s what I’d check:
Are you hitting the configured concurrency limit? You’ll see messages in the logs when this happens, or on the metrics page (run flyctl dashboard metrics to get there).
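For reference, that limit lives under your [[services]] block in fly.toml; the numbers below are just placeholders, not your actual config:

[services.concurrency]
  type = "connections"
  hard_limit = 25
  soft_limit = 20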
If your app crashes, you could see timeouts like this while we create a new instance. Try running flyctl status --all and see if there are any failed instances. You can check the logs for any given instance by running flyctl logs -i <id>.
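For example (the instance ID here is made up; use one from the status output):

flyctl status --all        # lists every instance, including failed ones
flyctl logs -i 148e21a1    # tails the logs for that one instance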
If you post your app name here I’ll have a look and see if I can unearth anything.
We are definitely not hitting any limits. My test is 1 request per second from 1 source, and the request takes about 100ms to complete, so there is no concurrency there at all. Also, the metrics page shows us well within all tolerances (as do the other metrics I’ve been able to view in Grafana).
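For what it’s worth, the test is roughly a loop like this (the URL here is a stand-in, not the real endpoint):

# roughly one request per second; prints status code and total time for each
while true; do
  curl -s -o /dev/null --max-time 120 -w "%{http_code} %{time_total}\n" https://example-app.fly.dev/check
  sleep 1
done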
The app is not crashing per the logs or the status page.
The app name is “test-identity” (it’s a test). I can DM you the test command (it’s just a curl) so you can see the behavior for yourself.
I see you upped the RAM, but not the CPUs. Does your app rely on multiple threads doing things at the same time? The VM size you’re on has 1/8th of a CPU.
These work well for single-threaded apps/runtimes, but they might not work so well for multi-threaded apps. If you’re listening on one thread and doing other (possibly IO-intensive) things from time to time on a different thread, you might not be able to accept connections in a timely fashion, or transfers might be sluggish.
Thanks for the follow-up. The app is basically just a Squid proxy along with some metrics collection tools: Squid (with the -N flag, so it’s just one process), fluent-bit for log shipping, collectd for metrics collection, and carbon-relay-ng for Graphite integration. None of that is particularly IO-intensive.
I’m not sure the behavior “feels” like a sluggish transfer, though. The curl connection is allowed 2 minutes to time out and is the only thing making a request, so there is not a whole lot of contention there. And it’s not that curl is slow to respond: the connection never returns.
@dan It’s worth trying a multi-CPU VM just to see if the behavior improves. We’ve seen some weird stop-the-world blocking with single-CPU VMs and multithreaded/multiprocess apps.
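Something like this should do it (the size name below is just an example; pick whatever fits):

flyctl platform vm-sizes                            # lists the available VM sizes
flyctl scale vm dedicated-cpu-2x -a test-identity   # example multi-CPU size, moving off the shared 1/8-CPU VM
flyctl scale show -a test-identity                  # confirm the new VM size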
You can also bypass our global load balancer and try the VM directly over IPv6. You’ll need to make sure your app is listening on both IPv4 and IPv6, print the FLY_PUBLIC_IP to the logs, and put this in your app config:
[experimental]
allowed_public_ports = [5000]
Then you should be able to use https://[ip_addr]:5000 (make sure you wrap the IP in []) to talk directly to the VM.
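For example, with a made-up address standing in for whatever your app logs from FLY_PUBLIC_IP:

# -g keeps curl from globbing the [] around the IPv6 literal;
# -k skips certificate checks, since the cert won’t match a bare IP
curl -g -k --max-time 120 "https://[2605:4c40:0:48::1]:5000/"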
Oh, no, the one in flyctl info is the global load-balanced IP.
I just realized you can try this over private networking if you set up a WireGuard peer. flyctl ips private will show the private IPs on the instances; you can even SSH to them if you want.
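Rough sequence, with a made-up private address (your private IPs will start with fdaa:):

flyctl wireguard create                # generates a peer config you can load into your WireGuard client
flyctl ips private -a test-identity    # lists each instance’s private IPv6 address
curl -g -k --max-time 120 "https://[fdaa:0:18:a7b:7d:0:1:2]:5000/"   # hit one instance over the tunnel (placeholder address, same port as the config above)
flyctl ssh console -a test-identity    # or shell straight into an instance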
Thanks for the follow-up, @kurt. After running into this all day Thursday, I was no longer able to reproduce it on Friday. That sort of thing makes me nervous, so I’m going to run a bunch of tests over the next couple of days to see if I can reproduce it.
I do notice some errors in flyctl logs, both connecting to a local Graphite port and connecting to an outside service URL. I have no idea if those are related!
The errors are related to our stats-shipping service, carbon-relay-ng, getting OOM killed. Once the app showed the same behavior in a 1GB container as it did in a 256MB container, I reduced the size back down to 256MB. It looks like carbon-relay-ng is leaking memory and then getting killed, so I’m looking for an alternate way to ship stats to Grafana. Either way, this shouldn’t be related to the network timeout issue, since Squid itself is not being restarted or OOM killed. I’m looking into using the private WireGuard connection to see if it’s some sort of networking thing.
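For the record, shrinking back down was just a scale command:

flyctl scale memory 256 -a test-identity    # back to the original 256MB VM size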