Outgoing requests slow after idling (not suspending)

Hi! I have a Discord bot written in Go, using the Disgo library.

Everything on the receiving end seems fine with regard to gateway events and such from Discord’s API. However, after the bot has been idle for a while, any request it sends to the Discord API takes a long time (~5 seconds), which means the bot is not able to respond to Interactions.

No settings regarding auto start/stop have been set, so it shouldn’t suspend the machine it’s running on, and the logs give no indication of it having been suspended either.

This existing topic seems related, but has no resolution: Outgoing request timeouts after idle time - Questions / Help - Fly.io

2024/08/21 23:40:49 INFO Interaction approve (application command) received user_id=<removed> guild_id=<removed> channel_id=<removed> delay=37.827499ms
2024/08/21 23:40:49 INFO Entered approvedInnerHandler
2024/08/21 23:40:54 ERROR Failed to defer message. err="10062: Unknown interaction"

The error there is the result of trying to send an HTTP request to Discord’s API. For some reason it occurs 5 seconds after the previous log line, which is written immediately before the request is attempted. Discord invalidates interactions that are not acknowledged within roughly 3 seconds, so a 5-second delay is enough to turn the deferral attempt into a “10062: Unknown interaction” error.

I am able to reproduce this with a minimal example too.

Go code for minimal example
package main

import (
	"fmt"
	"log/slog"
	"net/http"
	"time"
)

func main() {
	http.HandleFunc("/", indexHandler)
	http.HandleFunc("/ping", pingHandler)

	if err := http.ListenAndServe(":8080", nil); err != nil {
		slog.Error("Server stopped", "err", err)
	}
}

func indexHandler(w http.ResponseWriter, r *http.Request) {
	slog.Info("Received request",
		"path", r.URL.Path,
	)
	fmt.Fprintln(w, "Hello, world!")
}

// pingHandler makes a single outbound HTTPS request and reports how long
// the round trip took.
func pingHandler(w http.ResponseWriter, r *http.Request) {
	slog.Info("Received PING request")

	before := time.Now()
	resp, err := http.Get("https://adressa.no")
	rtt := time.Since(before)

	if err != nil {
		slog.Error("Error while pinging", "err", err)
		http.Error(w, "Error while pinging", http.StatusInternalServerError)
		return
	}
	// Close the body so the underlying connection can be reused.
	resp.Body.Close()

	slog.Info("Request completed", "rtt", rtt)

	fmt.Fprintf(w, "PONG - took %s", rtt)
}
Logs
2024-08-22T17:07:03Z app[080165dc699e28] iad [info]2024/08/22 17:07:03 INFO Received PING request
2024-08-22T17:07:03Z app[080165dc699e28] iad [info]2024/08/22 17:07:03 INFO Request completed rtt=239.26639ms
2024-08-22T17:07:04Z app[080165dc699e28] iad [info]2024/08/22 17:07:04 INFO Received PING request
2024-08-22T17:07:04Z app[080165dc699e28] iad [info]2024/08/22 17:07:04 INFO Request completed rtt=238.484879ms
2024-08-22T17:07:06Z app[328715e4b56e58] iad [info]2024/08/22 17:07:06 INFO Received PING request
2024-08-22T17:07:06Z app[328715e4b56e58] iad [info]2024/08/22 17:07:06 INFO Request completed rtt=233.487626ms
2024-08-22T17:07:08Z app[080165dc699e28] iad [info]2024/08/22 17:07:08 INFO Received PING request
2024-08-22T17:07:08Z app[080165dc699e28] iad [info]2024/08/22 17:07:08 INFO Request completed rtt=238.263449ms
2024-08-22T17:07:09Z app[328715e4b56e58] iad [info]2024/08/22 17:07:09 INFO Received PING request
2024-08-22T17:07:09Z app[328715e4b56e58] iad [info]2024/08/22 17:07:09 INFO Request completed rtt=234.442036ms
2024-08-22T17:22:53Z app[080165dc699e28] iad [info]2024/08/22 17:22:53 INFO Received PING request
2024-08-22T17:23:00Z app[080165dc699e28] iad [info]2024/08/22 17:23:00 INFO Request completed rtt=6.515873367s
2024-08-22T17:23:00Z app[080165dc699e28] iad [info]2024/08/22 17:23:00 INFO Received request path=/favicon.ico
2024-08-22T17:23:08Z app[080165dc699e28] iad [info]2024/08/22 17:23:08 INFO Received PING request
2024-08-22T17:23:08Z app[080165dc699e28] iad [info]2024/08/22 17:23:08 INFO Request completed rtt=234.470866ms
2024-08-22T17:24:28Z app[080165dc699e28] iad [info]2024/08/22 17:24:28 INFO Received PING request
2024-08-22T17:24:28Z app[080165dc699e28] iad [info]2024/08/22 17:24:28 INFO Request completed rtt=234.022815ms
2024-08-22T17:24:29Z app[080165dc699e28] iad [info]2024/08/22 17:24:29 INFO Received PING request
2024-08-22T17:24:29Z app[080165dc699e28] iad [info]2024/08/22 17:24:29 INFO Request completed rtt=233.597265ms

After the application has not been interacted with for about 15 minutes, the next outbound HTTP request takes over six seconds. Subsequent requests take around 235ms again.
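To narrow down where those six seconds go, the outbound request can be instrumented with the standard library’s net/http/httptrace, which splits the round trip into DNS, TCP connect, TLS handshake, and first-byte phases. Here’s a sketch (same target URL as above, everything else illustrative):

package main

import (
	"crypto/tls"
	"log/slog"
	"net/http"
	"net/http/httptrace"
	"time"
)

func main() {
	start := time.Now()
	var dnsStart, connStart, tlsStart time.Time

	// Log each phase of the outbound request to see which one eats the
	// post-idle delay.
	trace := &httptrace.ClientTrace{
		DNSStart: func(httptrace.DNSStartInfo) { dnsStart = time.Now() },
		DNSDone: func(httptrace.DNSDoneInfo) {
			slog.Info("dns lookup", "took", time.Since(dnsStart))
		},
		ConnectStart: func(network, addr string) { connStart = time.Now() },
		ConnectDone: func(network, addr string, err error) {
			slog.Info("tcp connect", "took", time.Since(connStart))
		},
		TLSHandshakeStart: func() { tlsStart = time.Now() },
		TLSHandshakeDone: func(tls.ConnectionState, error) {
			slog.Info("tls handshake", "took", time.Since(tlsStart))
		},
		GotFirstResponseByte: func() {
			slog.Info("first response byte", "after", time.Since(start))
		},
	}

	req, err := http.NewRequest(http.MethodGet, "https://adressa.no", nil)
	if err != nil {
		slog.Error("building request", "err", err)
		return
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		slog.Error("request failed", "err", err)
		return
	}
	resp.Body.Close()
	slog.Info("total", "rtt", time.Since(start))
}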

Curious if you’re using the shared-cpu-1x 256MB setup. If so, I think we’re third-class citizens in Fly’s infra.

I’m using a shared-cpu-2x@512MB for my main application, but for the minimally reproducible example it’s a shared-cpu-1x@256MB, yes.

I don’t mind the stuff I pay next to nothing for getting a relatively low priority, but 6s for an HTTP request seems a bit drastic.

The min repro has this, which may or may not affect things. Does removing it still cause the 5s latency?

That is required for the minimally reproducible example. Outbound HTTP requests from the app are taking 5+ seconds round-trip after it has been idle. Neither app is auto-stopping or auto-starting.

This is not about the time it takes to send an HTTP request to the app.

Interesting… we’ll have to wait for the Fly devs to investigate. What happens when you put the app in “suspend”? Does it still have the 5s latency when it wakes up and calls out?

So far I haven’t been able to make it take longer than ~1.6s when waking from suspend. That’s not an alternative for the main app, though, as it has to maintain a WebSocket connection to receive events from Discord.

(The WS connection with Discord works just fine, but again, if nothing has happened for a while and the app receives an event that requires an API call to Discord, that call takes 5+ seconds.)
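If the slowdown really is idleness-related, one workaround I could try (just a sketch, not verified against this issue) is a goroutine that keeps the outbound path warm with a periodic cheap request. The 5-minute interval and the target URL below are guesses:

package main

import (
	"log/slog"
	"net/http"
	"time"
)

// keepWarm fires a cheap outbound request on a fixed interval so the machine
// and connection pool never sit fully idle. Interval and target are
// assumptions; tune against the observed ~15-minute idle threshold.
func keepWarm(interval time.Duration) {
	client := &http.Client{Timeout: 10 * time.Second}
	for range time.Tick(interval) {
		before := time.Now()
		resp, err := client.Get("https://discord.com/api/v10/gateway")
		if err != nil {
			slog.Warn("keep-warm request failed", "err", err)
			continue
		}
		resp.Body.Close()
		slog.Debug("keep-warm request done", "rtt", time.Since(before))
	}
}

func main() {
	go keepWarm(5 * time.Minute)
	select {} // stand-in for the bot's real event loop
}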

~1.5s is how long it takes adressa.no to respond. So it seems Fly is deprioritizing idle apps, and when one does get requests, the handshake doesn’t have enough compute to shake in a timely manner. I’m purely guessing, but it’s a similar problem on AWS: Reference tweet

Could definitely be something like that! Also wow, that tweet is very recent. :stuck_out_tongue:

I guess something like AWS Lambda is a bit different from something that is permanently running, though? But the underlying cause could still be something they have in common.

Anyway, I don’t know if we’ll get to the bottom of this with just the two of us, but I’ve got to say that I appreciate the responses from you :slight_smile: Thanks!

Even though your machine is permanently running, Fly does manage the resources for the shared CPUs. It’s likely doing some resource orchestration that would cause this behavior. SSL/TLS is mostly CPU-bound :person_shrugging:
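You could test the handshake-starved-for-CPU theory directly by timing the TCP connect and the TLS handshake separately with crypto/tls. A rough sketch (the host is just the one from your repro):

package main

import (
	"crypto/tls"
	"log/slog"
	"net"
	"time"
)

func main() {
	const host = "adressa.no" // same target as the minimal example

	// Time the TCP connect on its own...
	start := time.Now()
	conn, err := net.DialTimeout("tcp", host+":443", 10*time.Second)
	if err != nil {
		slog.Error("tcp connect failed", "err", err)
		return
	}
	defer conn.Close()
	connectTime := time.Since(start)

	// ...then the TLS handshake on its own. If the theory holds, the
	// connect stays fast after idle while the handshake is what spikes.
	tlsConn := tls.Client(conn, &tls.Config{ServerName: host})
	start = time.Now()
	if err := tlsConn.Handshake(); err != nil {
		slog.Error("tls handshake failed", "err", err)
		return
	}
	slog.Info("timings",
		"tcp_connect", connectTime,
		"tls_handshake", time.Since(start),
	)
}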

@myrkvi look at your app’s Grafana metrics for the TLS handshake time. It’s probably spiky during that window.

Ah, the TLS issue is known: Infra Log

I’m not sure where to look for this. It seems there are some metrics for TLS, but no data. I assume this is for inbound traffic to the app.

Go to your app’s Grafana dashboard and click on Fly Edge > TLS.

The Fly Edge dashboard is not available for that app. The only app I get is a different one, which receives inbound HTTP traffic (i.e. acts as a server).

Weird :couch_and_lamp:

Here’s mine:
[screenshot: Fly Edge > TLS panel in the app’s Grafana dashboard]

URL looks like: https://fly-metrics.net/d/fly-edge/fly-edge?from=now-3h&to=now&var-app=<app-name>&orgId=<org-id>

I think I might have explained this badly, but as previously mentioned, my app is a Discord bot. It is not listening for incoming connections, and is not acting as a server in any capacity. The application is acting as a client, and establishes a WebSocket connection with Discord’s Gateway server to receive live events. It then, as a client, sends HTTP requests to the Discord REST API.

The time measured in the minimally reproducible app is likewise for it acting as a client and performing an outbound request; it is not a measure of it responding to an incoming HTTP request. However, the example does run an HTTP server, which is only used to trigger the outbound request, so it does have the aforementioned Fly Edge dashboard.

But since my main app does not have an HTTP server, it has no data for Fly Edge.
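Since Fly Edge only sees inbound traffic, I could log handshake times client-side instead. Here’s a sketch of a wrapping RoundTripper (assuming Disgo lets me supply a custom http.Client, which I haven’t verified):

package main

import (
	"crypto/tls"
	"log/slog"
	"net/http"
	"net/http/httptrace"
	"time"
)

// tlsTimingTransport wraps another RoundTripper and logs the TLS handshake
// duration of every outbound request, providing the client-side "TLS
// analytics" that Fly Edge cannot supply for outbound traffic.
type tlsTimingTransport struct {
	base http.RoundTripper
}

func (t *tlsTimingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	var tlsStart time.Time
	var tlsDur time.Duration
	trace := &httptrace.ClientTrace{
		TLSHandshakeStart: func() { tlsStart = time.Now() },
		TLSHandshakeDone: func(tls.ConnectionState, error) {
			tlsDur = time.Since(tlsStart)
		},
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))
	resp, err := t.base.RoundTrip(req)
	// tlsDur stays zero when a pooled connection was reused.
	slog.Info("outbound request", "host", req.URL.Host, "tls_handshake", tlsDur)
	return resp, err
}

// The bot would then be handed a client like this one.
var instrumentedClient = &http.Client{
	Transport: &tlsTimingTransport{base: http.DefaultTransport},
}

func main() {
	resp, err := instrumentedClient.Get("https://adressa.no")
	if err != nil {
		slog.Error("request failed", "err", err)
		return
	}
	resp.Body.Close()
}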

Does your Go reproduction app not have TLS analytics? You should be able to see the TLS handshake time there to confirm whether that was the issue.

My minimal reproducible example does have TLS analytics, because I am serving an endpoint to be able to trigger it acting as a client. But those analytics are for connections made to the app, where it is acting as a server.

For the minimally reproducible example, it looks like this:
[screenshot: Fly Edge TLS metrics for the minimally reproducible example]

For my main app, it looks like this:
[screenshot: Fly Edge TLS metrics for the main app]

Both are for the last 2 days.

But again, this is as expected.