Hi! I have a Discord bot written in Go, using the Disgo library.
Everything on the receiving end seems to be fine, with regards to gateway events and such from Discord’s API. However, after having been idle for a while, it takes a long time to send any requests to the Discord API (~5 seconds), meaning that the bot is not able to respond to Interactions.
No settings regarding auto start/stop have been set, so it shouldn’t suspend the machine it’s running on, and the logs make no indication of it having been suspended either.
2024/08/21 23:40:49 INFO Interaction approve (application command) received user_id=<removed> guild_id=<removed> channel_id=<removed> delay=37.827499ms
2024/08/21 23:40:49 INFO Entered approvedInnerHandler
2024/08/21 23:40:54 ERROR Failed to defer message. err="10062: Unknown interaction"
The error there is the result of trying to send a HTTP request to Discord’s API. For some reason this is 5 seconds after the previous log line, which is logged immediately before attempting to send the request.
2024-08-22T17:07:03Z app[080165dc699e28] iad [info]2024/08/22 17:07:03 INFO Received PING request
2024-08-22T17:07:03Z app[080165dc699e28] iad [info]2024/08/22 17:07:03 INFO Request completed rtt=239.26639ms
2024-08-22T17:07:04Z app[080165dc699e28] iad [info]2024/08/22 17:07:04 INFO Received PING request
2024-08-22T17:07:04Z app[080165dc699e28] iad [info]2024/08/22 17:07:04 INFO Request completed rtt=238.484879ms
2024-08-22T17:07:06Z app[328715e4b56e58] iad [info]2024/08/22 17:07:06 INFO Received PING request
2024-08-22T17:07:06Z app[328715e4b56e58] iad [info]2024/08/22 17:07:06 INFO Request completed rtt=233.487626ms
2024-08-22T17:07:08Z app[080165dc699e28] iad [info]2024/08/22 17:07:08 INFO Received PING request
2024-08-22T17:07:08Z app[080165dc699e28] iad [info]2024/08/22 17:07:08 INFO Request completed rtt=238.263449ms
2024-08-22T17:07:09Z app[328715e4b56e58] iad [info]2024/08/22 17:07:09 INFO Received PING request
2024-08-22T17:07:09Z app[328715e4b56e58] iad [info]2024/08/22 17:07:09 INFO Request completed rtt=234.442036ms
2024-08-22T17:22:53Z app[080165dc699e28] iad [info]2024/08/22 17:22:53 INFO Received PING request
2024-08-22T17:23:00Z app[080165dc699e28] iad [info]2024/08/22 17:23:00 INFO Request completed rtt=6.515873367s
2024-08-22T17:23:00Z app[080165dc699e28] iad [info]2024/08/22 17:23:00 INFO Received request path=/favicon.ico
2024-08-22T17:23:08Z app[080165dc699e28] iad [info]2024/08/22 17:23:08 INFO Received PING request
2024-08-22T17:23:08Z app[080165dc699e28] iad [info]2024/08/22 17:23:08 INFO Request completed rtt=234.470866ms
2024-08-22T17:24:28Z app[080165dc699e28] iad [info]2024/08/22 17:24:28 INFO Received PING request
2024-08-22T17:24:28Z app[080165dc699e28] iad [info]2024/08/22 17:24:28 INFO Request completed rtt=234.022815ms
2024-08-22T17:24:29Z app[080165dc699e28] iad [info]2024/08/22 17:24:29 INFO Received PING request
2024-08-22T17:24:29Z app[080165dc699e28] iad [info]2024/08/22 17:24:29 INFO Request completed rtt=233.597265ms
After the application has not been interacted with for about 15 minutes, the next HTTP request takes over six seconds. Additional requests take around 235ms.
That is required for the minimally reproducible example. Outbound HTTP request from the app are taking 5+ seconds round-trip after it has been idle. Neither app is autostopping or starting.
This is not regarding the time it takes to send an HTTP request to the app.
interesting… we’ll have to wait for Fly devs to investigate. What happens when you put the app in “suspend”? Does it still have the 5s latency when it wakes up and calls out?
So far haven’t been able to make it take longer than ~1.6s when waking from suspend. Not an alternative for the main app, though, as it has to maintain a websocket connection to receive events from Discord.
(The WS connection with Discord works just fine, but again, if nothing has happened for a while and the app receives an event from it and needs to make an API call to Discord, it takes 5+ seconds.)
~1.5s is how long it takes adressa.no to respond. So it seems fly is deprioritizing idle apps and when it does get requests, the handshake doesn’t have enough compute to shake in a timely manner. I’m purely guessing, but it’s a similar problem on AWS: Reference tweet
Could definitely be something like that! Also wow, that tweet is very recent.
Guess something like AWS Lambda is a bit different compared to something that is permanently running, though? But could still be something that it has in common.
Anyway, idk if we’ll get to the bottom of this w/ just the two of us, but I got to say that I appreciate the responses from you Thanks!
Even though your machine is permanently running Fly does manage the resources for the shared cpus. It likely is doing some resource orchestration that would cause this behavior. SSL/TLS is mostly CPU bound
@myrkvi look at your app’s grafana logs for the TLS handshake time. It’s probably spikey during that time.
I think I might have explained this badly, but as previously mentioned, my app is a Discord bot. It is not listening for incoming connections, and is not acting as a server in any capacity. The application is acting as a client, and establishes a WebSocket connection with Discord’s Gateway server to receive live events. It then, as a client, sends HTTP requests to the Discord REST API.
The measure of time taken for the minimally reproducible app is also for it acting as a client and performing a request. It is not a measure of it responding to an incoming HTTP request. However, it does have an HTTP server which is used to make it perform its HTTP request, and does have the aforementioned Fly Edge dashboard.
But since my main app does not have an HTTP server, it has no data for Fly Edge.
My minimal reproducible example does have TLS analytics because I am serving an endpoint to be able to trigger it acting as a client. These are for connections made to the app where the app is acting as a server.