Outgoing request timeouts after idle time

mkorman · July 27, 2024, 11:17am

I’ve recently deployed a Go application that sends an HTTP request to an external resource each time a specific endpoint is called. I’ve noticed that after I leave the app idle for about 10 minutes, the next HTTP request always takes about 6s and then times out. Every subsequent request completes in < 1s, but when I leave the app idle again for 10 min the problem repeats itself.

Here is a minimal reproducer in Go:

main.go

package main

import (
	"github.com/gofiber/fiber/v2"
	"log/slog"
	"os"
)

func main() {
	app := fiber.New()

	app.Get("/test", func(c *fiber.Ctx) error {
		agent := fiber.Get("https://httpbin.org/get")
		statusCode, body, errs := agent.Bytes()
		if len(errs) > 0 {
			return c.Status(fiber.StatusInternalServerError).JSON(fiber.Map{
				"errs": errs,
			})
		}

		return c.Status(statusCode).Send(body)
	})

	if err := app.Listen(":8080"); err != nil {
		slog.Error("Listen() Error", "err", err)
		os.Exit(1)
	}
}

Dockerfile

FROM gcr.io/distroless/static-debian12:nonroot

WORKDIR /
USER nonroot
EXPOSE 8080

COPY --chown=nonroot:nonroot ./fly-io-bug /fly-io-bug

CMD ["/fly-io-bug"]

fly.toml

app = 'fly-io-timeout'
primary_region = 'ams'
kill_signal = 'SIGINT'
kill_timeout = '20s'

[build]

[http_service]
  internal_port = 8080
  force_https = true

[[vm]]
  memory = '256'
  cpu_kind = 'shared'
  cpus = 1

After leaving the app running for 10 minutes I do:

curl https://fly-io-timeout.fly.dev/test

And the response is

{"errs":[{"Err":"i/o timeout","Name":"httpbin.org","Server":"","IsTimeout":true,"IsTemporary":false,"IsNotFound":false}]}

I’ve deployed the same app to other cloud providers (namely AWS and Hetzner) and the problem doesn’t exist there so I’m assuming it has to do with the Fly.io network.

khuezy · July 27, 2024, 2:34pm

That sounds like the machines are auto stopping, do you see “App has excess capacity” in your logs?

mkorman · July 27, 2024, 2:41pm

I don’t think this is the case here. Auto-stop is disabled by default in the config and the machine appears to be running in the dashboard. Also the error I get is a specific application error which means the request has been accepted by the application and a timeout happens after that.

khuezy · July 27, 2024, 2:51pm

Hmm, strange, the conditions sound like it’s auto-stopping.

Yeah, it would auto resume and accept requests after the coldstart that you’re experiencing (6s).

Humor us and explicitly set the autostop/start (update your flyctl to latest version):

auto_stop_machines = "suspend"
auto_start_machines = true

mkorman · July 27, 2024, 3:45pm

With

auto_stop_machines = "suspend"
auto_start_machines = true

added the machine indeed gets suspended

2024-07-27T15:33:51.625 proxy[178190eb263108] ams [info] App fly-io-timeout has excess capacity, autosuspending machine 178190eb263108. 0 out of 1 machines left running (region=ams, process group=app)
2024-07-27T15:33:52.426 app[178190eb263108] ams [info] Virtual machine has been suspended

and automatically starts after a request is made (in about 3s), and there is no timeout when sending a request to the external resource.

However, this behavior is completely different than how it was before. With the default values (auto_stop_machines = "off") there is no log message about stopping the machine and the machine icon remains green, but when a request is made the network call still takes about 6s to complete. It looks as if the app is running but its network stack/proxy is down.

I don’t want to suspend the machine, I’d like to keep it running all the time but still get that instant response when needed.

mkorman · July 27, 2024, 3:52pm

Correction - I’ve tried a few times and managed to make the request time out when the app woke up from suspension. That is, the app was suspended, I sent a request, and after 6s got

{"errs":[{"Err":"i/o timeout","Name":"httpbin.org","Server":"","IsTimeout":true,"IsTemporary":false,"IsNotFound":false}]}

so this might be independent from the actual app state. Seems more like an internal infrastructure issue to me.

khuezy · July 27, 2024, 3:59pm

Interesting, there’s definitely something off, it shouldn’t take 3 seconds after resuming a suspend, it’s pretty quick (100-200ms). Try deploying to another region? Maybe there’s infra issues in ams.

mkorman · July 27, 2024, 4:11pm

I tried waw yesterday with the same result. If this is a region-specific issue, it happens in at least those two.

system · August 3, 2024, 4:12pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.
