Outgoing request timeouts after idle time

I’ve recently deployed a Go application that sends an HTTP request to an external resource each time a specific endpoint is called. I’ve noticed that after the app sits idle for about 10 minutes, the next HTTP request always takes about 6s and times out. Every subsequent request completes in < 1s, but when I leave the app idle for another 10 minutes the problem repeats itself.

Here is a minimal reproducer in Go:

main.go

package main

import (
	"github.com/gofiber/fiber/v2"
	"log/slog"
	"os"
)

func main() {
	app := fiber.New()

	app.Get("/test", func(c *fiber.Ctx) error {
		agent := fiber.Get("https://httpbin.org/get")
		statusCode, body, errs := agent.Bytes()
		if len(errs) > 0 {
			return c.Status(fiber.StatusInternalServerError).JSON(fiber.Map{
				"errs": errs,
			})
		}

		return c.Status(statusCode).Send(body)
	})

	if err := app.Listen(":8080"); err != nil {
		slog.Error("Listen() Error", "err", err)
		os.Exit(1)
	}
}

Dockerfile

FROM gcr.io/distroless/static-debian12:nonroot

WORKDIR /
USER nonroot
EXPOSE 8080

COPY --chown=nonroot:nonroot ./fly-io-bug /fly-io-bug

CMD ["/fly-io-bug"]

fly.toml

app = 'fly-io-timeout'
primary_region = 'ams'
kill_signal = 'SIGINT'
kill_timeout = '20s'

[build]

[http_service]
  internal_port = 8080
  force_https = true

[[vm]]
  memory = '256'
  cpu_kind = 'shared'
  cpus = 1

After leaving the app running for 10 minutes I do:

curl https://fly-io-timeout.fly.dev/test

And the response is

{"errs":[{"Err":"i/o timeout","Name":"httpbin.org","Server":"","IsTimeout":true,"IsTemporary":false,"IsNotFound":false}]}

I’ve deployed the same app to other cloud providers (namely AWS and Hetzner) and the problem doesn’t occur there, so I’m assuming it has to do with the Fly.io network.

That sounds like the machines are auto-stopping. Do you see “App has excess capacity” in your logs?

I don’t think this is the case here. Auto-stop is disabled by default in the config and the machine appears to be running in the dashboard. Also, the error I get is an application-level error, which means the request was accepted by the application and the timeout happened after that.

hmm strange, the conditions sound like it’s auto-stopping :person_shrugging:

Yeah, it would auto resume and accept requests after the coldstart that you’re experiencing (6s).

Humor us and explicitly set the autostop/start (update your flyctl to latest version):

auto_stop_machines = "suspend"
auto_start_machines = true

With

auto_stop_machines = "suspend"
auto_start_machines = true

added the machine indeed gets suspended

2024-07-27T15:33:51.625 proxy[178190eb263108] ams [info] App fly-io-timeout has excess capacity, autosuspending machine 178190eb263108. 0 out of 1 machines left running (region=ams, process group=app)
2024-07-27T15:33:52.426 app[178190eb263108] ams [info] Virtual machine has been suspended

and automatically starts after a request is made (in about 3s), and there is no timeout when sending a request to the external resource.

However, this behavior is completely different from how it was before. With the default values (auto_stop_machines = "off") there is no log message about stopping the machine and the machine icon remains green, but when a request is made the network call still takes about 6s to complete. It looks as if the app was running but its network stack/proxy was down.

I don’t want to suspend the machine, I’d like to keep it running all the time but still get that instant response when needed.

Correction - I’ve tried a few times and managed to make the request time out when the app woke up from suspension. That is, the app was suspended, I sent a request, and after 6s got

{"errs":[{"Err":"i/o timeout","Name":"httpbin.org","Server":"","IsTimeout":true,"IsTemporary":false,"IsNotFound":false}]}

so this might be independent from the actual app state. Seems more like an internal infrastructure issue to me.

Interesting, there’s definitely something off. It shouldn’t take 3 seconds to resume from a suspend; it’s normally pretty quick (100-200ms). Try deploying to another region? Maybe there are infra issues in ams.

I tried waw yesterday with the same result. If this is a region-specific issue, it happens in at least those two regions.
